fivehills opened this issue 2 years ago
I think that surprisal should be computed over "linguistic tokens" (natural words) rather than word pieces (split tokens). (https://explosion.ai/blog/spacy-transformers)
Hi - thanks again for the detailed demonstration. I see two problems here:
1) The issue of tokenization: this traces back to the model itself -- the BERT tokenizer does not always tokenize words in a desirable manner, sometimes splitting them apart into word pieces, as you already know. This cannot be fixed by minicons, since treating words such as "symbolized" as single tokens would require reconfiguring the tokenizer and re-training the model, so it has to be addressed at the model level. In short, minicons produces sub-word-level surprisals/log-probs as a consequence of design decisions made at the model level, not in minicons itself.
2) The second issue is that of actually producing outputs that merge word pieces and return per-word log probabilities -- this is a very valid issue and one that minicons may, in theory, be able to fix! It will likely require using a separate third-party tokenizer to first split sentences into words, computing alignments between the full words and the sub-words, and then combining the sub-word log-probabilities by summing them (a rough sketch of that aggregation step is below). There will definitely be a significant tradeoff with speed here, but in the long run it may not matter much. Unfortunately, I do not currently have spare cycles to fully implement this, but if you have ideas on how to go about it I would happily review a PR. I will keep this issue open until this feature is added.
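Not a solution, just a minimal sketch of what the alignment-plus-summing step could look like. It uses the fast tokenizer's `word_ids()` alignment rather than a separate third-party tokenizer; the sentence, the checkpoint name, and all surprisal values other than the two for "symbolised" are made up for illustration:

```python
from collections import defaultdict
from transformers import AutoTokenizer

# Example checkpoint; any fast (Rust-backed) tokenizer exposes word_ids().
tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

sentence = "The flag symbolised unity."
enc = tok(sentence, add_special_tokens=False)
pieces = tok.convert_ids_to_tokens(enc["input_ids"])
# pieces -> something like ['the', 'flag', 'symbol', '##ised', 'unity', '.']

# Per-piece surprisals, e.g. from token-level scoring; the two values for
# "symbolised" are the ones reported in this thread, the rest are invented.
piece_surprisals = [2.1, 8.3, 9.485310554504395, 6.920506000518799, 4.0, 1.2]

# word_ids() maps each word piece back to the index of the word it came
# from, so summing per word index gives per-word surprisals.
word_surprisals = defaultdict(float)
word_forms = {}
for piece, wid, s in zip(pieces, enc.word_ids(), piece_surprisals):
    word_surprisals[wid] += s
    word_forms[wid] = word_forms.get(wid, "") + piece.lstrip("#")

for wid in sorted(word_surprisals):
    print(word_forms[wid], round(word_surprisals[wid], 3))
# Prints something like:
#   the 2.1
#   flag 8.3
#   symbolised 16.406
#   unity 4.0
#   . 1.2
```

A real implementation would take the per-piece surprisals from the scorer itself and would need to decide how to treat punctuation and special tokens, but the aggregation itself is just a sum in log space.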
I found that some people have proposed ideas on alignment: https://www.lighttag.io/blog/sequence-labeling-with-transformers/example
However, I am not sure whether simply combining (i.e., summing) the log-probabilities of its sub-words is enough to obtain the surprisal value for a natural word.
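For reference, for an autoregressive model the summing step is not an approximation but an identity from the chain rule: if a word w is split into pieces p1 and p2, then P(w | context) = P(p1 | context) * P(p2 | context, p1), so the word surprisal is exactly the sum of the piece surprisals. The toy probabilities below are made up purely to illustrate that identity; for masked-LM (BERT-style) scoring, summing gives the usual pseudo-log-likelihood-style score rather than a true joint probability.

```python
import math

# Toy probabilities, purely for illustration (not model outputs).
p_symbol = 0.002   # P("symbol" | context)
p_ised = 0.30      # P("##ised" | context, "symbol")

word_surprisal = -math.log2(p_symbol * p_ised)                 # whole word
summed_surprisal = -math.log2(p_symbol) - math.log2(p_ised)    # sum of pieces

print(word_surprisal, summed_surprisal)
# Both are ~10.70 bits; they agree up to floating-point error.
```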
Hi,
Minicons does not seem to tokenize alphabet-based texts into natural words when "bert" pre-trained models are used. It would be desirable to generate surprisal values for word forms as they occur in real life rather than for the split forms. For example, "symbolised" is split into ('symbol', 9.485310554504395) and ('##ised', 6.920506000518799), but I want the surprisal value for the real word ("symbolised"). I am not sure how to solve this problem. The package also seems to generate incorrect surprisal values for some real words, particularly long words with a prefix or suffix, because such a word is split into several units.
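For context, output of this kind can be produced with a call along these lines (this follows my recollection of the minicons scorer API from its README; the exact method and argument names are an assumption and may differ across versions):

```python
from minicons import scorer

# Assumed usage based on the minicons README; names may differ by version.
bert = scorer.MaskedLMScorer("bert-base-uncased", "cpu")
print(bert.token_score(["The flag symbolised unity."], surprisal=True))
# -> per-piece (token, surprisal) pairs such as ('symbol', 9.48...), ('##ised', 6.92...)
```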
Many thanks!