kanishkamisra / minicons

Utility for behavioral and representational analyses of Language Models
https://minicons.kanishka.website
MIT License

incorrect tokenizers #20

Open fivehills opened 1 year ago

fivehills commented 1 year ago

Hi,

Minicons does not seem to tokenize alphabet-based texts into whole words when "bert" pre-trained models are used. It would be desirable to generate surprisal values for word forms as they occur in real text rather than for their split forms. For example, "symbolised" is split into ('symbol', 9.485310554504395), ('##ised', 6.920506000518799). However, I want to get the surprisal value for the real word ("symbolised"). I am not sure how to solve this problem. The package also seems to generate incorrect surprisal values for some real words, particularly long words with a prefix or suffix, because such words are split into several units.

Many thanks!


```python
In [13]: model = scorer.MaskedLMScorer('bert-base-multilingual-cased', 'cpu')

In [14]: ge_sen=["Janus symbolisierte häufig Veränderungen und Übergänge, wie de
    ...: n Wechsel von einer Bedingung zur anderen, von einer Perspektive zur an
    ...: deren und das Heranwachsen junger Menschen zum Erwachsenenalter."]

In [15]: model.token_score(ge_sen, surprisal = True, base_two = True)
Out[15]: 
[[('Jan', 7.411351680755615),
  ('##us', 6.953413963317871),
  ('symbol', 8.663262367248535),
  ('##isierte', 8.227853775024414),
  ('häufig', 9.369148254394531),
  ('Veränderungen', 4.863248348236084),
  ('und', 3.3478829860687256),
  ('Über', 3.3023200035095215),
  ('##gänge', 0.40428581833839417),
  (',', 0.048578906804323196),
  ('wie', 1.878091812133789),
  ('den', 5.769808769226074),
  ('Wechsel', 3.2879366874694824),
  ('von', 0.016336975619196892),
  ('einer', 0.016496576368808746),
  ('Bed', 0.0244187843054533),
  ('##ingu', 0.09460146725177765),
  ('##ng', 0.018612651154398918),
  ('zur', 0.9586092829704285),
  ('anderen', 1.2600054740905762),
  (',', 0.3100062906742096),
  ('von', 0.013392632827162743),
  ('einer', 0.025651555508375168),
  ('Pers', 0.007922208867967129),
  ('##pektive', 0.03971010446548462),
  ('zur', 0.8729674220085144),
  ('anderen', 1.6451447010040283),
  ('und', 2.9337639808654785),
  ('das', 0.1244136244058609),
  ('Hera', 1.1853374242782593),
  ('##n', 1.9540393352508545),
  ('##wachsen', 0.006810512859374285),
  ('junge', 2.0289151668548584),
  ('##r', 0.007776367478072643),
  ('Menschen', 3.1449434757232666),
  ('zum', 5.088050365447998),
  ('Er', 0.001235523377545178),
  ('##wachsenen', 0.01289732288569212),
  ('##alter', 0.12524327635765076),
  ('.', 0.02648257650434971)]]

In [16]: en_sen=["Janus often symbolised changes and transitions, such as moving
    ...:  from one condition to another, from one perspective to another, and yo
    ...: ung people growing into adulthood."]

In [17]: en_sen
Out[17]: ['Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood.']

In [18]: model.token_score(en_sen, surprisal = True, base_two = True)
Out[18]: 
[[('Jan', 7.161930084228516),
  ('##us', 4.905619144439697),
  ('often', 5.8594160079956055),
  ('symbol', 9.485310554504395),
  ('##ised', 6.920506000518799),
  ('changes', 4.574926853179932),
  ('and', 3.2199747562408447),
  ('transition', 5.44439697265625),
  ('##s', 0.018392512574791908),
  (',', 0.02080027014017105),
  ('such', 0.04780016839504242),
  ('as', 0.013945729471743107),
  ('moving', 7.4285569190979),
  ('from', 0.008073553442955017),
  ('one', 0.2561193108558655),
  ('condition', 18.707305908203125),
  ('to', 0.014606142416596413),
  ('another', 0.8214359283447266),
  (',', 0.7367089986801147),
  ('from', 0.06036728620529175),
  ('one', 0.4734668731689453),
  ('perspective', 13.356915473937988),
  ('to', 0.06987723708152771),
  ('another', 0.7075008749961853),
  (',', 0.08287912607192993),
  ('and', 2.1124203205108643),
  ('young', 6.065392017364502),
  ('people', 3.042752742767334),
  ('growing', 4.334306716918945),
  ('into', 4.379203796386719),
  ('adult', 1.3680847883224487),
  ('##hood', 0.2171218991279602),
  ('.', 0.06372988969087601)]]

In [19]: sp_sen=["Jano suele simbolizar los cambios y las transiciones, como el 
    ...: paso de una condición a otra, de una perspectiva a otra, y el crecimien
    ...: to de los jóvenes hacia la edad adulta."]

In [20]: model.token_score(sp_sen, surprisal = True, base_two = True)
Out[20]: 
[[('Jan', 11.449429512023926),
  ('##o', 7.180861949920654),
  ('suele', 7.2584357261657715),
  ('simbol', 4.928884983062744),
  ('##izar', 0.018150361254811287),
  ('los', 0.03109721466898918),
  ('cambios', 3.5657286643981934),
  ('y', 6.550257682800293),
  ('las', 0.04733512923121452),
  ('trans', 3.946718454360962),
  ('##iciones', 0.35458695888519287),
  (',', 0.0718887448310852),
  ('como', 0.6874077916145325),
  ('el', 0.009603511542081833),
  ('paso', 5.542746067047119),
  ('de', 0.015120714902877808),
  ('una', 0.010806013830006123),
  ('condición', 15.124244689941406),
  ('a', 0.03922305256128311),
  ('otra', 0.5000980496406555),
  (',', 0.5917675495147705),
  ('de', 0.05246984213590622),
  ('una', 0.015467431396245956),
  ('perspectiva', 11.579629898071289),
  ('a', 0.05401906371116638),
  ('otra', 0.39069506525993347),
  (',', 0.024377508088946342),
  ('y', 1.9166929721832275),
  ('el', 0.006273927167057991),
  ('crecimiento', 6.725331783294678),
  ('de', 0.011221524327993393),
  ('los', 0.7993561029434204),
  ('jóvenes', 4.965604305267334),
  ('hacia', 3.6372487545013428),
  ('la', 0.27643802762031555),
  ('edad', 0.262629896402359),
  ('adulta', 0.3033374845981598),
  ('.', 0.03442404791712761)]]

In [21]: ru_sen=["Янус часто символизировал изменения и переходы, такие как пере
    ...: ход от одного состояния к другому, от одной перспективы к другой, а так
    ...: же молодых людей, вступающих во взрослую жизнь."]

In [22]: model.token_score(ru_sen, surprisal = True, base_two = True)
Out[22]: 
[[('Ян', 7.062388896942139),
  ('##ус', 7.699002742767334),
  ('часто', 10.491772651672363),
  ('символ', 1.846983551979065),
  ('##из', 0.5921100974082947),
  ('##ировал', 7.98089599609375),
  ('изменения', 9.341201782226562),
  ('и', 1.0752657651901245),
  ('пер', 0.0009851165814325213),
  ('##еход', 0.024955371394753456),
  ('##ы', 0.4438115358352661),
  (',', 0.036848314106464386),
  ('такие', 1.1838680505752563),
  ('как', 0.00436423160135746),
  ('пер', 0.006749975029379129),
  ('##еход', 0.0007649788167327642),
  ('от', 0.038956135511398315),
  ('одного', 0.11314807087182999),
  ('состояния', 9.765267372131348),
  ('к', 0.005316327791661024),
  ('другому', 1.0975052118301392),
  (',', 0.7671073079109192),
  ('от', 0.021667061373591423),
  ('одной', 1.209750771522522),
  ('пер', 0.016496576368808746),
  ('##спект', 0.001849157502874732),
  ('##ивы', 0.4393042325973511),
  ('к', 0.001108944183215499),
  ('другой', 1.3383814096450806),
  (',', 0.024982888251543045),
  ('а', 0.026218410581350327),
  ('также', 1.0678788423538208),
  ('молодых', 8.310323715209961),
  ('людей', 0.6578267812728882),
  (',', 0.008406511507928371),
  ('в', 0.719817578792572),
  ('##ступ', 0.6718330383300781),
  ('##ающих', 0.12413019686937332),
  ('во', 0.23966126143932343),
  ('в', 0.04291310906410217),
  ('##з', 0.00735535379499197),
  ('##рос', 0.2500462532043457),
  ('##лу', 0.015621528029441833),
  ('##ю', 0.00034396530827507377),
  ('жизнь', 3.4308295249938965),
  ('.', 0.15315811336040497)]]
```

fivehills commented 1 year ago

I think that surprisal should be computed over "linguistic tokens" (natural words) rather than wordpieces (split tokens). (https://explosion.ai/blog/spacy-transformers)

kanishkamisra commented 1 year ago

Hi - thanks again for the detailed demonstration. I see two problems here:

1) The issue of tokenization: this traces back to the model itself -- the BERT tokenizer does not always tokenize words in a desirable manner, sometimes splitting them into word pieces, as you already know. Minicons cannot fix this, since treating tokens such as "symbolised" as single tokens would require reconfiguring the tokenizer and re-training the model, so it would have to be done at the model level. In summary, minicons produces sub-word level surprisals/log-probs as a consequence of design decisions made at the model level, not in minicons itself.

2) The second issue is producing outputs that merge word pieces and return per-word log probabilities -- this is a very valid request and one that minicons may be able to address, in theory! It will likely require using a separate third-party tokenizer to first split sentences into words, computing alignments between the full words and the sub-words, and then combining the sub-word log-probabilities by summing them. There will definitely be a tradeoff with speed here, but in the long run it might not matter much. Unfortunately, I currently do not have spare cycles to implement this fully, but if you have ideas on how to go about it I would happily review a PR. I will keep this issue open until the feature is added.
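A rough sketch of what such merging could look like, assuming the simple rule that any token carrying the WordPiece continuation marker "##" belongs to the preceding word (the helper `merge_wordpieces` below is purely illustrative and not part of minicons):

```python
from minicons import scorer

def merge_wordpieces(token_scores):
    """Collapse (wordpiece, surprisal) pairs into (word, surprisal) pairs."""
    merged = []
    for token, value in token_scores:
        if token.startswith("##") and merged:
            prev_token, prev_value = merged[-1]
            # Log-probabilities sum, so surprisals (in bits) sum as well.
            merged[-1] = (prev_token + token[2:], prev_value + value)
        else:
            merged.append((token, value))
    return merged

model = scorer.MaskedLMScorer('bert-base-multilingual-cased', 'cpu')
sent = ["Janus often symbolised changes and transitions."]
pieces = model.token_score(sent, surprisal=True, base_two=True)[0]
print(merge_wordpieces(pieces))
# ('symbol', ...) and ('##ised', ...) come out as a single ('symbolised', ...) entry.
```

Punctuation would still surface as separate entries, and a more careful version would align against a proper word tokenizer rather than relying on the "##" marker alone.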

fivehills commented 1 year ago

I found that some people have proposed ideas on alignment: https://www.lighttag.io/blog/sequence-labeling-with-transformers/example

However, I am not sure whether simply combining (i.e., summing up) the log-probabilities of its split sub-words is enough to calculate a surprisal value for a natural word.
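For concreteness, this is the arithmetic that summing implies for "symbolised", using the two pieces from the English output above (and assuming MaskedLMScorer follows the usual pseudo-log-likelihood setup):

```python
# Sub-word surprisals for "symbolised" from Out[18] above.
symbol, ised = 9.485310554504395, 6.920506000518799
print(symbol + ised)  # ~16.41 bits for the whole word
# Under a left-to-right (chain-rule) factorisation this sum would be an exact
# joint surprisal; with a masked LM each piece is scored with only that piece
# masked, so the sum is a pseudo-log-likelihood approximation instead.
```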

fivehills commented 1 year ago

Or this: https://www.mrklie.com/post/2020-09-26-pretokenized-bert/
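A sketch of the alignment step those two posts describe, using the word_ids() mapping exposed by HuggingFace fast tokenizers (the word list here is just an example):

```python
from collections import defaultdict
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
words = ["Janus", "often", "symbolised", "changes", "and", "transitions", "."]
enc = tok(words, is_split_into_words=True)

# word_ids() maps every sub-word position to the index of the word it came
# from (None for special tokens such as [CLS] and [SEP]).
groups = defaultdict(list)
for word_id, piece in zip(enc.word_ids(), tok.convert_ids_to_tokens(enc["input_ids"])):
    if word_id is not None:
        groups[word_id].append(piece)

for word_id, pieces in groups.items():
    print(words[word_id], pieces)
# e.g. symbolised ['symbol', '##ised'] -- the surprisals at these positions
# would then be summed to give one value per word.
```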