huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
9.07k stars, 807 forks

UnigramTrainer: byte_fallback is false. #1515

Open Moddus opened 7 months ago

Moddus commented 7 months ago

Hi,

I'm using tokenizers version 0.19.1 and would like to train a Unigram tokenizer with byte_fallback enabled. Inspired by the unit test at https://github.com/huggingface/tokenizers/blob/main/bindings/python/tests/bindings/test_tokenizer.py#L434-L458, I created the following snippet:

import tokenizers
print(tokenizers.__version__)

from tokenizers.models import Unigram
from tokenizers import Tokenizer
from tokenizers.trainers import UnigramTrainer

vocab = [
    ("<unk>", 0.0),
    ("A", -0.01),
    ("sen", -0.02),
    ("te", -0.03),
    ("n", -0.04),
    ("ce", -0.05),
    ("<0xF0>", -0.06),
    ("<0x9F>", -0.06),
    ("<0xA4>", -0.06),
    ("<0x97>", -0.06),
    (" ", -0.4),
]
tokenizer = Tokenizer(Unigram(vocab, 0, byte_fallback=True))
trainer = UnigramTrainer(
    vocab_size=500,
    show_progress=True,
    special_tokens=["<s>", "</s>"],
    shrinking_factor=0.75,
    n_sub_iterations=2,
    max_piece_length=16,
    unk_token="<unk>",
    initial_alphabet=[],
)
tokenizer.train(files=["wiki_temp_data.txt"], trainer=trainer)
print(bytes(tokenizer.__getstate__()).decode('utf-8'))

which outputs the following:

0.19.1

{"version":"1.0","truncation":null,"padding":null,"added_tokens":[{"id":1,"content":"<s>","single_word":false,"lstrip":false,"rstrip":false,"normalized":false,"special":true},{"id":2,"content":"</s>","single_word":false,"lstrip":false,"rstrip":false,"normalized":false,"special":true}],"normalizer":null,"pre_tokenizer":null,"post_processor":null,"decoder":null,"model":{"type":"Unigram","unk_id":0,"vocab":[["<unk>",0.0],["<s>",0.0],["</s>",0.0],["\n",-3.5618887635943715],["o",-4.359534790533437],["c",-4.472372923155318],["l",-4.520273303341431],[".[",-4.6497854633966424],["u",-4.655008646705415],["a",-4.753463225483592],["t",-4.785768056441162],["s",-4.91371699444318],["re",-5.052257954891834],[", ",-5.148747109930639],["r",-5.155704136316791],["i",-5.2050088758473905],["5",-5.223325746599272],["d",-5.229627759933149],["machine learning",-5.234119673446196],[".",-5.28558347164545],["s ",-5.342809325968186],["m",-5.343139196351894],["or",-5.394192122750379],["3",-5.4046100154258845],["]",-5.410195605303625],["n",-5.4117339531013835],["the",-5.417127353177755],["B",-5.442617756876057],["ase",-5.451888821741718],["at",-5.4520247709192216],["ame",-5.460012431697334],[", and ",-5.465306179230262],["iv",-5.482661950016668],["pri",-5.488053506291787],["AWS",-5.503087712341066],["ul",-5.517158750532199],[" be",-5.526448341784843],["y",-5.530416529513077],["with ",-5.538365309973069],[", the company ",-5.567545716629353],["ment",-5.572913267208495],["er",-5.581586468703334],[" ",-5.582553606969525],[" on",-5.59770662724662],["g",-5.598627123101526],["ce",-5.648463699218988],["In ",-5.64976857181228],["the ",-5.652134450306691],[" m",-5.671611007098734],["its ",-5.69356540303516],["1",-5.699090152036743],[" billion ",-5.700685670347403],["le",-5.707536801549761],["e ",-5.744452677925564],["an",-5.747405527148935],["company 
a",-5.76710957286479],["mo",-5.77356594660073],["ne",-5.803388407856643],["In",-5.805229120183113],["in",-5.805624269889047],["p",-5.831409230907858],["illion ",-5.839912322890013],["S",-5.90423330792842],["d the ",-5.916101270150645],[" w",-5.92116893774722],["202",-5.923567684009055],["C",-5.927135148746793],["Hugging Face",-5.935188848047756],["w",-5.938182265910647],["D",-5.938477728620619],["M",-5.940192343684163],["0",-5.942597069385749],["brary built for ",-5.942617756836182],["-",-5.942617756836274],["Amazon",-5.94261775683631],["8",-5.942617756875978],["in New York City",-5.942617756876052],["\"",-5.942617756876125],["7",-5.942617756876125],[" language model",-5.942617760088956],[" funding round",-5.94261776132468],["The company was ",-5.942617763227],["2]",-5.942617785392672],["qu",-5.942617800603892],["esearch ",-5.942617828206498],[" raised ",-5.9426179931324565],["valuation.",-5.942618609748887],["fter ",-5.942621861434949],[" model",-5.942624959467915],["ral ",-5.942674485456473],["$4",-5.9428015717680225],[".[1] ",-5.9429617762159435],["ers to ",-5.9436013683314375],["ary ",-5.943750765812334],["][1",-5.944040904334429],[" company ",-5.945560856163892],["vi",-5.945681286950483],["sfor",-5.945855768978861],["part",-5.947119378352195],["eb",-5.948201549546719],[" ent",-5.948756047930036],[" Hugging Face",-5.950100675216206],["arge",-5.960532706696833],["ing ",-5.960738564204951],["atu",-5.965451418487716],["om",-5.970455719830765],[", a",-5.971704068272359],["s and ",-5.972571652098296],["announced ",-5.973061219889968],["hip ",-5.977549226823541],["ted ",-5.982023156137006],["ic",-5.986062235814846],["ien",-6.001519353842799],[" r",-6.007874241060973],["te",-6.03125923222429],["li",-6.050271236012039],[", In",-6.053498027366283],["that ",-6.059541800743297],["ation ",-6.064270974384604],[" is ",-6.072117612833478],["building ",-6.074556093837754],[" P",-6.074556878049191],["nnounced ",-6.079809447843454],["orkshop ",-6.085753386155262],["applications 
",-6.088689728081486],["as",-6.123484361241108],["for ",-6.124833605568351],[" French",-6.12679637196849],["pre",-6.128901461700658],["at ",-6.129105738019865],["di",-6.129436049470789],["ed ",-6.133899267964393],["il",-6.136013226375961],["open sourc",-6.167592170576895],["2021, ",-6.2230497026795275],[" De",-6.225284352655487],["um",-6.22719708120699],["in a Series ",-6.247499674069367],["on",-6.2535011177649436],["6 b",-6.260055973996533],[", 202",-6.279707971380914],["ch",-6.280553730569196],["pro",-6.288668186959201],["h",-6.316245777016983],["un",-6.316638514949789],["ces",-6.321592328602628],["b",-6.327624545643838],[" 2022, the ",-6.337989754658159],["e, ",-6.348602860978678],["2, the company a",-6.356440552853303],["fo",-6.357114653767249],["t ",-6.3579674589715935],["3, ",-6.361514488346833],["ed a",-6.366018590176099],["to",-6.375869392966589],[" pro",-6.386023882624574],["6",-6.402580435368394],["achine learning.",-6.412867404592184],["chatbot",-6.432056036150724],["August ",-6.451230373098852],["ill",-6.488279557506923],["to ",-6.4957862665090005],["ration ",-6.500766268508913],["ol",-6.528097243910025],["O",-6.531293662690236],[" c",-6.544558595693722],["the company ",-6.548420235235166],[", an",-6.552003978461224],["20",-6.5714460600809925],[" W",-6.57319719236091],["y ",-6.574144743303847],[" with ",-6.615445138804277],["th",-6.6162815057783035],["ig",-6.672043110614677],["ia",-6.704591676241498],["s a",-6.72076842509261],[", I",-6.7235753182870495],["allow",-6.730694223582763],["of BLOOM",-6.772198447217853],["notable ",-6.773869825243883],["en",-6.778288569020939],["their ",-6.800430911695209],[" G",-6.802055932562853],["rs",-6.8231056449791145],["ers",-6.824409702671997],["f",-6.832724908701927],["va",-6.833887956631011],["ing",-6.8461656305816],["nc",-6.8564613891343456],["sing ",-6.862714167046888],["ual",-6.870392635028106],["e",-6.874106905398646],["ra",-6.877618004485145],["ha",-6.880517657124372],["2022, the ",-6.880641613459034],["se 
",-6.88206029830566],["et",-6.891310527153034],["ts a",-6.893115500775542],["hi",-6.897448626895353],["$2",-6.899411433346739],["rge",-6.903452770436075],["Ma",-6.909472441232638],["A",-6.910508982881438],["la",-6.912738779604821],[" Hub",-6.917091307959003],[" applications",-6.924973808675421],["ed by ",-6.929914746108062],[" I",-6.930349239598647],[" The ",-6.935680206033779],["custom",-6.936174270128891],["that develop",-6.936238436643946],["par",-6.937263957185268],["develop",-6.938138497515938],[" led by ",-6.938739492904418],["platform ",-6.938946197143462],["][",-6.938996393776529],["funding",-6.939247240471008],["work",-6.939408735404648],["langu",-6.94142090155035],["4",-6.942149067883904],[" f",-6.942426760847154],["Th",-6.9424990734650365],[" 2",-6.942565176177402],["H",-6.942596832809822],[" T",-6.942605352612952],[" an open ",-6.942617752886388],["x",-6.94261775683635],["j",-6.94261775683635],[")",-6.94261775683635],["(",-6.94261775683635],["'",-6.942617756836539],["J",-6.942617756875563],["é",-6.942617756875563],["R",-6.942617756875751],["9",-6.942617756875937],["U",-6.942617756876125],["Q",-6.942617756876125],["N",-6.942617756876125],["ion of ",-6.942617757554927],["ugging ",-6.942618105571962],["Series ",-6.94261958890791],["] The company ",-6.942628961683146],[" F",-6.942653343811279],[" app",-6.94266383237332],["ks",-6.942700948240126],[" Ser",-6.942739315731812],["mers ",-6.942795537104784],["rch ",-6.942826671123916],["ed in ",-6.942835263051036],["able ",-6.943206984157975],["o s",-6.943328294124164],["On A",-6.943572252037589],["rou",-6.943673679272364],[" C",-6.943814459897256],["language ",-6.943817108739665],["eve",-6.9452628125793545],["s to ",-6.945644320637266],["s us",-6.945921594624043],[" funding",-6.946000345996118],[" platform ",-6.946301575031875],["m a",-6.947043496984786],["all",-6.947911090626805],["ts S",-6.948326154283556],[" custom",-6.9491454063910165],["its p",-6.949344864784999],[". 
The ",-6.949730955894585],["cus",-6.952726303447235],[", A",-6.952844319402885],[" that develop",-6.953556340944786],["lab",-6.954929403855246],[" that ",-6.955355393158771],[" was ",-6.95610830182569],["se the",-6.958201273083629],["led by ",-6.9594534383841395],["s for ",-6.959871560393215],["ded ",-6.95999569518688],[" t",-6.960368331008011],["bl",-6.960432445701965],[" for ",-6.961924120505193],["hin",-6.961949986947003],["ing a",-6.963534193375732],["Am",-6.964258493398582],["Se",-6.9668911986350945],["cem",-6.966928326002063],["1]",-6.967544992521077],["e Hub",-6.968758015087302],["y a",-6.971433123917773],[" applications ",-6.971600826180069],["com",-6.9733430729904615],["pp",-6.9736709095928],["ace",-6.97373535481078],["ise",-6.973756594655904],["Sa",-6.976268488813586],["ip",-6.979926234084208],["S ",-6.979993624320898],["sh",-6.980794457976515],["s C",-6.981885295520755],["ers.[",-6.983624891013555],["f ",-6.987282727412509],["und",-6.988964943279061],["sho",-6.988969488078771],[".[1",-6.988993185931962],["ub",-6.995597726015328],["ry",-6.999509416842209],["d a",-7.014614036175976],["ru",-7.016549016354882],["al la",-7.0208460366466845],["de",-7.021052333128501],["rai",-7.027232359454722],["ed the ",-7.027715894922059],[" e",-7.03182377859982],["age",-7.032093889404633],["lic",-7.033254059925991],["ur",-7.037581995768221],["ai",-7.041887600575064],["pl",-7.0546079387230645],["ls",-7.0558322435575045],["] ",-7.060207289571837],["n Ma",-7.0651558992287455],["lu",-7.069375277346813],[" wo",-7.075971379825189],["ni",-7.086318785386894],["oc",-7.100830640759104],["G",-7.103254910735055],["tr",-7.104249368940073],["ing m",-7.10570202038382],[" their ",-7.108717347268438],["it",-7.1352966726046745],["co",-7.136916541617866],[" notable ",-7.141166741089615],["oo",-7.143206475553617],[" of BLOOM",-7.143482420058712],["ce ",-7.1526528451491105],["and ",-7.174441401028649],[" 
allow",-7.203836432853912],["ir",-7.206199733901988],["us",-7.218561318537739],["ation",-7.22574208913863],["ar",-7.237026026415393],["st",-7.299761686771174],[" $2",-7.318878555687618],["W",-7.318959962447763],[" co",-7.332833486790859],["sto",-7.348712153867819],["gin",-7.374823313242877],["be",-7.421538654299918],["ene",-7.514912142305127],[" and ",-7.516897737049053],[" Wo",-7.581172621326533],["On ",-7.592055788002298],[", an ",-7.6120260522472165],["ver",-7.785923481230139],["nd",-7.80414028955236],["n August ",-7.822992628393436],[" chatbot",-7.8875106061030245],[" comp",-7.96633485775869],["ed a ",-7.997733717237862],["P",-8.113808575924756],["F",-8.113908575924755],["T",-8.114008575924757],["$",-8.114108575924757],["v",-8.114208575924756],["z",-8.114308575924756],["L",-8.114408575924756],["Y",-8.114508575924756],["I",-8.114608575924755],[",",-8.114708575924755],["q",-8.114808575924757],["k",-8.114908575924757],["[",-8.115008575924756],["2",-8.115108575924756],["announce",-8.115108575924756]],"byte_fallback":false}}

Note the "byte_fallback":false at the end of the model section, even though the model was created with byte_fallback=True. To check whether byte fallback actually happens:

output = tokenizer.encode("河")
print(output.ids)
print(output.tokens)

[0]  # no byte fallback
['河']
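
For reference, here is a minimal sketch in plain Python (no tokenizers calls) of which byte pieces byte fallback would need to find in the vocabulary for a given input. It assumes the uppercase-hex `<0xNN>` piece naming used in the hand-written vocab above; the helper names `byte_fallback_pieces` and `missing_byte_pieces` are hypothetical and not part of the library.

```python
from typing import Iterable, List


def byte_fallback_pieces(text: str) -> List[str]:
    """Return the <0xNN> pieces that spell out the UTF-8 bytes of `text`."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def missing_byte_pieces(text: str, vocab_pieces: Iterable[str]) -> List[str]:
    """Return the byte pieces needed for `text` that are absent from the vocab."""
    known = set(vocab_pieces)
    return [p for p in byte_fallback_pieces(text) if p not in known]


if __name__ == "__main__":
    # The character from the check above needs three byte pieces.
    print(byte_fallback_pieces("河"))  # ['<0xE6>', '<0xB2>', '<0xB3>']

    # The hand-written vocab only covers the four bytes of the emoji
    # used in the unit test, so all three pieces are missing here.
    handwritten = ["<0xF0>", "<0x9F>", "<0xA4>", "<0x97>"]
    print(missing_byte_pieces("河", handwritten))
```

None of these pieces appear to be in the trained vocabulary printed above, so "河" would presumably still map to `<unk>` even if the flag alone were flipped to true.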

I think the cause is that the UnigramTrainer, in do_train, defaults byte_fallback to false in:

Does this analysis make sense?

Best

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 5 months ago

Yes, your analysis makes sense! TL;DR: support for byte_fallback in training was never added, and we should add it!
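
Until the trainer supports this, one possible workaround is to patch the serialized tokenizer after training. The sketch below is an assumption, not the maintainers' fix: it operates only on the JSON layout shown in the output above (`model.type`, `model.vocab`, `model.byte_fallback`), appends any missing `<0xNN>` pieces, and sets the flag. The function name `enable_byte_fallback` and the default score of -20.0 are hypothetical choices for illustration.

```python
import json


def enable_byte_fallback(tokenizer_json: str, byte_score: float = -20.0) -> str:
    """Return tokenizer JSON with all 256 byte pieces present and byte_fallback on.

    Assumes the serialized layout shown in this issue: a "model" object of
    type "Unigram" whose "vocab" is a list of [piece, score] pairs.
    """
    state = json.loads(tokenizer_json)
    model = state["model"]
    if model.get("type") != "Unigram":
        raise ValueError("expected a Unigram model")

    existing = {piece for piece, _score in model["vocab"]}
    for b in range(256):
        piece = f"<0x{b:02X}>"
        if piece not in existing:
            # Appended at the end, so existing ids (and unk_id) are unchanged.
            model["vocab"].append([piece, byte_score])

    model["byte_fallback"] = True
    return json.dumps(state, ensure_ascii=False)


if __name__ == "__main__":
    # Tiny stand-in for the trained tokenizer's JSON.
    demo = json.dumps({
        "model": {
            "type": "Unigram",
            "unk_id": 0,
            "vocab": [["<unk>", 0.0], ["a", -1.0]],
            "byte_fallback": False,
        }
    })
    patched = json.loads(enable_byte_fallback(demo))
    print(len(patched["model"]["vocab"]), patched["model"]["byte_fallback"])
```

With the real tokenizer, the patched string could then be loaded back with `Tokenizer.from_str(enable_byte_fallback(tokenizer.to_str()))`. Whether a fixed low score for the byte pieces is a good choice is untested here, so check the encodings on your own data.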
