uhermjakob / utoken

universal tokenizer
BSD 3-Clause "New" or "Revised" License

Can we have a list of lang codes? #6

Open tomersagi opened 1 year ago

tomersagi commented 1 year ago

Hi, great job with this tokenizer. Could we have a list of language codes? It is not clear which ISO standard the codes follow, particularly for languages that have both a modern and an ancient version. Thanks.

jcuenod commented 1 year ago

The CLI instructions include:

--lc LANGUAGE-CODE ISO 639-3
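For the modern-versus-ancient question, ISO 639-3 generally assigns separate three-letter codes to each variety. Below is a minimal Python sketch of that idea. The table is a small illustrative subset of ISO 639-3, not a list taken from the utoken documentation, and `lc_args` is a hypothetical helper rather than part of utoken; whether utoken has language-specific resources for any given code is not something this thread confirms.

```python
# Illustrative subset of ISO 639-3 codes (assumption: not from the utoken docs).
ISO_639_3 = {
    "Modern Greek": "ell",
    "Ancient Greek": "grc",
    "Hebrew": "heb",
    "Ancient Hebrew": "hbo",
    "English": "eng",
    "Old English": "ang",
}


def lc_args(language):
    """Hypothetical helper: return the --lc flag and ISO 639-3 code as an argument list."""
    code = ISO_639_3[language]  # raises KeyError for names outside this small table
    return ["--lc", code]


if __name__ == "__main__":
    for name in ("Modern Greek", "Ancient Greek"):
        print(name, "->", " ".join(lc_args(name)))
```

The returned list can be appended to whatever command line invokes the tokenizer. The full code list is maintained by the ISO 639-3 registration authority.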