gioelecrispo / chunkipy

chunkipy is an extremely useful tool for segmenting long texts into smaller chunks, based on either a character or token count. With customizable chunk sizes and splitting strategies, chunkipy provides flexibility and control for various text processing tasks.
MIT License
33 stars 0 forks

Language short codes differ between `stanza` and `lang_detect`, causing issues in the library for some languages #2

Open omkarpat opened 10 months ago

omkarpat commented 10 months ago

Here are the short codes for both libraries:

lang_detect_languages = ['af','ar','bg','bn','ca','cs','cy','da','de','el','en','es','et','fa','fi','fr','gu','he','hi','hr','hu','id','it','ja','kn','ko','lt','lv','mk','ml','mr','ne','nl','no','pa','pl','pt','ro','ru','sk','sl','so','sq','sv','sw','ta','te','th','tl','tr','uk','ur','vi','zh-cn','zh-tw']

stanza_languages = ['af','ar','be','bg','bxr','ca','cop','cs','cu','da','de','el','en','es','et','eu','fa','fi','fr','fro','ga','gd','gl','got','grc','he','hi','hr','hsb','hu','hy','id','it','ja','kk','kmr','ko','la','lt','lv','lzh','mr','mt','nl','nn','no','olo','orv','pl','pt','ro','ru','sk','sl','sme','sr','sv','swl','ta','te','tr','ug','uk','ur','vi','wo','zh-hans','zh-hant']

Maybe we can write a mapping between the two lists or, for the short term, just consider the overlapping languages. Here's the set of overlapping languages:

{'en', 'ur', 'te', 'he', 'ja', 'uk', 'ta', 'ko', 'pl', 'it', 'vi', 'cs', 'hr', 'lv', 'fi', 'ru', 'hu', 'hi', 'da', 'fr', 'af', 'tr', 'ca', 'sl', 'de', 'fa', 'mr', 'ro', 'el', 'ar', 'pt', 'sv', 'et', 'id', 'nl', 'sk', 'es', 'bg', 'no', 'lt'}
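A rough sketch of what such a mapping could look like, using only the two lists quoted in this issue. The names `LANGDETECT_TO_STANZA_OVERRIDES` and `to_stanza_code` are hypothetical and not part of chunkipy, and the two Chinese equivalences (`zh-cn` to `zh-hans`, `zh-tw` to `zh-hant`) are an assumption based on simplified vs. traditional script, so they should be checked before being relied on.

```python
# Codes copied from the two lists in this issue.
LANGDETECT_CODES = [
    'af', 'ar', 'bg', 'bn', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es',
    'et', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'id', 'it', 'ja',
    'kn', 'ko', 'lt', 'lv', 'mk', 'ml', 'mr', 'ne', 'nl', 'no', 'pa', 'pl',
    'pt', 'ro', 'ru', 'sk', 'sl', 'so', 'sq', 'sv', 'sw', 'ta', 'te', 'th',
    'tl', 'tr', 'uk', 'ur', 'vi', 'zh-cn', 'zh-tw',
]
STANZA_CODES = [
    'af', 'ar', 'be', 'bg', 'bxr', 'ca', 'cop', 'cs', 'cu', 'da', 'de', 'el',
    'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fro', 'ga', 'gd', 'gl', 'got',
    'grc', 'he', 'hi', 'hr', 'hsb', 'hu', 'hy', 'id', 'it', 'ja', 'kk', 'kmr',
    'ko', 'la', 'lt', 'lv', 'lzh', 'mr', 'mt', 'nl', 'nn', 'no', 'olo', 'orv',
    'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sme', 'sr', 'sv', 'swl', 'ta', 'te',
    'tr', 'ug', 'uk', 'ur', 'vi', 'wo', 'zh-hans', 'zh-hant',
]
STANZA_SET = set(STANZA_CODES)

# Codes that are spelled identically in both lists.
overlap = set(LANGDETECT_CODES) & STANZA_SET

# ASSUMPTION: codes that differ in spelling but likely mean the same language.
LANGDETECT_TO_STANZA_OVERRIDES = {
    'zh-cn': 'zh-hans',  # assumed: simplified Chinese
    'zh-tw': 'zh-hant',  # assumed: traditional Chinese
}


def to_stanza_code(langdetect_code, default=None):
    """Translate a detected language code into one from the stanza list.

    Returns `default` when the language has no counterpart in the stanza
    list, so the caller can fall back to another splitting strategy.
    """
    code = LANGDETECT_TO_STANZA_OVERRIDES.get(langdetect_code, langdetect_code)
    return code if code in STANZA_SET else default
```

With this helper, a detected code such as `th`, which has no entry in the stanza list, yields `None` instead of an invalid code, so the caller can decide how to handle unsupported languages.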

gioelecrispo commented 10 months ago

Hello omkarpat, that's a good shout! We can add it to the text_splitter.py file, since it relates to the sentence-splitting logic. Feel free to open a pull request.