siara-cc / Unishox_Arduino_Progmem_lib

Retrieve compressed UTF-8 strings from Arduino Flash memory (Progmem)
Apache License 2.0

Support for other languages? #1

Open atesin opened 2 years ago

atesin commented 2 years ago

Hi, I am Chilean.

I think this library could still be useful for compressing text in other languages on Arduino flash (Spanish, in my case), but it may not be as efficient. Or am I wrong?

Some letter combinations used in English are uncommon or not used at all in Spanish, such as "th", the letter "k", "ph", "sch", and "w". On the other hand, combinations like "que", "gui", "amb", and "anv" are fairly common in Spanish.

I don't know the linguistic/frequency analysis behind this library, but I feel that, for the sake of efficiency, there should be a library for each language or group of similar languages.

How is this analysis done? Could it be done for other languages by supplying large texts to be analyzed?

thanks
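To make the question concrete, here is a minimal sketch of the kind of letter-combination counting described above. It is not the analysis Unishox actually uses (the thread does not describe that); the function name `ngram_counts` and the within-word counting rule are assumptions made for illustration only.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping letter n-grams inside each word of `text`.

    Hypothetical helper for illustration; not part of Unishox.
    """
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not form n-grams.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    # A real analysis would read a large corpus instead of one short phrase.
    sample = "que quien guitarra ambos"
    for gram, count in ngram_counts(sample, 2).most_common(3):
        print(gram, count)
```

Running the same function over a large Spanish corpus and a large English corpus and comparing the most common entries would show how far the two frequency tables diverge.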

siara-cc commented 2 years ago

Hi, you are right: this library is tuned for English letter frequencies and not for the specifics of other languages. However, I think you will still get a good compression ratio for Spanish. For example, take the string "Debe su notoriedad a su colaboración con el guionista, productor y director George Lucas, que fue el primero en darle la posibilidad de ser actor." The original size is 147 bytes.

Compressed by Unishox: 91 bytes
Compressed by Smaz: 94 bytes
Compressed by Shoco: 110 bytes

In the future, I plan to improve compression further by taking into account the specifics of the language being compressed.
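For comparison, the byte counts quoted above can be turned into space savings with a few lines of arithmetic. This sketch only reuses the sizes reported in this comment; it does not call any of the compressors, and the helper name `savings_percent` is made up for illustration.

```python
# Sizes in bytes, as reported in the comment above.
ORIGINAL_SIZE = 147
COMPRESSED_SIZES = {"Unishox": 91, "Smaz": 94, "Shoco": 110}


def savings_percent(original_size, compressed_size):
    """Return the percentage of bytes saved relative to the original."""
    return 100.0 * (original_size - compressed_size) / original_size


if __name__ == "__main__":
    for name, size in COMPRESSED_SIZES.items():
        saved = savings_percent(ORIGINAL_SIZE, size)
        print(f"{name}: {size} bytes, {saved:.1f}% saved")
    # Unishox: 91 bytes, 38.1% saved
    # Smaz: 94 bytes, 36.1% saved
    # Shoco: 110 bytes, 25.2% saved
```

On this one sentence, the gap between Unishox and Smaz is 3 bytes, so a single example is not enough to rank the two for Spanish text in general.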