str_split not splitting correctly on Unicode character

tidyverse / stringr

A fresh approach to string manipulation in R

https://stringr.tidyverse.org


str_split not splitting correctly on Unicode character #542

Closed alexanderbeatson closed 3 months ago

alexanderbeatson commented 6 months ago

I am trying to split a string of Burmese Unicode characters with stringr::str_split(), but it does not return the correct values.

str_split("စမ်းသပ်မှု", "")[[1]]

it returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit(): strsplit("စမ်းသပ်မှု", "")[[1]] it splits at the character level:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split() treats the empty string "" as a regex, but stringr::str_split() returns neither individual characters nor syllables. A syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So I don't think this is actually a feature in the way issue #88 was.

For further study, could someone point me to where this splitting comes from, if possible? I found that other services such as Google also use this incorrect splitting. TIA.

gagolews commented 6 months ago

... and what would be the correct result?

alexanderbeatson commented 6 months ago

The correct return value should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

hadley commented 3 months ago

All I know about Burmese is what I've just read on Wikipedia, but it sounds like you're looking to break the string up into individual code points, not characters (which, because Burmese is an abugida rather than an alphabet, represent syllables, not individual vowels and consonants).

I don't see an obvious way to do this with stringi, but @gagolews might.
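
For reference, a minimal sketch of a code-point split that uses base R only (so it is not a stringi or stringr solution). It assumes the input is a valid UTF-8 string; the variable names are purely illustrative.

```r
# Split a string into individual Unicode code points using base R only.
x <- "စမ်းသပ်မှု"

# utf8ToInt() returns one integer per code point;
# intToUtf8(..., multiple = TRUE) converts each integer back into its own string.
code_points <- intToUtf8(utf8ToInt(x), multiple = TRUE)

# One element per code point, so combining marks such as the asat and
# vowel signs come out as separate elements.
print(code_points)
print(length(code_points))
```

This should give the same code-point-level split as the strsplit(x, "") call quoted earlier in the thread. Neither approach knows anything about Burmese syllable structure.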

alexanderbeatson commented 3 months ago

@hadley Thank you for raising the point. Burmese is indeed an abugida.

I understand that pseudo-alphabetic languages each have their own structure, which can be confusing, and that the way to break them down may even be controversial.

Please let me explain in detail how the phrase "စမ်းသပ်မှု" (meaning "testing" or "test") breaks down:

"စမ်းသပ်မှု" is a single word
contains 3 distinct syllables ["စမ်း", "သပ်", "မှု"]

str_split() breaks the syllables into (grammatically) illegal groups. For example, it breaks "စမ်း" into ["စ", "မ်", "း"], where ["မ်", "း"] cannot grammatically stand alone.

I am a native Burmese NLP researcher and I believe I could help with this implementation. I recently developed bursyl, a regex-based Burmese syllabification algorithm (with very strict grammatical rules, though they can be adjusted), and could potentially implement it in stringi for splitting Burmese text. @gagolews?

gagolews commented 3 months ago

On a side note, https://unicode-org.github.io/icu/userguide/boundaryanalysis/ says that:

*Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires more than rules over character sequences. ICU provides dictionary support for word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary languages is encountered. There is no separate API, and no extra programming steps required by applications making use of the dictionaries.*
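
A hedged sketch of how the ICU break iterators can be tried from R, assuming the stringi package is installed. stri_split_boundaries() exposes the "character" and "word" iterators. Whether the word iterator gives useful Burmese segments depends on the ICU build and its dictionary data, so no expected output is shown here.

```r
library(stringi)

x <- "စမ်းသပ်မှု"

# "character" boundaries: user-perceived characters (grapheme clusters),
# which is the level that str_split(x, "") operates on.
graphemes <- stri_split_boundaries(x, type = "character")[[1]]

# "word" boundaries: for dictionary languages, ICU consults its
# dictionaries automatically; results may vary across ICU versions.
words <- stri_split_boundaries(x, type = "word")[[1]]

print(graphemes)
print(words)
```

Note that ICU's dictionary support targets word and line boundaries, so this does not by itself produce a syllable-level split.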