ropensci / cld3

Bindings to Google's Compact Language Detector 3
https://docs.ropensci.org/cld3

detect_language_mixed(): R Session Crashing when running on empty entries #3

Open TimBMK opened 3 years ago

TimBMK commented 3 years ago

Hey!

I have a large dataset of mixed-language entries (assume 100k+) that I want to run cld3's language detection on in order to find non-English snippets. However, the R session aborts with a fatal error as soon as I run detect_language_mixed() over certain entries. I isolated the problem: as soon as the function hits an empty entry (""), it fails and takes the whole session down with it. Neither cld2::detect_language_mixed() nor cld3::detect_language() seems to have this issue, so I assume it would be an easy fix to skip these entries and return NA. Since it took me a while to figure out, implementing this in the next update might save others quite a bit of heartache. I'm running the latest cld3 release from CRAN (1.4.1).

Also, thanks for the great package! It's really helpful, since it seems to handle multi-language entries better than cld2.

jeroen commented 3 years ago

Can you try to create a minimal reproducible example?

TimBMK commented 3 years ago

test <- ""
cld3::detect_language_mixed(test)

jeroen commented 3 years ago

oh wow haha that is embarrassing

TimBMK commented 3 years ago

Probably just a little slip-up somewhere, haha. When I remove the empty entries, it runs like a charm!
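
For anyone hitting the same crash before a fix lands, the workaround of skipping empty entries can be sketched roughly as below. This is only a sketch: `safe_detect_mixed` is a hypothetical helper name, not part of cld3, and it assumes `detect_language_mixed()` is called on one string at a time with its `size` argument. Skipping NA and whitespace-only strings is an extra precaution; the thread only confirms the crash for "".

```r
# Hypothetical helper (not part of cld3): guard against inputs that are NA,
# empty, or whitespace-only before calling detect_language_mixed().
safe_detect_mixed <- function(text, size = 3) {
  if (is.na(text) || !nzchar(trimws(text))) {
    return(NA)  # nothing to classify; skips the call that crashed the session
  }
  cld3::detect_language_mixed(text, size = size)
}

# Apply entry by entry; the empty string is skipped instead of reaching cld3.
entries <- c("This is English text.", "", "Dies ist ein deutscher Satz.")
results <- lapply(entries, safe_detect_mixed)
```

Each element of `results` is either whatever `detect_language_mixed()` returns for that entry or a plain `NA` for skipped entries, so downstream code would need to handle both cases.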