statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

cnlp_annotate confusing strings with UTF-8 BOM #50

Closed: afkoeppel closed this issue 4 years ago

afkoeppel commented 5 years ago

I'm having a very strange issue with cleanNLP (v2.0.3), using R 3.4.4 in RStudio 1.1.456 on Windows.

I've loaded a custom spaCy NER model set up to detect a new NER category. The model works fine with spaCy in Python, and it also works fine in R for the vast majority of the text strings I'm testing it on. However, on strings that contain the word "first", the cnlp_annotate() call in R fails with the following error:

Error in py_call_impl(callable, dots$args, dots$keywords) : UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 18: character maps to <undefined>
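
For reference, the same kind of message can be produced in plain Python by encoding U+FEFF with a single-byte Windows code page. This is only a sketch: `cp1252` is an assumption about the code page involved, the sample string is hypothetical, and nothing in this thread confirms where the encoding step actually happens.

```python
# Sketch: reproduce a 'charmap' UnicodeEncodeError outside of cleanNLP/spaCy.
# Assumption: cp1252 stands in for whatever Windows code page is active.
text = "The first." + "\ufeff"  # hypothetical string with a trailing U+FEFF

try:
    text.encode("cp1252")  # cp1252 has no mapping for U+FEFF
    error_message = None
except UnicodeEncodeError as err:
    error_message = str(err)

print(error_message)
```

If this reasoning applies, the failure comes from converting text to a legacy code page rather than from UTF-8 handling as such.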

Google told me that '\ufeff' is a UTF-8 BOM, so I tried iconv() and various conversion tools in stringr and stringi to see if I could detect or filter out the offending characters. While searching for a common feature, I eventually discovered that all of the failing strings contain the word "first". Furthermore, even strings typed manually into RStudio (no read-in or copy-paste) fail to annotate with the same error if and only if they contain the word "first":
Manually typed string "The first." fails to annotate with the '\ufeff' error.
Manually typed string "The firs." works just fine.
Manually typed string "The irst." works just fine.
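
One way to check whether U+FEFF is really present in an input string, shown here on the Python side, is to look for the code point directly. This is a minimal sketch; `find_bom_positions` and `strip_bom` are hypothetical helper names, not part of cleanNLP or spaCy.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark code point


def find_bom_positions(text):
    """Return the indices at which U+FEFF occurs in text."""
    return [i for i, ch in enumerate(text) if ch == BOM]


def strip_bom(text):
    """Return text with every U+FEFF removed."""
    return text.replace(BOM, "")


sample = "\ufeffThe first."  # hypothetical input with a leading BOM
print(find_bom_positions(sample))
print(strip_bom(sample))
```

If the list comes back empty for a typed string such as "The first.", the character is not in the input itself and must be introduced somewhere later in the pipeline.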

Needless to say, I am baffled by this. The issue does not occur with spaCy's own language model (en_core_web_sm-2.1.0), only with my custom model. That would lead me to believe it is an issue with the model, except that I can't reproduce the error when running the same model with spaCy directly in Python, only with cnlp_annotate() in R.

Any thoughts or advice would be appreciated. I can always work around the issue by working in Python or, I suppose, by temporarily replacing all instances of "first" with "ferst" (which also causes no errors), but this was so odd to me that I had to at least ask about it. Thanks.

P.S. I'm not at liberty to share the specific spaCy model that is failing, but if it helps, I used code pretty much identical to the spaCy code here: https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py (run to train a blank NER model, not to update the existing model, and of course with different training data).

statsmaths commented 4 years ago

That's a strange phenomenon, and I have no idea off the top of my head why it's causing that problem. I just released a new version of cleanNLP (3.0.0) that might fix the issue. Otherwise, I'm at a loss if you cannot share the file causing the issue because I can't replicate the problem.