Closed: martinolmos closed this issue 5 years ago
I don't have a Windows machine to check this directly, but I believe the issue is the default encoding on your machine. Running the following before annotating the text, to set the default encoding, has a good chance of fixing it:
options(encoding="UTF-8")
You could also try saving the text as a file directly (with the proper encoding) and passing in the file names. Or, just use one of the other backends... spacy and udpipe should run fine.
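For the file-based route, here is a minimal sketch in base R of writing text to disk as UTF-8 and checking that it survives the round trip. The sentence and the variable names (`text`, `path`, `roundtrip`) are made up for illustration and are not from the original report. How the resulting path is then handed to cleanNLP depends on the package version, so that step is left out.

```r
# Hypothetical example sentence (not from the original report); \u escapes
# keep this script itself ASCII-only, so it does not depend on how the
# script file is encoded.
text <- "El ni\u00f1o comi\u00f3 una manzana."

# Write the bytes to disk as UTF-8, bypassing the session's native encoding.
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(text), path, useBytes = TRUE)

# Read the file back, declaring its contents as UTF-8.
roundtrip <- readLines(path, encoding = "UTF-8")

# TRUE if the non-ASCII characters survived the round trip.
identical(roundtrip, text)
```

If the final check is `FALSE` on your machine, the encoding problem is occurring before the text ever reaches the annotator.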
Your first solution worked perfectly. Thank you very much for your answer and for the package.
I'm trying to annotate a sentence in Spanish with cleanNLP and the stanford-corenlp backend. When I inspect the output tokens, I notice that all non-ASCII characters have been removed and the words containing them have been split.
Here is a reproducible example:
Session info: