Open Yuanchao-Xu opened 7 years ago
BERT uses UTF-8. It should support Chinese everywhere except function names, but I believe that's a limitation in R itself. Can you give me a specific example of something not working?
Originally, Chinese looks like this in BERT console:
The same problem occurred before in the R console, and it can be solved by changing the operating system's language format to Chinese (Simplified). I did the same this time, and in the BERT console it looks like this:
So I guess it's something to do with Chinese decoding?
sessionInfo:
And the function I use is getPPPList from gfer's test version on GitHub. It will scrape some Chinese info from a website.
Thanks
The data is actually read in OK; it's just not displayed properly:
The string encoding is marked as UTF-8, but the locale does not explicitly support UTF-8 so it is not printed. As far as I know (I could be wrong) Windows doesn't support UTF-8 locales. If you set the string encoding to "unknown", then it will print properly:
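As a minimal sketch of that workaround, using only base R: the string below is a made-up placeholder (the word "中文" written with escapes), not data from gfer, and whether the result displays correctly still depends on the console.

```r
# Placeholder string, written with \u escapes so the source file stays ASCII.
x <- "\u4e2d\u6587"

# Strings built from \u escapes are marked as UTF-8.
Encoding(x)

# Drop the encoding mark. The bytes are not converted, only the label changes.
y <- x
Encoding(y) <- "unknown"
Encoding(y)

# The underlying bytes are identical before and after.
identical(charToRaw(x), charToRaw(y))

print(y)
```

Because only the mark changes, assigning `Encoding(y) <- "UTF-8"` restores the original state.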
Apparently this is a common problem in Windows/R; see
http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/
I'm not entirely sure how to fix it, but I will work on it. Depending on what you need to do with the data, you may be able to ignore the problem (although it is definitely annoying), or unset the encoding on all the strings.
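If you go the "unset the encoding on all the strings" route, something along these lines might do it for a data frame. This is only a sketch: `unset_encoding` is a hypothetical helper, not part of BERT or gfer, and `ppp` is invented stand-in data rather than real getPPPList output.

```r
# Hypothetical helper: mark every character column as "unknown" encoding.
# Factor columns and column names are left untouched.
unset_encoding <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    Encoding(col) <- "unknown"
    col
  })
  df
}

# Invented stand-in for scraped data (two city names written with escapes).
ppp <- data.frame(
  name = c("\u5317\u4eac", "\u4e0a\u6d77"),
  count = c(1L, 2L),
  stringsAsFactors = FALSE
)

ppp_plain <- unset_encoding(ppp)
Encoding(ppp$name)        # marks before the change
Encoding(ppp_plain$name)  # marks after the change
```

The bytes are left alone, so the original marks can be put back later with `Encoding(col) <- "UTF-8"` if other code expects them.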
Hi @duncanwerner, thank you for the detailed answer and the solution.
It seems that this Windows problem happens in two environments, R and BERT.
In the R environment, I've actually run into this problem a few times, and it is fixed by changing the Windows language format settings to Chinese. Nothing else needs to be done in R, and the characters print normally in RStudio. But this doesn't work in the BERT environment.
In the BERT environment, the problem can be solved by the solution you mentioned above, but that solution doesn't work in the R environment.
Or this may just be due to differences between the RStudio and BERT consoles.
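One way to check whether the two consoles really differ would be to run the same diagnostics in each and compare the results. A rough sketch with base R only; `report_locale` is a hypothetical helper name and the sample string is a placeholder:

```r
# Run this in both the RStudio console and the BERT console, then compare.
report_locale <- function() {
  info <- l10n_info()
  list(
    ctype = Sys.getlocale("LC_CTYPE"),  # character-type locale in effect
    utf8  = info[["UTF-8"]],            # TRUE only in a UTF-8 locale
    mbcs  = info[["MBCS"]]              # TRUE in a multi-byte locale
  )
}
str(report_locale())

# Placeholder string: check its mark and how this console prints it.
x <- "\u4e2d\u6587"
Encoding(x)
print(x)
```

If the locale report is the same in both but the printed string differs, that would point at the console rather than at R's locale settings.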
Thanks for the excellent add-in, but it seems that BERT doesn't have any Chinese decoding.