sdllc / Basic-Excel-R-Toolkit

http://bert-toolkit.com

Can't process Chinese properly #63

Open Yuanchao-Xu opened 7 years ago

Yuanchao-Xu commented 7 years ago

Thanks for the excellent add-in, but it seems that BERT doesn't decode Chinese text properly.

duncanwerner commented 7 years ago

BERT uses UTF-8. It should support Chinese everywhere except function names, but I believe that's a limitation in R itself. Can you give me a specific example of something not working?

Yuanchao-Xu commented 7 years ago

Originally, Chinese text looked like this in the BERT console: [screenshot: _20170324105100]

The same problem occurred before in the R console, and could be solved by changing the operating system's language format to Chinese (Simplified). I did the same this time, and in the BERT console it now looks like this:

[screenshot: _20170714143710]

So I guess it's something to do with Chinese decoding?

sessionInfo: [screenshot: _20170714144338]

The function I use is getPPPList from the test version of gfer on GitHub. It scrapes some Chinese text from a website.

Thanks

duncanwerner commented 7 years ago

The data is actually read in OK; it's just not displayed properly:

[screenshot: bert]

The string encoding is marked as UTF-8, but the locale does not explicitly support UTF-8 so it is not printed. As far as I know (I could be wrong) Windows doesn't support UTF-8 locales. If you set the string encoding to "unknown", then it will print properly:

[screenshot: bert2]
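As a rough sketch of the workaround described above, the snippet below builds a Chinese string from Unicode escapes (an illustrative value, not the data from getPPPList), checks its encoding mark, and then sets the mark to "unknown". It only uses base R's `Encoding()` and `charToRaw()`. Whether the result prints correctly still depends on the console, so treat this as something to try rather than a guaranteed fix.

```r
# Illustrative string built from Unicode escapes so this snippet stays ASCII.
x <- "\u4e2d\u6587"

bytes_before <- charToRaw(x)
enc_before <- Encoding(x)   # the mark R attached to the string

# Work on a copy: drop the encoding mark. The underlying bytes are untouched.
y <- x
Encoding(y) <- "unknown"

enc_after <- Encoding(y)
bytes_unchanged <- identical(bytes_before, charToRaw(y))

# How this prints depends on the console and locale.
print(y)
```

Only the mark changes here; the bytes are still UTF-8, which is why a console that writes those bytes out as-is can show the text.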

Apparently this is a common problem in Windows/R; see

http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/

I'm not entirely sure how to fix it, but I will work on it. Depending on what you need to do with the data, you may be able to ignore the problem (although it is definitely annoying), or unset the encoding on all the strings.
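If unsetting the encoding on all the strings is the route taken, something along these lines might do it for a data frame. The helper name `unset_encoding` and the sample data frame `scraped` are made up for illustration; the real result of getPPPList may be shaped differently. Factor columns are left alone in this sketch, since their labels live in `levels()`.

```r
# Hypothetical helper: drop the encoding mark on character vectors,
# and return anything else unchanged.
unset_encoding <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "unknown"
  }
  x
}

# Made-up sample data standing in for scraped results.
scraped <- data.frame(
  name = c("\u5317\u4eac", "\u4e0a\u6d77"),
  value = c(1, 2),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame structure.
scraped[] <- lapply(scraped, unset_encoding)
```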

Yuanchao-Xu commented 7 years ago

Hi @duncanwerner, thank you for your detailed answer and the solution.

It seems that this Windows problem shows up in two environments, R and BERT.

In the R environment, I've actually run into this problem a few times, and it is fixed by changing Windows' language format settings to Chinese. Nothing else needs to be done in R, and characters print normally in RStudio. But this doesn't work in the BERT environment.

In the BERT environment, the problem can be solved by the solution you mentioned above, but that doesn't work in the R environment.

Or perhaps this is just due to differences between the RStudio and BERT consoles.
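One way to compare the two environments might be to run the same small report in both consoles and look at what differs. The function name `describe_text_env` is hypothetical; it only wraps base R's `Sys.getlocale()` and `l10n_info()`.

```r
# Hypothetical helper: collect the locale details that affect text output.
describe_text_env <- function() {
  info <- l10n_info()
  list(
    ctype = Sys.getlocale("LC_CTYPE"),  # character-type locale in use
    utf8 = isTRUE(info[["UTF-8"]]),     # TRUE if the locale is UTF-8
    mbcs = isTRUE(info[["MBCS"]]),      # TRUE for multi-byte locales
    codepage = info[["codepage"]]       # reported on Windows; NULL elsewhere
  )
}

env <- describe_text_env()
str(env)
```

If both consoles report the same locale, the difference is more likely in how each console displays strings than in R's settings.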