stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.86k stars 1.51k forks

Looking for Help #114

Open yassine-saoudi opened 6 years ago

yassine-saoudi commented 6 years ago

I am using word embeddings from the GloVe model for sentiment analysis of Arabic text (tweets, reviews, and standard Arabic), but I get this error:

"TypeError: UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 2: invalid start byte"

I tried loading the model while ignoring Unicode errors (unicode_errors='ignore'), but that didn't solve the problem. Can you help me and point me toward a fix for this error?

Regards.

npeirson commented 6 years ago

Hi! This error is essentially saying that something being read (a file, at whatever line is being imported) contains the byte 0xba at position 2, and that byte can't start a valid 'utf8' character. There are many possible causes, some of which I have listed below (and hopefully other folks can contribute more).

Most likely, something (a header, a shebang, or an optional function argument) somewhere (in the file being opened, or in the code that opens it) is forcing data to be interpreted as utf8 when it shouldn't be, or as some other encoding when it ought to be utf8.
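One way to narrow this down is to look at the raw bytes directly. Below is a minimal sketch, not taken from the GloVe repo, that reports where UTF-8 decoding first fails and which other encodings decode the same bytes without error. The function names are hypothetical, and the candidate list (cp1256 and iso-8859-6) is only an assumption about encodings often used for Arabic text.

```python
def first_invalid_utf8(data, context=16):
    """Return (offset, bad_byte, nearby_bytes) for the first position where
    UTF-8 decoding of `data` fails, or None if it decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        lo = max(err.start - context, 0)
        return err.start, data[err.start], data[lo:err.start + context]
    return None


def decodable_as(data, candidates=("utf-8", "cp1256", "iso-8859-6")):
    """Return the candidate encodings that decode `data` without error.

    Assumption: the candidate list is a guess. A clean decode does not
    prove that an encoding is the one the file was written in.
    """
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Example use on a file (the path is a placeholder):
#     with open("vectors.txt", "rb") as f:
#         data = f.read()
#     print(first_invalid_utf8(data), decodable_as(data))
```

This reads the whole file into memory, so for a very large vectors file it may be more practical to check only a slice of it. If the bytes turn out to be in another encoding, re-saving the file as UTF-8 before loading is usually safer than ignoring the errors.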

Other potential causes include...

I hope that helps!

yassine-saoudi commented 6 years ago

Thanks, npeirson.

anjalibhavan commented 5 years ago

Hi, has this been solved for you @yassine-saoudi ? I am facing the same problem and would like to know what you did regarding this.

yassine-saoudi commented 5 years ago

Hi Ms. Anjali Bhavan, if your target language is Arabic, this type of error is very common. The main causes are in the preprocessing step of the training data. I suggest filtering words with the function "is_valid_arabic_word(word)", and I strongly recommend removing the Arabic comma « ، ». Best regards.
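The thread does not show how "is_valid_arabic_word" is defined, so the version below is only a guess at what such a check might look like: it accepts a token made up entirely of Arabic letters and diacritics (U+0621 to U+0652). The helper clean_line and the choice to replace the Arabic comma with a space are likewise assumptions for illustration.

```python
import re

# Assumption: a "valid Arabic word" contains only Arabic letters
# (U+0621..U+064A) and diacritics (U+064B..U+0652). The real function
# mentioned in the thread may use a different rule.
_ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")

ARABIC_COMMA = "\u060C"  # the character «،»


def is_valid_arabic_word(word):
    """True if every character of `word` is in the assumed Arabic range."""
    return _ARABIC_WORD.fullmatch(word) is not None


def clean_line(line):
    """Replace Arabic commas with spaces, split on whitespace, and keep
    only the tokens that pass is_valid_arabic_word."""
    line = line.replace(ARABIC_COMMA, " ")
    return [tok for tok in line.split() if is_valid_arabic_word(tok)]
```

Running each line of the training corpus through something like clean_line, and writing the result back out as UTF-8 before training, drops Latin tokens, digits, and the Arabic comma, since none of them fall in the accepted range.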
