interrogator / corpkit

A toolkit for corpus linguistics

UnicodeDecodeError #49

Open sspina opened 6 years ago

sspina commented 6 years ago

Hello,

when I try to parse a corpus, I get the following error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 33: ordinal not in range(128)

I attach the log file.

Thank you for your help,

Stefania

log-02.txt https://github.com/interrogator/corpkit/files/1975812/log-02.txt

interrogator commented 6 years ago

Hey,

Sorry, I’ve more or less moved on from corpkit to other projects, and don’t know if I’ll have time to make any needed fixes.

The parsing error seems to be caused by character encodings in the text. Meaning, there are probably non-ASCII characters in there, like umlauts or other accented letters.

See here for more information: https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte
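For illustration, here is a minimal sketch of how this kind of error can arise and one way to work around it before parsing. It rests on assumptions: the attached log has not been analysed here, the sample text is made up, and `read_text` is a hypothetical helper, not part of corpkit. In UTF-8, the byte 0xcc can be the first byte of a combining accent (for example U+0300, the combining grave accent), so an accented letter stored in decomposed form is one plausible trigger, though other encodings could produce the same byte.

```python
import io
import unicodedata


def read_text(path, encoding="utf-8"):
    """Hypothetical helper (not a corpkit function): read a file with an
    explicit encoding instead of relying on an ASCII default, then
    normalise to composed (NFC) form."""
    with io.open(path, encoding=encoding) as handle:
        return unicodedata.normalize("NFC", handle.read())


# Made-up sample: "caffè" with the accent stored as a separate combining
# character (NFD form). Its UTF-8 encoding contains the byte 0xcc.
sample = unicodedata.normalize("NFD", "caff\u00e8").encode("utf-8")

try:
    # Decoding as ASCII fails on the first byte above 0x7f.
    sample.decode("ascii")
except UnicodeDecodeError as err:
    print("ASCII decoding failed:", err)

# Decoding with the encoding the text was actually written in succeeds.
print(sample.decode("utf-8"))
```

If the corpus files are not UTF-8, a different `encoding` value would be needed. Re-saving the files as UTF-8 in a text editor is another option.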

If I do manage to get back to this project I’ll bear this in mind.
