interrogator / corpkit

A toolkit for corpus linguistics

Parsing errors - "EOFError" and "UnicodeDecodeError" #44

Open bjornekstrom opened 7 years ago

bjornekstrom commented 7 years ago

Hello,

I'm currently using corpkit as a research tool for my master's thesis in library and information science, as well as for experimenting in my spare time. It works well, but occasionally I get error messages when parsing a corpus that I don't understand. Could you perhaps explain them to me?

They're either

"EOFError: EOF when reading a line"

or

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)"

Most recently, these messages occurred when trying to parse a plain text file of James Joyce's Ulysses retrieved from Project Gutenberg (https://www.gutenberg.org/ebooks/4300). As far as I understand, the file is encoded as UTF-8 and should work fine.

Thanks in advance.

interrogator commented 7 years ago

I'll take a look at the problem in more detail, but you should be able to go to "Help -> Show log" in the menu bar, which will give more details about the error. You can then upload that log, and we'll get more information about the problem.

interrogator commented 7 years ago

So, using the latest version of the GUI (2.3.8), I downloaded that file as UTF-8 (no copying and pasting) and put it in a folder called 'ulysses'. I made a new project called 'Joyce', added the data folder, and hit 'Parse'. It split the file up (expected behaviour) and parsed all the pieces with no problem.

Note that I'm using the 'development' version of the app, available via GitHub and the command line, but not yet downloadable as an 'app'. Are you using the .app?
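
For reference while the log is pending: the byte 0xef at position 0 in the UnicodeDecodeError quoted above is the first byte of a UTF-8 byte order mark (EF BB BF), which some UTF-8 text files carry at the very start. Whether this particular file has one, and whether that is what trips the parser, is not confirmed in this thread. The following is a minimal Python sketch under that assumption; the function names and the sample text are hypothetical and are not part of corpkit.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


def decode_text(raw: bytes) -> str:
    """Decode UTF-8 bytes; the 'utf-8-sig' codec drops a leading BOM if present."""
    return raw.decode("utf-8-sig")


# Hypothetical sample standing in for the start of a downloaded file:
# a BOM followed by some UTF-8 text.
sample = codecs.BOM_UTF8 + "Stately, plump Buck Mulligan".encode("utf-8")

# Decoding the same bytes as ASCII fails on the very first byte (0xef).
try:
    sample.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = str(err)

text = decode_text(sample)

print(has_utf8_bom(sample))
print(ascii_error)
print(text)
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `has_utf8_bom`. If it returns True, re-saving the file as UTF-8 without a BOM may be worth trying before parsing again.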