invalid start byte errors with JSTOR files

inpho / vsm

Vector Space Model Framework developed for InPhO

http://inpho.github.io/vsm


invalid start byte errors with JSTOR files #125

Closed: colinallen closed 8 years ago

colinallen commented 8 years ago

Many of the JSTOR files fail during vsm init with this error:

UnicodeDecodeError: 'utf8' codec can't decode byte ... in position ...: invalid start byte

The message for file 685266.pdf has ... 0xf6 (= ö) in position 97...

The text file 685266.txt does have ö at that position (in the middle of Schrödinger), and changing it to a plain 'o' allows everything to proceed.
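A minimal sketch of why this byte trips the decoder, assuming the text file stores ö as the single Latin-1 byte 0xf6 (the sample string here is illustrative, not the contents of 685266.txt). In UTF-8, 0xf6 can never begin a character, so strict decoding stops there:

```python
# Assumption: the file holds "ö" as the single Latin-1 byte 0xf6.
raw = "Schrödinger".encode("latin-1")

try:
    raw.decode("utf-8")
    reason, position = None, None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte, exc.reason the message.
    reason, position = exc.reason, exc.start

print(reason, position)

# Decoding with the encoding the bytes were actually written in works.
as_latin1 = raw.decode("latin-1")

# Properly encoded UTF-8 stores "ö" as the two bytes 0xc3 0xb6 instead.
utf8_bytes = "Schrödinger".encode("utf-8")
```

Replacing ö with a plain 'o' works only because the remaining bytes are pure ASCII, which is valid in both encodings.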

colinallen commented 8 years ago

For 685552.txt it is the same character (same author) in position 96. Elsewhere in the document, the accents are missing from "Schrodinger". This is first-page content of the JSTOR pdf, and is their rendering of the JSTOR metadata. So, my hypothesis is that their OCR txt content starts on the second page and has no unicode, but their first page is generated directly from their metadata, and they are including accented characters without declaring the encoding correctly.

colinallen commented 8 years ago

Stripped out all the first pages and still getting similar errors, so apparently the problem is not confined to the initial page.
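One way to check where the offending bytes sit in a file is to list every offset at which strict UTF-8 decoding fails. This is only a sketch: `undecodable_offsets` is a hypothetical helper, not part of vsm, and the sample bytes are made up.

```python
from typing import List


def undecodable_offsets(data: bytes) -> List[int]:
    """Return the offsets at which strict UTF-8 decoding of data fails."""
    offsets = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            bad = start + exc.start  # exc.start is relative to the slice
            offsets.append(bad)
            start = bad + 1  # resume just past the bad byte
    return offsets


# Illustrative input: two Latin-1 "ö" bytes in otherwise ASCII text.
sample = b"Schr\xf6dinger and Schr\xf6dinger"
found = undecodable_offsets(sample)
print(found)
```

To run it on a real file, read the bytes with `open(path, "rb").read()`. Offsets well past the first page would support the observation that the problem is not limited to the metadata page.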

JaimieMurdock commented 8 years ago

Fixed via inpho/vsm v0.4a7. It now falls back to auto-detecting the encoding when UnicodeDecodeError is raised. This particular issue was due to the files being converted to Latin-1 encoding by pdf2text, which may be a LANG environment variable issue.
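The fallback idea can be sketched roughly as follows. This is not the code from v0.4a7: instead of real auto-detection it tries a fixed list of candidate encodings, and `decode_with_fallback` and its `fallbacks` parameter are hypothetical names chosen for illustration.

```python
def decode_with_fallback(data: bytes, fallbacks=("cp1252", "latin-1")) -> str:
    """Decode as UTF-8 first, then try each fallback encoding in order."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not valid UTF-8, so try the fallbacks below
    for encoding in fallbacks:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Reached only if no fallback fits; latin-1 accepts any byte sequence,
    # so with the default list this line never runs.
    return data.decode("utf-8", errors="replace")


latin1_text = decode_with_fallback(b"Schr\xf6dinger")
utf8_text = decode_with_fallback("Schrödinger".encode("utf-8"))
print(latin1_text, utf8_text)
```

Trying UTF-8 first matters: valid UTF-8 rarely arises by accident from Latin-1 text, whereas Latin-1 will decode any bytes without error, so it is only safe as a last resort.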