Closed timalamenciak closed 3 months ago
Update on this - I had the error crop up again when copying-and-pasting from a PDF, so I dug into the code. This block appears to be the challenge (lines 324-329 of cli.py):
    if use_textract:
        import textract
        text = textract.process(inputfile).decode("utf-8")
    else:
        text = open(inputfile, "r").read()
On my own version, I added errors="ignore" to the open() call. This silently drops any bytes that can't be decoded, which may lose data, but I think in this package's use case, that won't be crippling.
    if use_textract:
        import textract
        text = textract.process(inputfile).decode("utf-8")
    else:
        text = open(inputfile, "r", errors="ignore").read()
Textract is still not working.
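For anyone comparing options, here is a minimal standalone sketch of what the errors="ignore" workaround does to text pasted from a PDF, next to errors="replace", which keeps a visible marker where a byte was lost. The helper name read_text_lenient and the sample bytes are made up for illustration and are not part of cli.py. It also assumes the input is meant to be UTF-8, which the original open() call does not specify.

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8", errors="ignore"):
    """Read a text file, tolerating bytes that cannot be decoded.

    Hypothetical helper for illustration only; not part of the package.
    errors="ignore" drops undecodable bytes, errors="replace" swaps each
    one for U+FFFD so the loss stays visible.
    """
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()


# Sample input: 0x96 is an en dash in Windows-1252 but is not valid UTF-8,
# which is a common result of copying text out of a PDF.
raw = b"soil moisture \x96 20%\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    demo_path = tmp.name

try:
    ignored = read_text_lenient(demo_path, errors="ignore")
    replaced = read_text_lenient(demo_path, errors="replace")
finally:
    os.remove(demo_path)

print(repr(ignored))
print(repr(replaced))
```

With "ignore" the bad byte simply disappears; with "replace" it shows up as a replacement character, which may be easier to spot when checking extracted text afterwards.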
Might just fix this with #421. In the meantime, I'll have a fix here shortly along the lines of what you suggest - though I don't recommend parsing entire PDFs with it unless you want to get a lot of unreadable characters.
Hi @timalamenciak - give PDF parsing a try in v1.0.2 (just released) - it now uses the option --use-pdf instead of --use-textract
Thrilling! That worked.
Thanks @caufieldjh !
Trying to pull in the PDF from this article throws the below error: https://onlinelibrary.wiley.com/doi/10.1002/eco.1705
I tested this on other PDFs and got the same result.