captnswing closed this 3 months ago
I don't believe Textract provides any debug log output other than what ends up in those JSON files that you can access using s3-ocr fetch.
Could you share a copy of the PDF that you're having trouble with? I'd like to dig in further. You can share here or email it to me at swillison at Google's email service.
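In case it helps narrow things down, here is a minimal sketch for checking which pages in one of those JSON files actually produced text. It assumes the file follows the shape of a Textract response, with a top-level "Blocks" list whose entries carry "BlockType" and "Page" fields; I have not confirmed that s3-ocr stores it exactly that way. The function names and the sample data are made up for illustration.

```python
from collections import Counter


def lines_per_page(textract_result):
    """Count LINE blocks per page in a Textract-style result dict.

    Assumes a top-level "Blocks" list; blocks without a "Page" field
    are treated as belonging to page 1.
    """
    counts = Counter()
    for block in textract_result.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            counts[block.get("Page", 1)] += 1
    return counts


def pages_without_text(textract_result, total_pages):
    """Return the page numbers (1-based) that have no LINE blocks."""
    counts = lines_per_page(textract_result)
    return [page for page in range(1, total_pages + 1) if counts[page] == 0]


# Tiny hand-made sample (not real Textract output): page 2 has a WORD
# block but no LINE block, so it is reported as having no text.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "first page"},
        {"BlockType": "WORD", "Page": 2, "Text": "stray"},
        {"BlockType": "LINE", "Page": 3, "Text": "third page"},
    ]
}
print(pages_without_text(sample, total_pages=3))
```

To try it on a real file, load it with json.load() and pass the expected page count (79 in the case reported here) as total_pages. If the missing pages show up with no LINE blocks at all, that would point at Textract rather than at s3-ocr.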
Thank you Simon! I sent you that pdf through email...
I actually have a bunch of pdfs like it, where not all pages are scanned (even though Textract jobstatus claims otherwise). Makes me wonder if there's some additional configuration needed in my AWS account for Textract... 🤔
Just circling back here with my learnings - FYI
I've been trying different things: Textract, Tesseract, and all kinds of wrapper scripts (ocrmypdf etc.) to wrangle my old Swedish and French PDFs (always scanned, often badly scanned).
I noticed how the OCR function of my favorite CleanShot X tool never struggles to get the correct text from pdfs that all the other tools failed at.
So now I'm using the Mac Vision framework for OCR, via the pyobjc-framework-Vision package.
This produces superior results, both in OCR accuracy and in processing speed. The OCR accuracy on even the weirdest old French (badly scanned) PDFs is vastly improved.
But ofc, only works on a Mac....
First of all, I love this, thanks for creating s3-ocr!
Everything works as it should, following the instructions in your TIL https://simonwillison.net/2022/Jun/30/s3-ocr/
...except that the very first PDF file I tried it with has 79 pages, of which only 25 are OCR'd (but s3-ocr inspect-job... shows 79 pages successfully processed).
Which makes me think something is wrong on the Textract side, and not s3-ocr. So I would like to get to the debug output of Textract, to understand more.
Log of my session follows, FYI. I can furnish you with the pdf I am trying this with through a backchannel.