simonw / s3-ocr

Tools for running OCR against files stored in S3
Apache License 2.0

Not all pages are OCR'd, but Textract claims otherwise #28

Closed. captnswing closed this issue 3 months ago

captnswing commented 2 years ago

First of all, I love this, thanks for creating s3-ocr!

Everything works as it should, following the instructions in your TIL https://simonwillison.net/2022/Jun/30/s3-ocr/

...except that the very first PDF file I tried it with has 79 pages, of which only 25 are OCR'd (but s3-ocr inspect-job... shows 79 pages successfully processed).

That makes me think something is wrong on the Textract side, not in s3-ocr. So I would like to get at Textract's debug output to understand more.

A log of my session follows, FYI. I can send you the PDF I am trying this with through a backchannel.


$ pdfinfo règlement\ copro.pdf
Creator:         RICOH Aficio MP 9002
Producer:        RICOH Aficio MP 9002
CreationDate:    Mon Jun 22 10:19:09 2015 CEST
ModDate:         Mon Jun 22 10:19:09 2015 CEST
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           79
Encrypted:       no
Page size:       841.92 x 595.2 pts (A4)
Page rot:        270
File size:       7821658 bytes
Optimized:       no
PDF version:     1.4
$ s3-credentials put-object net.captnswing.s3-ocr règlement\ copro.pdf règlement\ copro.pdf -a ocr.json
Uploading  [####################################]  100%
$ s3-credentials list-bucket net.captnswing.s3-ocr -a ocr.json
[
  {
    "Key": "r\u00e8glement copro.pdf",
    "LastModified": "2022-08-12 06:49:30+00:00",
    "ETag": "\"e0a64ab99ca450caafc67017521d0fae\"",
    "Size": 7821658,
    "StorageClass": "STANDARD"
  }
]
$ s3-ocr start net.captnswing.s3-ocr --all -a ocr.json
Found 0 files with .s3-ocr.json out of 1 PDFs
Starting OCR for règlement copro.pdf, Job ID: f369b8334fd34e06f83e0f5645cc31d113b1b9fc2bbfbcb9696d28e5a43a6aad
$ s3-ocr status net.captnswing.s3-ocr -a ocr.json
1 complete out of 1 jobs
$ s3-ocr inspect-job f369b8334fd34e06f83e0f5645cc31d113b1b9fc2bbfbcb9696d28e5a43a6aad
{
  "DocumentMetadata": {
    "Pages": 79
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
$ s3-ocr fetch net.captnswing.s3-ocr règlement\ copro.pdf
$ ls -l f369b8334fd34e06f83e0f5645cc31d113b1b9fc2bbfbcb9696d28e5a43a6aad*json | wc -l
25

simonw commented 2 years ago

I don't believe Textract provides any debug log output other than what ends up in those JSON files that you can access using s3-ocr fetch.
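Since those JSON files are the only output available, one way to check whether 25 files really means 25 pages is to count the distinct page numbers inside them. The sketch below assumes each fetched file is Textract JSON with a top-level `Blocks` list whose entries carry a `Page` number, and that the files sit in the current directory with names starting with the job ID (as the `ls` glob in the session log suggests). The function name `pages_in_textract_output` is made up for illustration.

```python
import json
from pathlib import Path


def pages_in_textract_output(directory, prefix):
    """Return (number of JSON files, sorted list of distinct page numbers).

    Assumes every matching file holds a Textract response with a
    top-level "Blocks" list; blocks without a "Page" key are ignored.
    """
    pages = set()
    # Same pattern as the shell glob `<job id>*json` used in the session log.
    files = sorted(Path(directory).glob(f"{prefix}*json"))
    for path in files:
        data = json.loads(path.read_text(encoding="utf-8"))
        for block in data.get("Blocks", []):
            if "Page" in block:
                pages.add(block["Page"])
    return len(files), sorted(pages)


if __name__ == "__main__":
    job_id = "f369b8334fd34e06f83e0f5645cc31d113b1b9fc2bbfbcb9696d28e5a43a6aad"
    file_count, pages = pages_in_textract_output(".", job_id)
    print(f"{file_count} files covering {len(pages)} distinct pages")
```

If the page count printed here is higher than the file count, the results are split across files by something other than page boundaries, and the number of files alone does not show that pages were skipped.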

simonw commented 2 years ago

Could you share a copy of the PDF that you're having trouble with? I'd like to dig in further. You can share here or email it to me at swillison at Google's email service.

captnswing commented 2 years ago

Thank you Simon! I sent you that PDF by email...

I actually have a bunch of PDFs like it, where not all pages are scanned (even though the Textract JobStatus claims otherwise). Makes me wonder if there's some additional configuration needed in my AWS account for Textract... 🤔

captnswing commented 3 months ago

Just circling back here with what I've learned, FYI.

I've been trying different things: Textract, Tesseract, etc., plus all kinds of wrapper scripts (ocrmypdf etc.) to wrangle my old Swedish and French PDFs (always scanned, often badly scanned).

I noticed that the OCR function of my favorite tool, CleanShot X, never struggles to get the correct text from PDFs that all the other tools failed on.

So now I'm using the Mac Vision framework for OCR, via the pyobjc-framework-Vision package.

This produces superior results, both in OCR accuracy and in processing speed. The accuracy on even the weirdest old (badly scanned) French PDFs is vastly improved.

But of course, it only works on a Mac...
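
For readers who want to try the same route, here is a minimal sketch of text recognition through the Vision framework with pyobjc-framework-Vision. It is not the script used above. The function name `recognize_text` and the example file name are assumptions, and it handles a single image file, so each PDF page would first need to be rendered to an image by some other tool. The imports sit inside the function so the file can at least be loaded on a non-Mac machine.

```python
def recognize_text(image_path, languages=None):
    """Return the recognized text lines for one image file (macOS only).

    `languages` is an optional list of language codes; which codes are
    accepted depends on the macOS version, so none are set by default.
    """
    # Imported lazily: these modules exist only where pyobjc is installed.
    import Vision
    from Foundation import NSURL

    url = NSURL.fileURLWithPath_(str(image_path))

    request = Vision.VNRecognizeTextRequest.alloc().init()
    # Prefer the slower, more accurate recognition path.
    request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
    request.setUsesLanguageCorrection_(True)
    if languages:
        request.setRecognitionLanguages_(languages)

    handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(url, None)
    # pyobjc returns the NSError out-parameter as a second return value.
    success, error = handler.performRequests_error_([request], None)
    if not success:
        raise RuntimeError(f"Vision OCR failed: {error}")

    lines = []
    for observation in request.results() or []:
        # Keep only the top-ranked candidate for each detected text region.
        candidates = observation.topCandidates_(1)
        if candidates:
            lines.append(str(candidates[0].string()))
    return lines


if __name__ == "__main__":
    # "page-01.png" is a hypothetical file name.
    print("\n".join(recognize_text("page-01.png")))
```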