simonw / s3-ocr

Tools for running OCR against files stored in S3
Apache License 2.0

status command should show if OCR has completed #17

Open simonw opened 2 years ago

simonw commented 2 years ago

This is actually quite difficult.

It turns out the textract-output/JOB_ID folder is created, empty, early on in the process. Then files called 1, 2 and so on are added to it - but they're not all added at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.

simonw commented 2 years ago

I think the only reliable way of telling if OCR has completed is to call inspect-job:

But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.

I think the most efficient way to do this would be to check the expensive API for completion of each job, but then to update the .s3-ocr.json file for that key to cache the fact that we know that OCR has completed.
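A rough sketch of that check-then-cache idea, not the actual s3-ocr implementation. It assumes the `.s3-ocr.json` file holds a `job_id` field, and the `complete` flag is a hypothetical addition. `s3` and `textract` are expected to be boto3 clients (`boto3.client("s3")`, `boto3.client("textract")`), passed in so the function is easy to test. `MaxResults=1` is intended to keep the Textract response small.

```python
import json


def is_ocr_complete(s3, textract, bucket, key):
    """Return True if OCR has finished for `key`, caching a positive answer in S3."""
    meta_key = key + ".s3-ocr.json"
    body = s3.get_object(Bucket=bucket, Key=meta_key)["Body"].read()
    meta = json.loads(body)

    # Cheap path: an earlier check already recorded completion.
    # ("complete" is a hypothetical field, not part of the current file format.)
    if meta.get("complete"):
        return True

    # Expensive path: ask Textract about the job.
    # Assumes the JSON file stores the job ID under "job_id".
    response = textract.get_document_text_detection(
        JobId=meta["job_id"], MaxResults=1
    )
    if response["JobStatus"] != "SUCCEEDED":
        return False

    # Cache the result so later status checks can skip the Textract call.
    meta["complete"] = True
    s3.put_object(
        Bucket=bucket, Key=meta_key, Body=json.dumps(meta).encode("utf-8")
    )
    return True
```

Only a successful job is cached here, so a job that is still in progress gets re-checked on the next run.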

simonw commented 2 years ago

Another option: add a file called key.pdf.s3-ocr-complete.json indicating the OCR has finished. That way we don't need to GET each individual file to check status - we can check status on everything just by listing all keys in the bucket.
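A minimal sketch of what a listing-only status check could look like, assuming the marker file is named by appending `.s3-ocr-complete.json` to the PDF key. `list_keys` and `ocr_status` are hypothetical helper names, `s3` is expected to be a boto3 S3 client, and the optional `prefix` argument restricts the listing to keys under a given prefix.

```python
COMPLETE_SUFFIX = ".s3-ocr-complete.json"


def list_keys(s3, bucket, prefix=""):
    """Yield every key in the bucket under `prefix`, following pagination."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty listing have no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def ocr_status(keys):
    """Map each PDF key to True if its completion marker is also present."""
    keys = set(keys)
    pdfs = sorted(k for k in keys if k.lower().endswith(".pdf"))
    return {pdf: (pdf + COMPLETE_SUFFIX) in keys for pdf in pdfs}
```

Usage would be something like `ocr_status(list_keys(s3, "my-bucket"))`, which needs only list requests and no per-file GET.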

Even better: if we change the design of those JSON files to all live in the s3-ocr/ folder instead we can do a status check just with a single fetch of every key starting with that prefix, see: