Open simonw opened 2 years ago
I think the only reliable way of telling if OCR has completed is to call inspect-job.
But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.
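A rough sketch of a cheaper status check, assuming the job was started through Textract's document text detection API and that a boto3 Textract client is passed in. The function names `ocr_job_status` and `ocr_is_complete` are made up for illustration. Passing `MaxResults=1` should keep the response small, since only the `JobStatus` field is needed here:

```python
def ocr_job_status(textract, job_id):
    """Return the Textract JobStatus string for job_id.

    MaxResults=1 asks for as few blocks as possible, because only the
    status field of the response is used.
    """
    response = textract.get_document_text_detection(JobId=job_id, MaxResults=1)
    return response["JobStatus"]


def ocr_is_complete(textract, job_id):
    # Only SUCCEEDED counts as complete in this sketch; IN_PROGRESS,
    # FAILED and PARTIAL_SUCCESS are all treated as "not complete".
    return ocr_job_status(textract, job_id) == "SUCCEEDED"
```

In real use the client would be something like `boto3.client("textract")`. This still costs one API call per job, so it does not remove the need for caching.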
I think the most efficient way to do this would be to check the expensive API for completion of each job, but then to update the .s3-ocr.json file for that key to cache the fact that we know that OCR has completed.
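Something like the following could implement that caching idea. It is only a sketch under assumptions: that the file for a key is named `<key>.s3-ocr.json`, that it contains a JSON object with a `job_id` field, and that adding a `complete` field to it is acceptable. Both field names and the function name `is_complete_cached` are hypothetical:

```python
import json


def is_complete_cached(s3, textract, bucket, key):
    """Check OCR completion for key, caching a positive answer in S3."""
    marker_key = key + ".s3-ocr.json"  # assumed naming convention
    body = s3.get_object(Bucket=bucket, Key=marker_key)["Body"].read()
    info = json.loads(body)
    if info.get("complete"):
        return True  # cached result, no Textract call needed

    response = textract.get_document_text_detection(
        JobId=info["job_id"], MaxResults=1
    )
    if response["JobStatus"] != "SUCCEEDED":
        return False  # not cached, so the next call checks again

    # Record the result so later checks can skip the Textract API.
    info["complete"] = True
    s3.put_object(
        Bucket=bucket,
        Key=marker_key,
        Body=json.dumps(info).encode("utf-8"),
    )
    return True
```

Only the completed state is cached, because an in-progress job has to be checked again later anyway. This still needs one GET per key.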
Another option: add a file called key.pdf.s3-ocr-complete.json indicating the OCR has finished. That way we don't need to GET each individual file to check status - we can check status on everything just by listing all keys in the bucket.
Even better: if we change the design of those JSON files to all live in the s3-ocr/ folder instead, we can do a status check just with a single fetch of every key starting with that prefix.
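A sketch of that prefix-based check, assuming completion markers end up at `s3-ocr/<key>.s3-ocr-complete.json`. That layout is a guess that combines the two ideas above, and `completed_keys` is a hypothetical helper name:

```python
def completed_keys(s3, bucket, prefix="s3-ocr/", suffix=".s3-ocr-complete.json"):
    """Return the set of original keys that have a completion marker under prefix."""
    done = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            name = obj["Key"]
            if name.endswith(suffix):
                # Strip the prefix and the marker suffix to recover the key.
                done.add(name[len(prefix):-len(suffix)])
    return done
```

S3 returns at most 1,000 keys per list request, so on a large bucket the "single fetch" turns into a handful of paginated requests. That is still far fewer than one GET per file.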
This is actually quite difficult.
It turns out the textract-output/JOB_ID folder is created, empty, early on in the process. Then files called 1 and 2 and so on are added to it - but they're not all added at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.