Closed ethanscorey closed 2 years ago
Oh this is an interesting one!
It looks like the default limit for that is listed here: https://docs.aws.amazon.com/general/latest/gr/textract.html
Maximum number of asynchronous jobs per account that can simultaneously exist:
- US East (N. Virginia): 600
- US West (Oregon): 600
- All other regions: 100
There's an easy fix and a harder fix.
The easier fix is to submit a maximum of 600 (or 100, depending on the region) at a time. You'll then have to call s3-ocr start ... again a second or third time to queue up the remaining documents, once those initial documents have been completed.
This is easier to build, but it kind of sucks.
The alternative solution is for s3-ocr start to grow the ability to queue up the maximum allowed, then stay running and poll Textract to see if any have completed. As jobs complete it can queue up more, until everything has been queued.
This is a lot trickier to build but would probably be a better option for most users.
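A minimal sketch of that keep-topped-up loop, assuming hypothetical start_job and job_is_done callables standing in for the real Textract calls (StartDocumentTextDetection and GetDocumentTextDetection):

```python
import time

def run_all(keys, start_job, job_is_done, limit=100, poll_every=5):
    """Keep at most `limit` Textract jobs in flight, topping up as
    earlier jobs complete. start_job(key) returns a job id and
    job_is_done(job_id) polls for completion - both are stand-ins
    for the real boto3 Textract calls."""
    pending = list(keys)
    in_flight = {}  # job_id -> key
    while pending or in_flight:
        # Submit more work until we hit the concurrency limit
        while pending and len(in_flight) < limit:
            key = pending.pop(0)
            in_flight[start_job(key)] = key
        # Drop any jobs that have finished since the last check
        for job_id in [j for j in in_flight if job_is_done(j)]:
            del in_flight[job_id]
        if in_flight:
            time.sleep(poll_every)
```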
Here's a simpler way to implement it: items are queued up one at a time here: https://github.com/simonw/s3-ocr/blob/ba47d9af8235aa6031493901c6f995a311f0f139/s3_ocr/cli.py#L156
I could catch that error here and add a retry after a sleep.
I'm going to have the default behavior be for it to retry if it sees this error, with a sleep that increases from 1s to 2s to 4s to 8s but then sticks at 8s.
I'll add an option to exit rather than retry - I'll call that --no-retry.
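Sketched out, the retry loop could look something like this - LimitExceededError here is a stand-in for the botocore exception Textract actually raises, and the real code would wrap the start_document_text_detection call:

```python
import time

class LimitExceededError(Exception):
    """Stand-in for botocore's 'too many concurrent jobs' exception."""

def start_with_retry(start_fn, retry=True, max_sleep=8, _sleep=time.sleep):
    """Call start_fn() and, if the concurrency limit is hit, retry with
    a sleep that doubles from 1s up to a cap of 8s. With retry=False
    (the proposed --no-retry option) the error is raised instead."""
    backoff = 1
    while True:
        try:
            return start_fn()
        except LimitExceededError:
            if not retry:
                raise
            _sleep(backoff)
            backoff = min(backoff * 2, max_sleep)  # 1, 2, 4, 8, 8, ...
```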
I need a bucket with 102 PDFs in it in an "other" region to test this out.
Running this:
mkdir -p /tmp/many-pdfs
for i in {1..102}
do
echo $i
echo "<h1>$i</h1>" > /tmp/many-pdfs/$i.html
shot-scraper pdf /tmp/many-pdfs/$i.html -o /tmp/many-pdfs/$i.pdf
done
Created a bucket with:
s3-credentials create s3-ocr-many-pdfs --create-bucket --bucket-region eu-west-2
I put it in eu-west-2 (London) because that region has Textract support but should also have a 100 item limit.
I used Transmit to upload all 102 of those generated PDFs.
Got this error from s3-ocr start s3-ocr-many-pdfs --all:
botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentTextDetection operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
https://stackoverflow.com/a/64511389/6083 suggests that the fix is to pass the region here:
client = boto3.client('textract', region_name='us-east-2')
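For s3-ocr the region should match the bucket rather than being hard-coded. A sketch of looking it up with S3's real get_bucket_location API (region_for_bucket is a hypothetical helper, and note S3 reports a None LocationConstraint for buckets in the original us-east-1 region):

```python
def region_for_bucket(bucket, s3_client):
    """Return the region a bucket lives in, given a boto3 S3 client.
    get_bucket_location returns a None LocationConstraint for us-east-1."""
    constraint = s3_client.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    return constraint or "us-east-1"
```

The Textract client can then be created with boto3.client("textract", region_name=region_for_bucket(bucket, s3)) so it matches whichever region the bucket is in.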
I tested it against the bucket with 102 items in it... but it submitted them all just fine, because Textract executes fast enough that the first few had already been processed by the time numbers 101 and 102 were submitted.
Creating a full 1,000 test PDFs:
for i in {102..1000}
do
echo $i
echo "<h1>$i</h1>" > /tmp/many-pdfs/$i.html
shot-scraper pdf /tmp/many-pdfs/$i.html -o /tmp/many-pdfs/$i.pdf
done
Uploaded those to the S3 bucket too. About to run start there.
I'm failing to test this - Textract is just too fast! By the time I had submitted document 225 it had already processed the first 216.
I failed to test this against my own data for reasons shown above, but I did figure out how to test it using a mock.
Extracted a TIL from the tests: https://til.simonwillison.net/pytest/mocking-boto
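The general shape of that mock (a hypothetical sketch, not the tests as written - see the TIL for the actual approach): unittest.mock's side_effect can make a fake Textract client fail with a limit error a couple of times before succeeding, which is enough to exercise the retry path without touching AWS.

```python
from unittest.mock import Mock

class FakeLimitError(Exception):
    """Stand-in for the real botocore limit-exceeded exception."""

# A mock Textract client: the first two submissions hit the
# concurrency limit, the third succeeds.
mock_textract = Mock()
mock_textract.start_document_text_detection.side_effect = [
    FakeLimitError(),
    FakeLimitError(),
    {"JobId": "job-123"},
]
```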
Thank you for building this incredibly useful tool! I've found a lot of use for it recently, but I think I may have pushed it a bit beyond the scale it's built for.
I ran the line you included in the demo (s3-ocr start s3-ocr-demo --all -a ocr.json) on an S3 bucket that contains ~2,500 PDFs. It started Textract jobs for the first 102 PDFs in the bucket, but then it raised an exception. While it's fairly clear what caused the exception (running too many jobs at once), there's no obvious way to avoid it - aside from, of course, OCRing fewer PDFs at once, but who wants to do that?!
Is there a way to tell s3-ocr to chunk the jobs so that jobs that exceed the limit are queued to wait until the other jobs finish?