simonw / s3-ocr

Tools for running OCR against files stored in S3
Apache License 2.0

LimitExceededException when calling the StartDocumentTextDetection operation #21

Closed ethanscorey closed 2 years ago

ethanscorey commented 2 years ago

Thank you for building this incredibly useful tool! I've found a lot of use for it recently, but I think I may have pushed it a bit beyond the scale it's built for.

I ran the line you included in the demo (s3-ocr start s3-ocr-demo --all -a ocr.json) on an S3 bucket that contains ~2,500 PDFs. It started Textract jobs for the first 102 PDFs in the bucket, but then it raised the following exception:

Traceback (most recent call last):
  File "/home/ethan/miniconda3/envs/nj_deaths/bin/s3-ocr", line 8, in <module>
    sys.exit(cli())
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/s3_ocr/cli.py", line 137, in start
    response = textract.start_document_text_detection(
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/botocore/client.py", line 915, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.LimitExceededException: An error occurred (LimitExceededException) when calling the StartDocumentTextDetection operation: Open jobs exceed maximum concurrent job limit

While it's fairly clear what caused the exception (running too many jobs at once), there's no obvious way to avoid it—aside from, of course, OCRing fewer PDFs at once, but who wants to do that?!

Is there a way to tell s3-ocr to chunk the jobs so that jobs that exceed the limit are queued to wait until the other jobs finish?

simonw commented 2 years ago

Oh this is an interesting one!

It looks like the default limit for that is listed here: https://docs.aws.amazon.com/general/latest/gr/textract.html

Maximum number of asynchronous jobs per account that can simultaneously exist:

US East (N. Virginia): 600
US West (Oregon): 600
All other regions: 100

There's an easy fix and a harder fix.

The easier fix is to submit a maximum of 600 (or 100 depending on the region) at a time. You'll then have to call s3-ocr start ... again a second or third time to queue up the remaining documents, once those initial documents have been completed.

This is easier to build, but it kind of sucks.

The alternative solution is for s3-ocr start to grow the ability to queue up the maximum allowed, then stay running and poll Textract to see if any have been completed. As they are completed it can queue up more, until everything has been queued.

This is a lot trickier to build but would probably be a better option for most users.
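
Roughly, that second option could work something like this (an illustrative sketch with made-up names, not the actual s3-ocr code):

import time

import boto3

textract = boto3.client("textract")

# Stay below the concurrent job limit for the region (600 in us-east-1 and
# us-west-2, 100 everywhere else)
MAX_CONCURRENT = 100


def start_all_with_polling(bucket, keys):
    """Submit one Textract job per key, pausing when the limit is reached."""
    in_flight = set()
    for key in keys:
        # At the limit: poll the in-flight jobs until at least one finishes
        while len(in_flight) >= MAX_CONCURRENT:
            for job_id in list(in_flight):
                response = textract.get_document_text_detection(JobId=job_id)
                if response["JobStatus"] in ("SUCCEEDED", "FAILED", "PARTIAL_SUCCESS"):
                    in_flight.discard(job_id)
            if len(in_flight) >= MAX_CONCURRENT:
                time.sleep(5)
        started = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        in_flight.add(started["JobId"])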

simonw commented 2 years ago

Here's a simpler way to implement it: items are queued up one at a time here: https://github.com/simonw/s3-ocr/blob/ba47d9af8235aa6031493901c6f995a311f0f139/s3_ocr/cli.py#L156

I could catch that error here and add a retry after a sleep.

simonw commented 2 years ago

I'm going to have the default behavior be for it to retry if it sees this error, with a sleep that increases from 1s to 2s to 4s to 8s but then sticks at 8s.

I'll add an option to exit rather than retry - I'll call that --no-retry.
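
Something like this retry loop (a rough sketch, not the final implementation; the no_retry flag here stands in for the proposed --no-retry option):

import time

import boto3
from botocore.exceptions import ClientError

textract = boto3.client("textract")


def start_with_retry(bucket, key, no_retry=False):
    """Start a Textract job, retrying on LimitExceededException.

    The sleep between attempts doubles from 1s up to 8s, then stays at 8s.
    """
    sleep = 1
    while True:
        try:
            return textract.start_document_text_detection(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
            )
        except ClientError as ex:
            if no_retry or ex.response["Error"]["Code"] != "LimitExceededException":
                raise
            time.sleep(sleep)
            sleep = min(sleep * 2, 8)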

simonw commented 2 years ago

I need a bucket with 102 PDFs in it in an "other" region to test this out.

simonw commented 2 years ago

Running this:

# Generate 102 one-page PDFs by rendering numbered HTML files with shot-scraper
mkdir -p /tmp/many-pdfs
for i in {1..102}
do
    echo $i
    echo "<h1>$i</h1>" > /tmp/many-pdfs/$i.html
    shot-scraper pdf /tmp/many-pdfs/$i.html -o /tmp/many-pdfs/$i.pdf
done

simonw commented 2 years ago

Created a bucket with:

s3-credentials create s3-ocr-many-pdfs --create-bucket --bucket-region eu-west-2

I put it in eu-west-2 (London) because that region has Textract support but should also have the lower 100 concurrent job limit.

I used Transmit to upload all 102 of those generated PDFs.
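
As an aside, the same upload could be scripted with boto3 (a sketch, assuming the default credentials can write to that bucket):

from pathlib import Path

import boto3

s3 = boto3.client("s3")
# Upload every generated PDF into the root of the bucket
for pdf in sorted(Path("/tmp/many-pdfs").glob("*.pdf")):
    s3.upload_file(str(pdf), "s3-ocr-many-pdfs", pdf.name)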

simonw commented 2 years ago

Got this error from s3-ocr start s3-ocr-many-pdfs --all:

botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentTextDetection operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

https://stackoverflow.com/a/64511389/6083 suggests that the fix is to pass the region here:

client = boto3.client('textract', region_name='us-east-2')
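
One way to generalize that (a sketch that looks up the bucket's region with GetBucketLocation rather than hard-coding it; not necessarily how s3-ocr handles it):

import boto3

s3 = boto3.client("s3")


def textract_client_for_bucket(bucket):
    """Create a Textract client in the same region as the bucket."""
    # GetBucketLocation reports None as the LocationConstraint for us-east-1
    region = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    return boto3.client("textract", region_name=region or "us-east-1")
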
simonw commented 2 years ago

I tested it against the bucket with 102 items in it... but it submitted them all just fine, because Textract executes fast enough that the first few had already been processed by the time numbers 101 and 102 were submitted.

simonw commented 2 years ago

Creating a full set of 1,000 test PDFs:

for i in {102..1000}
do
    echo $i
    echo "<h1>$i</h1>" > /tmp/many-pdfs/$i.html
    shot-scraper pdf /tmp/many-pdfs/$i.html -o /tmp/many-pdfs/$i.pdf
done

simonw commented 2 years ago

Uploaded those to the S3 bucket too. About to run start there.

simonw commented 2 years ago

I'm failing to test this - Textract is just too fast!

[screenshot]

That screenshot shows that by the time I had submitted document 225, Textract had already processed the first 216.

simonw commented 2 years ago

I failed to test this against my own data for reasons shown above, but I did figure out how to test it using a mock.

simonw commented 2 years ago

Extracted a TIL from the tests: https://til.simonwillison.net/pytest/mocking-boto
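
For reference, forcing the error under a mock can look roughly like this (a hypothetical sketch with unittest.mock, not the actual test from the repo):

from unittest.mock import patch

import boto3
from botocore.exceptions import ClientError

# A fake LimitExceededException, built the way botocore constructs them
limit_error = ClientError(
    {
        "Error": {
            "Code": "LimitExceededException",
            "Message": "Open jobs exceed maximum concurrent job limit",
        }
    },
    "StartDocumentTextDetection",
)

with patch("boto3.client") as mock_client:
    textract = boto3.client("textract")
    # Raise the limit error twice, then pretend the third attempt succeeds
    textract.start_document_text_detection.side_effect = [
        limit_error,
        limit_error,
        {"JobId": "fake-job-id"},
    ]
    # A test would now invoke s3-ocr start (for example via Click's CliRunner)
    # and assert that start_document_text_detection was called three times.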