Cross project bucket access with batch_document_process

googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.

https://cloud.google.com/document-ai/docs/toolbox

Apache License 2.0

Cross project bucket access with batch_document_process #296

Status: Closed (nittonfemton closed this issue 3 months ago)

nittonfemton commented 4 months ago

I have all my processors in one project and need to run them on documents in a bucket that belongs to another project. I have granted "storage admin" to the service role and verified access by downloading files from a script using my DocumentAI service role.

However, when I run batch_process_documents() on these files, it always returns an error: Error: 400 Failed to process all documents. 3: Failed to process all documents.

Any help appreciated.

nittonfemton commented 4 months ago

Actually no, cross-project access does not seem to be the problem. gs://xxx/7mOLZo4BPUtxfx7Yx6gD-48190d5322acc68181b84fec05194c4a.pdf does not work, but gs://xxx/1.pdf works. So I can hopefully resolve this with some encoding.

dizcology commented 3 months ago

@nittonfemton Are there additional details in the error? There might be an "error detail" or "error info" field.

nittonfemton commented 3 months ago

Well, I've picked up on this and tried some more.

Filename and length are not related to this at all, I think. Batch processing worked fine with the original filenames, but only on the bucket in the same project as the processors.

Processing a single document with client.process_document() works fine (to my surprise) on the cross-project bucket. Batch processing on the same single document did not work, however, when the document was on the cross-project bucket.

The code I use for error reporting is this:

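    from google.api_core.exceptions import GoogleAPICallError

    # Start the long-running batch operation.
    operation = doc_client.batch_process_documents(request)
    try:
        # Block until the operation finishes or the timeout expires.
        print(operation.result(timeout=60))
    except GoogleAPICallError as e:
        # API errors carry a message and an HTTP status code.
        print(e.message)
        print(e.code)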

Results in: Failed to process all documents. 400

Thank you for taking the time.
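
Editor's sketch, following up on the question about error details: the top-level message only says that the batch failed, while the operation metadata normally carries one status per input document. The helper below is a minimal sketch, assuming the metadata has the shape of `documentai.BatchProcessMetadata` (an `individual_process_statuses` list whose items expose `input_gcs_source` and a `status` with `code` and `message`). The function name `failed_documents` and the stand-in values are hypothetical and are not taken from the thread.

```python
from types import SimpleNamespace


def failed_documents(metadata):
    """Return (input_uri, code, message) for each document that did not succeed.

    Assumption: `metadata` looks like documentai.BatchProcessMetadata, i.e. it
    has `individual_process_statuses`, and each item has `input_gcs_source`
    plus a `status` with `code` (0 means OK) and `message`.
    """
    failures = []
    for item in metadata.individual_process_statuses:
        if item.status.code != 0:
            failures.append(
                (item.input_gcs_source, item.status.code, item.status.message)
            )
    return failures


# Stand-in metadata so the sketch runs without credentials (made-up values).
example = SimpleNamespace(
    individual_process_statuses=[
        SimpleNamespace(
            input_gcs_source="gs://xxx/1.pdf",
            status=SimpleNamespace(code=0, message=""),
        ),
        SimpleNamespace(
            input_gcs_source="gs://xxx/2.pdf",
            status=SimpleNamespace(code=7, message="made-up failure text"),
        ),
    ]
)

for uri, code, message in failed_documents(example):
    print(f"{uri}: {code} {message}")
```

With a real operation, the same helper could be called as `failed_documents(operation.metadata)` inside the `except` branch, which may surface a more specific per-document message than "Failed to process all documents."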

nittonfemton commented 3 months ago

Feeling just a little bit stupid now that I asked you about filenames and whatnot. Of course it was a permission issue.

For anyone else stumbling into this, check your first bucket for the principal that has the "DocumentAI Core Service Agent" role. Copy the principal and add it with the same role to the cross-project bucket.
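
Editor's sketch of the fix described above, done from Python instead of the console. It is a minimal sketch under stated assumptions: it uses the `google-cloud-storage` client, the helper names `add_member` and `grant_on_bucket` are hypothetical, and the role ID `roles/documentaicore.serviceAgent` is assumed to correspond to the "DocumentAI Core Service Agent" display name, so verify it in your IAM console. The member string is a placeholder for the principal copied from the first bucket.

```python
def add_member(bindings, role, member):
    """Add `member` to the unconditional binding for `role`, creating it if needed."""
    for binding in bindings:
        if binding["role"] == role and not binding.get("condition"):
            binding["members"] = set(binding["members"]) | {member}
            return bindings
    bindings.append({"role": role, "members": {member}})
    return bindings


def grant_on_bucket(bucket_name, role, member):
    """Apply the binding to a bucket. Needs google-cloud-storage and credentials."""
    # Imported lazily so add_member can be run without the library installed.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    add_member(policy.bindings, role, member)
    bucket.set_iam_policy(policy)


# Assumed role ID and placeholder principal; replace both with your own values.
ROLE = "roles/documentaicore.serviceAgent"
MEMBER = "serviceAccount:PRINCIPAL_COPIED_FROM_FIRST_BUCKET"

# Local dry run on a plain list, no network access involved.
print(add_member([], ROLE, MEMBER))

# Real call, left commented out because it changes bucket permissions:
# grant_on_bucket("cross-project-bucket-name", ROLE, MEMBER)
```

Keeping the binding logic in a separate pure function makes it possible to check what would be written before touching the bucket's policy.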