InvalidArgument: 400 Unsupported input file format.

reema93jain commented 6 months ago

Hi Team,

I am trying to extract text with Document AI from a PDF file stored in a Google Cloud Storage bucket.

I am able to extract the text when I process the PDF in the Google Cloud console. However, when I run the Python code below, I get the error 'InvalidArgument: 400 Unsupported input file format'.

Code below:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore
from google.cloud import documentai_v1 as documentai
from google.cloud import storage

def quickstart(
    project_id: str,
    location: str,
    gcs_output_uri: str,
    processor_display_name: str = "My Processor",
):

    # Create a processor client
    client = documentai.DocumentProcessorServiceClient()

    parent = client.common_location_path(project_id, location)

    # Create a Processor
    processor = client.create_processor(
        parent=parent,
        processor=documentai.Processor(
            type_="OCR_PROCESSOR",  # Refer to https://cloud.google.com/document-ai/docs/create-processor for how to get available processor types
            display_name=processor_display_name,
        ),
    )

    # Print the processor information
    print(f"Processor Name: {processor.name}")

    # Load binary data
    raw_document = documentai.types.RawDocument(
        content=gcs_output_uri,
        mime_type="application/pdf",  
    )

    # Configure the process request
    request = documentai.types.ProcessRequest(name=processor.name, raw_document=raw_document)

    result = client.process_document(request=request)

    document = result.document

    # Read the text recognition output from the processor
    print("The document contains the following text:")
    print(document.text)

# **Calling the function:**
# GCS URI of the document
gcs_output_uri='gs://pdf_parser_rj/PDF/Winnie_the_Pooh_3_Pages.pdf'

# Calling function. I replaced below with actual project_id
data=quickstart('project_id','us',gcs_output_uri,"process_test")
print(data)

Error: InvalidArgument: 400 Unsupported input file format.

Can you please advise how I can resolve this issue?

Thank you, Reema Jain

reema93jain commented 6 months ago

Hi,

Could someone please advise on the above request?

Thank you, Reema Jain

parthea commented 6 months ago

Hi @reema93jain,

For ProcessRequest, please set the gcs_document argument instead of the raw_document argument.

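From the docs, `raw_document` is expected to contain the document's raw bytes (for example, read from a file on local storage), whereas `gcs_document` refers to a document stored in Google Cloud Storage. In your code the `gs://` URI string itself is passed as `content`, so the service most likely receives the URI text rather than PDF bytes, which would explain the "Unsupported input file format" error. https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest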

See sample code here if you'd like to use raw bytes from local storage instead of GCS.
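
To make the difference between the two arguments concrete, here is a minimal sketch that picks the right one based on the input. The helper names `is_gcs_uri` and `build_process_request` are hypothetical (they are not part of the library), and the sketch assumes `google-cloud-documentai` is installed and that the processor already exists.

```python
from pathlib import Path


def is_gcs_uri(source: str) -> bool:
    """Return True if `source` looks like a Cloud Storage URI (gs://bucket/object)."""
    return source.startswith("gs://")


def build_process_request(
    processor_name: str, source: str, mime_type: str = "application/pdf"
):
    """Build a ProcessRequest from either a gs:// URI or a local file path.

    Hypothetical helper for illustration only.
    """
    # Imported here so that is_gcs_uri() stays usable without the client library.
    from google.cloud import documentai_v1 as documentai

    if is_gcs_uri(source):
        # The service fetches the file from Cloud Storage itself.
        return documentai.ProcessRequest(
            name=processor_name,
            gcs_document=documentai.GcsDocument(gcs_uri=source, mime_type=mime_type),
        )

    # Local file: send the actual file bytes, never the path or URI string.
    return documentai.ProcessRequest(
        name=processor_name,
        raw_document=documentai.RawDocument(
            content=Path(source).read_bytes(), mime_type=mime_type
        ),
    )
```

The resulting request can then be passed to `client.process_document(request=...)` in the same way as in the code in this thread.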

The following code, which sets the `gcs_document` argument of `ProcessRequest` instead of the `raw_document` argument, worked for me locally with a document from GCS:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai

def quickstart(
    project_id: str,
    location: str,
    gcs_uri: str,
    processor_display_name: str = "My Processor",
):

    # Create a processor client
    client = documentai.DocumentProcessorServiceClient()

    parent = client.common_location_path(project_id, location)

    # Create a Processor
    processor = client.create_processor(
        parent=parent,
        processor=documentai.Processor(
            type_="OCR_PROCESSOR",  # Refer to https://cloud.google.com/document-ai/docs/create-processor for how to get available processor types
            display_name=processor_display_name,
        ),
    )

    # Print the processor information
    print(f"Processor Name: {processor.name}")

    # Reference the document in Cloud Storage by URI instead of sending raw bytes
    gcs_document = documentai.GcsDocument(gcs_uri=gcs_uri, mime_type="application/pdf")

    # Configure the process request
    request = documentai.ProcessRequest(name=processor.name, gcs_document=gcs_document)

    result = client.process_document(request=request)

    document = result.document

    # Read the text recognition output from the processor
    print("The document contains the following text:")
    print(document.text)

# Calling the function
# GCS URI of the document
gcs_uri = 'gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh_3_Pages.pdf'

# quickstart() prints the extracted text itself and returns None,
# so there is no return value to print.
# Replace 'project_id' with your actual project ID.
quickstart('project_id', 'us', gcs_uri, "process_test9")

I'm going to close this issue but please feel free to open a new issue if you still encounter an error.

reema93jain commented 6 months ago

Thank you, Parthea, for the response! I am now able to parse the text content. However, the table contents are not printed correctly and are not in tabular format. Is there any way I can parse tables in tabular format along with the other text in the PDF file?

googleapis / google-cloud-python

InvalidArgument: 400 Unsupported input file format. #12142