Open reema93jain opened 6 months ago
Hi,
Can someone please advise on the above request?
Thank you, Reema Jain
Hi @reema93jain,
For `ProcessRequest`, please set the `gcs_document` argument instead of the `raw_document` argument.
From the docs, `raw_document` is expected to hold the document's raw bytes (for example, read from a file on local storage), whereas `gcs_document` refers to a document stored on Google Cloud Storage.
https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest
See sample code here if you'd like to use raw bytes from local storage instead of GCS.
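As a rough illustration of the distinction above, here is a minimal sketch that picks the request field from the document's location. The helper name `build_request_kwargs` and the `MIME_TYPES` table are hypothetical (they are not part of the Document AI library), the MIME table is only an illustrative subset, and the sketch assumes that `ProcessRequest` accepts plain dicts for its nested message fields.

```python
from pathlib import Path, PurePosixPath

# Illustrative subset only; check the Document AI docs for supported formats.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tiff": "image/tiff",
}


def build_request_kwargs(processor_name: str, source: str) -> dict:
    """Hypothetical helper: build keyword arguments for a ProcessRequest."""
    suffix = PurePosixPath(source).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"Cannot infer a MIME type from {source!r}")
    mime_type = MIME_TYPES[suffix]

    if source.startswith("gs://"):
        # The document lives in Cloud Storage: reference it by URI.
        gcs_document = {"gcs_uri": source, "mime_type": mime_type}
        return {"name": processor_name, "gcs_document": gcs_document}

    # The document is a local file: send its bytes inline.
    raw_document = {"content": Path(source).read_bytes(), "mime_type": mime_type}
    return {"name": processor_name, "raw_document": raw_document}


# Assumed usage (requires google-cloud-documentai and credentials):
#   from google.cloud import documentai_v1 as documentai
#   kwargs = build_request_kwargs(processor.name, "gs://my-bucket/file.pdf")
#   request = documentai.ProcessRequest(**kwargs)
```

The point of the sketch is only that a `gs://` URI goes into `gcs_document`, while `raw_document` carries the file's bytes themselves.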
The following code worked for me locally using a document from GCS using the `gcs_document` argument:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai


# This sets the gcs_document argument of ProcessRequest instead of the
# raw_document argument.
def quickstart(
    project_id: str,
    location: str,
    gcs_uri: str,
    processor_display_name: str = "My Processor",
):
    # Create a processor client that targets the regional endpoint for `location`
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=opts)
    parent = client.common_location_path(project_id, location)

    # Create a Processor
    # Refer to https://cloud.google.com/document-ai/docs/create-processor
    # for how to get available processor types
    processor = client.create_processor(
        parent=parent,
        processor=documentai.Processor(
            type_="OCR_PROCESSOR",
            display_name=processor_display_name,
        ),
    )

    # Print the processor information
    print(f"Processor Name: {processor.name}")

    # Reference the document on Cloud Storage by its URI
    gcs_document = documentai.GcsDocument(gcs_uri=gcs_uri, mime_type="application/pdf")

    # Configure the process request
    request = documentai.ProcessRequest(name=processor.name, gcs_document=gcs_document)
    result = client.process_document(request=request)
    document = result.document

    # Read the text recognition output from the processor
    print("The document contains the following text:")
    print(document.text)

    # Return the document so that the caller receives it instead of None
    return document


# Calling the function:
# GCS URI of the document
gcs_uri = "gs://cloud-samples-data/documentai/codelabs/ocr/Winnie_the_Pooh_3_Pages.pdf"

# Replace "project_id" with your actual project ID.
# quickstart() already prints the text; `data` holds the returned Document.
data = quickstart("project_id", "us", gcs_uri, "process_test9")
I'm going to close this issue but please feel free to open a new issue if you still encounter an error.
Thank you, Parthea, for the response! I am able to parse the text content. However, the table contents are not printed correctly and are not in tabular format. Is there any way to parse tables in tabular format along with the other text in the PDF file?
Hi Team,
I am trying to extract text using Document AI from a PDF file stored in a Google Cloud Storage bucket.
I am able to extract the text when I process the PDF in the Google Cloud console. However, when I run the Python code below, I get the error 'InvalidArgument: 400 Unsupported input file format'.
Code below:
Error:
![image](https://github.com/googleapis/google-cloud-python/assets/113460610/89a8eea3-016c-4372-ad72-9adaf9aca449)
Can you please advise how I can resolve this issue?
Thank you Reema Jain