googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

Refactor: Explore replacing PikePDF with PyMuPDF for efficiency #252

Open holtskinner opened 8 months ago

holtskinner commented 8 months ago

Explore switching the PDF splitter from PikePDF to PyMuPDF.

See whether efficiency or code readability improves.

https://pymupdf.readthedocs.io/en/latest/about.html

holtskinner commented 8 months ago

Running this code for testing:

import timeit

from google.cloud.documentai_toolbox import document

document_json_path = "documentai_SampleDocuments_PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR_pretrained-procurement-splitter-v1.2-2022-08-19_output.json"
document_pdf_path = "documentai_SampleDocuments_PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR_procurement_multi_document.pdf"

doc = document.Document.from_document_path(document_json_path)

output_path = "/"

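# Test the PikePDF function.
# timeit.timeit returns the total time for all 10 calls, not a per-call average.
pikepdf_time = timeit.timeit(lambda: doc.split_pdf(document_pdf_path, output_path), number=10)

# Test the PyMuPDF function (same PDF, same output path, same number of calls).
mupdf_time = timeit.timeit(lambda: doc.split_pdf_mupdf(document_pdf_path, output_path), number=10)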

print(f"PikePDF Time: {pikepdf_time} seconds")
print(f"PyMuPDF Time: {mupdf_time} seconds")

print(f"difference is {pikepdf_time - mupdf_time} seconds")

Got this result:

python pymupdf_test.py
PikePDF Time: 0.0616944160001367 seconds
PyMuPDF Time: 0.06151633399986167 seconds
difference is 0.00017808200027502608 seconds
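
To put those totals in perspective, here is a minimal sketch that converts them to per-call times and a relative difference. The two totals are copied from the output above; the helper names (`per_call_ms`, `relative_diff_pct`, `best_of`) are hypothetical and not part of the toolbox. `best_of` is only a suggestion for repeating the measurement with `timeit.repeat`, on the assumption that a single pass of 10 calls may be noisy.

```python
import timeit

RUNS = 10  # matches number=10 in the test script above

# Totals copied from the output above (seconds for all 10 calls).
pikepdf_total = 0.0616944160001367
mupdf_total = 0.06151633399986167


def per_call_ms(total_seconds, runs=RUNS):
    """Convert a timeit total (seconds) into milliseconds per call."""
    return total_seconds / runs * 1000


def relative_diff_pct(baseline, candidate):
    """Percent by which candidate is faster than baseline (negative if slower)."""
    return (baseline - candidate) / baseline * 100


def best_of(func, repeat=5, number=RUNS):
    """Hypothetical helper: smallest total over several repeats, to damp noise."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


print(f"PikePDF: {per_call_ms(pikepdf_total):.2f} ms per call")
print(f"PyMuPDF: {per_call_ms(mupdf_total):.2f} ms per call")
print(f"PyMuPDF faster by {relative_diff_pct(pikepdf_total, mupdf_total):.2f}%")
```

With the numbers above this works out to roughly 6.17 ms versus 6.15 ms per call, a gap of about 0.29%. A gap that small could plausibly be run-to-run noise, so passing each `doc.split_pdf...` lambda through something like `best_of` (or testing a larger PDF) would give a firmer comparison.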