holtskinner opened this issue 8 months ago (status: Open)
I ran the following code to compare the two implementations:
import timeit
from google.cloud.documentai_toolbox import document
document_json_path = "documentai_SampleDocuments_PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR_pretrained-procurement-splitter-v1.2-2022-08-19_output.json"
document_pdf_path = "documentai_SampleDocuments_PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR_procurement_multi_document.pdf"
doc = document.Document.from_document_path(document_json_path)
output_path = "/"
# Test the PikePDF function
pikepdf_time = timeit.timeit(lambda: doc.split_pdf(document_pdf_path, output_path), number=10)
# Test the PyMuPDF function
mupdf_time = timeit.timeit(lambda: doc.split_pdf_mupdf(document_pdf_path, output_path), number=10)
print(f"PikePDF Time: {pikepdf_time} seconds")
print(f"PyMuPDF Time: {mupdf_time} seconds")
print(f"difference is {pikepdf_time - mupdf_time} seconds")
This gave the following result:
python pymupdf_test.py
PikePDF Time: 0.0616944160001367 seconds
PyMuPDF Time: 0.06151633399986167 seconds
difference is 0.00017808200027502608 seconds
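To put those totals in perspective, each one covers 10 calls (`number=10`), so the per-call cost and the relative gap can be worked out directly. This is a small sketch that uses only the numbers printed above; the variable names are mine and are not part of the toolbox.

```python
# Totals reported above; each covers number=10 calls.
pikepdf_total = 0.0616944160001367
mupdf_total = 0.06151633399986167
runs = 10

# Average cost of a single split call, in milliseconds.
pikepdf_per_call_ms = pikepdf_total / runs * 1000
mupdf_per_call_ms = mupdf_total / runs * 1000

# Gap between the two, per call (microseconds) and relative to PikePDF (percent).
diff_per_call_us = (pikepdf_total - mupdf_total) / runs * 1e6
relative_diff_pct = (pikepdf_total - mupdf_total) / pikepdf_total * 100

print(f"PikePDF: {pikepdf_per_call_ms:.2f} ms per call")
print(f"PyMuPDF: {mupdf_per_call_ms:.2f} ms per call")
print(f"Gap: {diff_per_call_us:.1f} us per call ({relative_diff_pct:.2f}%)")
```

That works out to roughly 6.2 ms per call for both, with a gap of about 18 microseconds (around 0.3%). With only 10 runs, a gap that small may well be within run-to-run noise.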
Explore switching PDF Splitter from PikePDF to PyMuPDF
See whether efficiency or code readability improves
https://pymupdf.readthedocs.io/en/latest/about.html
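One way to make the efficiency comparison less sensitive to noise is to repeat each measurement several times and compare the best run, which is the approach the `timeit` documentation recommends. The sketch below assumes a hypothetical helper named `compare`; the two lambdas are stand-ins so the example runs on its own. In the real test they would be replaced with the calls to `doc.split_pdf(...)` and the experimental `doc.split_pdf_mupdf(...)`.

```python
import timeit
from typing import Callable, Dict


def compare(
    funcs: Dict[str, Callable[[], object]],
    number: int = 10,
    repeat: int = 5,
) -> Dict[str, float]:
    """Return the best observed per-call time (seconds) for each callable.

    Hypothetical helper for illustration; not part of documentai_toolbox.
    """
    results = {}
    for name, func in funcs.items():
        # timeit.repeat returns one total per repetition; the minimum is
        # the measurement least disturbed by other activity on the machine.
        totals = timeit.repeat(func, number=number, repeat=repeat)
        results[name] = min(totals) / number
    return results


# Stand-in workloads (assumptions); swap in the two splitter calls.
timings = compare(
    {
        "PikePDF": lambda: sum(range(1000)),
        "PyMuPDF": lambda: sum(range(1000)),
    }
)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.4f} ms per call (best of 5)")
```

Because both splitters write files to disk, I/O will be part of whatever is measured, so it is probably worth pointing `output_path` at a temporary directory when running this for real.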