Closed cidrugHug8 closed 1 year ago
Hi there, I'm sorry for the PDF loading issue you encountered. Could you please confirm whether you used a protected PDF file? This information will help me better understand the problem and provide you with the appropriate solution. Thank you!
UPDATE: I tested with a password-protected PDF file, and it failed to process. I will figure out how to resolve this issue.
Thank you for your quick reply. The PDF file is not encrypted. The pdfinfo output is as follows.
$ pdfinfo Engineer\ Reference.pdf
Title: My Document
Subject:
Keywords:
Author: yamada
Producer: madbuild
CreationDate: Wed Jun 22 11:02:51 2022 JST
ModDate: Wed Jun 22 11:02:51 2022 JST
Custom Metadata: no
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 175
Encrypted: no
Page size: 595.28 x 841.89 pts (A4)
Page rot: 0
File size: 25883386 bytes
Optimized: no
PDF version: 1.4
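For anyone who wants to run the same check from a script, here is a minimal sketch that parses `pdfinfo` text output and reads the `Encrypted` field. The helper names `parse_pdfinfo` and `is_encrypted` are made up for illustration, and treating any value that starts with "yes" as protected is an assumption about how pdfinfo reports encrypted files.

```python
def parse_pdfinfo(output: str) -> dict[str, str]:
    """Turn `pdfinfo` text output into a {field: value} dict (hypothetical helper)."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only; values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_encrypted(info: dict[str, str]) -> bool:
    # Assumption: unprotected files report "no"; a value starting with "yes" means protected.
    return info.get("Encrypted", "no").lower().startswith("yes")


# A few fields copied from the output above.
sample = """\
Title: My Document
CreationDate: Wed Jun 22 11:02:51 2022 JST
Pages: 175
Encrypted: no
File size: 25883386 bytes
"""

info = parse_pdfinfo(sample)
print(info["Pages"], is_encrypted(info))
```

In practice the text would come from running `pdfinfo` on the file rather than from a hard-coded string.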
Same here, I get this error: PDF.js v2.9.359 (build: e667c8cbc) Message: Invalid PDF structure.
It doesn't happen with every PDF, only when the PDF is fairly large.
Yes, I am currently using pdf-parse as a document loader, but it cannot handle large files. So, I am trying to set up a custom loader that will split large PDFs into smaller ones and then feed them to pdf-parse.
I have the same problem, and splitting the large file into multiple 25-page sections works.
So for anyone who has the same issue, for now you could try PyPDF2 in Python to split a PDF into separate files:
import PyPDF2

pdf_path = "path/to/your/pdf.pdf"
pages_per_file = 25

with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    total_pages = len(reader.pages)

    file_number = 1
    page_count = 0
    writer = PyPDF2.PdfWriter()

    for page_number in range(total_pages):
        writer.add_page(reader.pages[page_number])
        page_count += 1

        if page_count == pages_per_file or page_number == total_pages - 1:
            output_filename = f"output_file_{file_number}.pdf"
            with open(output_filename, "wb") as output_file:
                writer.write(output_file)

            # Reset the page count and create a new writer for the next file
            page_count = 0
            file_number += 1
            writer = PyPDF2.PdfWriter()
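To see in advance which pages end up in which output file, the chunk boundaries can be computed without touching the PDF at all. This is only a sketch of the arithmetic behind the loop above; `chunk_page_ranges` is a hypothetical helper, not part of PyPDF2, and the 175-page count is taken from the pdfinfo output earlier in this thread.

```python
def chunk_page_ranges(total_pages: int, pages_per_file: int) -> list[tuple[int, int]]:
    """Return 0-based, inclusive (first_page, last_page) pairs, one per output file."""
    if pages_per_file <= 0:
        raise ValueError("pages_per_file must be positive")
    ranges = []
    for start in range(0, total_pages, pages_per_file):
        # The last chunk may be shorter than pages_per_file.
        end = min(start + pages_per_file, total_pages) - 1
        ranges.append((start, end))
    return ranges


# 175 pages split into 25-page sections.
ranges = chunk_page_ranges(175, 25)
print(len(ranges), ranges[0], ranges[-1])
```

Each pair maps to one `output_file_N.pdf`, so a document whose page count is not a multiple of 25 simply gets a shorter final file.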
Hey guys, I have created a custom PDF loader in v0.0.12. I hope it resolves the issue with large PDF files. Please try the latest version and let me know.
Note that the PDF loader still can't load protected PDF files.
I'm able to upload PDFs with thousands of pages with v0.0.12. That fixed it for me.
Closing this issue based on the comment above. Feel free to reopen if the problem still exists. Thank you.
Hi, The following error was output to the docker log.