Great article, many thanks, it helped me a lot.
Two things I noticed. First, PyPDF2 has fully deprecated some of the functions that you used. This is unfortunate with PyPDF2: this post isn't even two years old, and it no longer runs as is. Second, with PyPDF2 I wasn't able to get anywhere close to the runtimes that you reported, so the configuration is probably very important. I was able to search through 60 PDFs, each with 50-60 pages, in 370 seconds. This was on a Windows notebook with 32 GB RAM and an AMD Ryzen 7 Pro 4750U.
Thanks again.
You're welcome, ArasOrhan, glad it helped you, and thanks for pointing this out! You're right, some of these functions are now deprecated. You can find the full migration guide in the documentation. From what I can see in this article, the main changes are:
pdf_reader = PyPDF2.PdfReader(filepath) # Formerly PyPDF2.PdfFileReader(filepath)
page = pdf_reader.pages[pageNumber] # Formerly pdf_reader.getPage(pageNumber)
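As a rough sketch of how those renames fit together in a page-by-page search, assuming PyPDF2 3.x is installed: the helper names `extract_page_texts` and `find_pages_with_term` are made up for illustration and are not taken from the article. The sketch also uses `page.extract_text()`, which replaces the older `extractText()`.

```python
import re


def find_pages_with_term(page_texts, term):
    """Return 1-based page numbers whose text contains `term` (case-insensitive)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    # extract_text() can return an empty result for image-only pages,
    # so fall back to an empty string before searching.
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if pattern.search(text or "")
    ]


def extract_page_texts(filepath):
    """Read every page of a PDF and return a list of page texts."""
    # Imported here so the search helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # Formerly PdfFileReader

    pdf_reader = PdfReader(filepath)
    # pdf_reader.pages replaces getNumPages() / getPage(i);
    # page.extract_text() replaces page.extractText().
    return [page.extract_text() for page in pdf_reader.pages]
```

With a real file this might be called as `find_pages_with_term(extract_page_texts("report.pdf"), "invoice")`, where `report.pdf` is a placeholder path.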
I will update the article with these changes very soon. Also, I am working on a new article about text searches within PDFs, which you may be interested in 😄 I will update this comment once it's released.
Searching for text in PDFs at increasing scale | Shedload Of Code
Explore multiple approaches to extract and search text from PDFs at increasing scale using Python with PyPDF2, C# with iTextSharp alongside C++ and pdftotext.
https://www.shedloadofcode.com/blog/searching-for-text-in-pdfs-at-increasing-scale/