shedloadofcode / shedloadofcode-comments

Comments repository for shedloadofcode.com - for use with Utterances

blog/searching-for-text-in-pdfs-at-increasing-scale #7

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

Searching for text in PDFs at increasing scale | Shedload Of Code

Explore multiple approaches to extract and search text from PDFs at increasing scale using Python with PyPDF2, C# with iTextSharp alongside C++ and pdftotext.

https://www.shedloadofcode.com/blog/searching-for-text-in-pdfs-at-increasing-scale/

ArasOrhan commented 1 year ago

Great article, many thanks, it helped me a lot.

Two things I noticed. First, PyPDF2 has fully deprecated some of the functions that you used. This is unfortunate: this post isn't even two years old, and it no longer runs as is. Second, with PyPDF2 I wasn't able to get anywhere close to the runtimes that you reported, so the configuration is probably very important. It took me 370 seconds to search through 60 PDFs of 50-60 pages each. This was on a Windows notebook with 32 GB of RAM and an AMD Ryzen 7 Pro 4750U.

Thanks again.

shedloadofcode commented 1 year ago

You're welcome, ArasOrhan, glad it helped you, and thanks for pointing this out! You're right, some of these functions are now deprecated. You can find the full migration guide in the documentation. From what I can see, the main changes affecting this article are:

pdf_reader = PyPDF2.PdfReader(filepath)  # Formerly PyPDF2.PdfFileReader(filepath)
page = pdf_reader.pages[pageNumber]  # Formerly pdf_reader.getPage(pageNumber)
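
As a rough sketch of how the renamed calls fit together, assuming a PyPDF2 release that provides `PdfReader`, `.pages` and `page.extract_text()`: the helper names `extract_page_texts` and `find_matching_pages` are hypothetical and only for illustration, not the code from the article.

```python
from typing import List


def extract_page_texts(filepath: str) -> List[str]:
    """Return the extracted text of each page, in page order."""
    # Imported here so the search helper below also works without PyPDF2 installed.
    import PyPDF2

    pdf_reader = PyPDF2.PdfReader(filepath)
    # extract_text() can come back empty for scanned or image-only pages.
    return [page.extract_text() or "" for page in pdf_reader.pages]


def find_matching_pages(page_texts: List[str], term: str) -> List[int]:
    """Return 1-based page numbers whose text contains term, ignoring case."""
    needle = term.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]
```

For example, `find_matching_pages(extract_page_texts("report.pdf"), "invoice")` would list the pages mentioning "invoice", where `report.pdf` is a placeholder path. Keeping extraction separate from matching means the slow step, reading the PDF, only has to run once per file even when searching for several terms.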

I will update the article with these changes very soon. Also, I am working on a new article related to text searches within PDFs, which you may be interested in 😄 I will update this comment once it's released.