jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Performance issues when integrating pdfplumber in Scrapy #303

Closed · niwreg-coder closed this issue 3 years ago

niwreg-coder commented 3 years ago

While crawling with Scrapy, I encounter some PDFs from which I want to extract the text. Before implementing pdfplumber in my scraper, I checked its performance by running a script similar to this:

import requests
import pdfplumber
import io

response = requests.get("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
reader = pdfplumber.open(io.BytesIO(response.content))
text = " ".join([" ".join(page.extract_text().split()) for page in  reader.pages])

While running the above snippet in the Spyder IDE, %timeit measured the execution time at under one second (also for PDFs of about 30 pages). Yet when I inserted the same code into the parse() method of my scraper (see code below), it took about 3 minutes to extract the text from the PDF. This slow performance could not be due to other PDFs being extracted at the same time, as I instructed it to scrape just this one PDF. What I noticed when running the scraper from the terminal is that pdfplumber (or actually pdfminer.psparser) was printing a lot to the terminal (several lines for every token). When I instructed Scrapy to reduce its logging (LOG_LEVEL='CRITICAL'), the performance visibly improved, but it still took more than 30 seconds instead of the roughly one second achieved in the IDE. Is there something I can do to reach that sub-second performance for decoding the file?

def parse(self, response):
    reader = pdfplumber.open(io.BytesIO(response.body))
    text = " ".join(" ".join((page.extract_text() or "").split()) for page in reader.pages)
    yield {"text": text}
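For reference, reducing Scrapy's overall log level as mentioned above can be done through a spider's custom_settings; a minimal sketch (the spider name here is illustrative):

import scrapy

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"
    # Raise Scrapy's global log level so per-token DEBUG output is suppressed
    custom_settings = {"LOG_LEVEL": "CRITICAL"}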

jsvine commented 3 years ago

Hi @niwreg-coder, and thanks for your interest in this library. I don't know much about Scrapy, and don't see any code here that would let me reproduce the issue, but it sounds like Scrapy might be setting the log level for all modules. Perhaps explicitly changing the log level for pdfminer would help, via https://github.com/jsvine/pdfplumber/issues/251#issuecomment-690157905:

import logging
logging.getLogger("pdfminer").setLevel(logging.WARNING)

Let me know if that fixes the Scrapy problem.
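For what it's worth, here is an untested sketch of how that might fit together in a Scrapy spider (the class name is illustrative; the URL is the test file from your first snippet):

import io
import logging

import pdfplumber
import scrapy

# Silence pdfminer's per-token DEBUG output once, at import time
logging.getLogger("pdfminer").setLevel(logging.WARNING)

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"
    start_urls = ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]

    def parse(self, response):
        # Open the downloaded PDF from memory and normalize whitespace per page
        with pdfplumber.open(io.BytesIO(response.body)) as pdf:
            text = " ".join(
                " ".join((page.extract_text() or "").split()) for page in pdf.pages
            )
        yield {"text": text}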

If that doesn't help, please attach a code sample that can reproduce the issue you're running into.