jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Performance issues when integrating pdfplumber in Scrapy #303

Closed · niwreg-coder closed this issue 3 years ago

niwreg-coder commented 3 years ago

While crawling with Scrapy, I encounter some PDFs from which I want to extract the text. Before implementing pdfplumber in my scraper, I checked its performance by running a script similar to this:

import requests
import pdfplumber
import io

response = requests.get("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
reader = pdfplumber.open(io.BytesIO(response.content))
text = " ".join([" ".join(page.extract_text().split()) for page in  reader.pages])

While running the above snippet in the Spyder IDE, %timeit measured the execution time at under one second (also for PDFs of about 30 pages). Yet when I inserted the same code into the parse() method of my scraper (see code below), it took about 3 minutes to extract the text from the PDF. This slow performance could not be due to other PDFs being extracted at the same time, as I instructed it to scrape just this one PDF. What I noticed when running the scraper from the terminal is that pdfplumber (or actually pdfminer.psparser) was printing a lot to the terminal (several lines for every token). When I instructed Scrapy to reduce its logging (LOG_LEVEL='CRITICAL'), the performance visibly improved, but it still took more than 30 seconds instead of the roughly one second achieved in the IDE. Is there something I can do to reach that sub-second performance for decoding the file?

def parse(self, response):
    reader = pdfplumber.open(io.BytesIO(response.body))
    text = " ".join(" ".join((page.extract_text() or "").split()) for page in reader.pages)
    yield {"text": text}
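For reference, reducing Scrapy's overall log level as mentioned above can be done through a spider's custom_settings; a minimal sketch (the spider name here is illustrative):

import scrapy

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"
    # Raise Scrapy's global log level so per-token DEBUG output is suppressed
    custom_settings = {"LOG_LEVEL": "CRITICAL"}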

jsvine commented 3 years ago

Hi @niwreg-coder, and thanks for your interest in this library. I don't know much about Scrapy, and don't see any code here that would let me reproduce the issue, but it sounds like Scrapy might be setting the log level for all modules. Perhaps explicitly changing the log level for pdfminer would help, via https://github.com/jsvine/pdfplumber/issues/251#issuecomment-690157905:

import logging
logging.getLogger("pdfminer").setLevel(logging.WARNING)

Let me know if that fixes the Scrapy problem.
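For what it's worth, here is an untested sketch of how that might fit together in a Scrapy spider (the class name is illustrative; the URL is the test file from your first snippet):

import io
import logging

import pdfplumber
import scrapy

# Silence pdfminer's per-token DEBUG output once, at import time
logging.getLogger("pdfminer").setLevel(logging.WARNING)

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"
    start_urls = ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]

    def parse(self, response):
        # Open the downloaded PDF from memory and normalize whitespace per page
        with pdfplumber.open(io.BytesIO(response.body)) as pdf:
            text = " ".join(
                " ".join((page.extract_text() or "").split()) for page in pdf.pages
            )
        yield {"text": text}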

If that doesn't help, please attach a code sample that can reproduce the issue you're running into.