jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

extract_words() slower when fewer extra_attrs are passed #484

Closed hadikoub closed 3 years ago

hadikoub commented 3 years ago

Discussed in https://github.com/jsvine/pdfplumber/discussions/483

Originally posted by **hadikoub** July 28, 2021

I'm trying to find bold and blank sections in a PDF file, so I was experimenting with the `extract_words()` function to group sections by font family.

I found a way to extract bold text by grouping sections by font name and size, then picking out the bold font family:

```
sections = page.extract_words(keep_blank_chars=True, extra_attrs=["fontname", "size"])
```

Using a similar approach, I grouped sections by size only, to find the blanks between them:

```
sections = page.extract_words(keep_blank_chars=True, extra_attrs=["size"])
```

The issue I ran into is a big gap in performance between the two calls, even though both run on the same page:

- **using extra_attrs=["fontname", "size"]**
  `sections = page.extract_words(keep_blank_chars=True, extra_attrs=["fontname", "size"])`
  **line execution time avg: 0.5 sec**
- **using extra_attrs=["size"]**
  `sections = page.extract_words(keep_blank_chars=True, extra_attrs=["size"])`
  **line execution time avg: 5.2 sec**

I also noticed that adding more attributes, which makes the response larger (for example `"adv"`), reduces the execution time even further, to 22.7 ms per page.

Why does an `extract_words()` call with more filters outperform a call with fewer filters? And is there any way to improve the speed of the second call?
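For anyone who wants to reproduce the comparison, here is a minimal timing sketch. It is not part of the original post: the helper names `time_extract_words` and `compare_extra_attrs` are hypothetical, and the warm-up call assumes that the first access to a page may pay a one-time parsing cost that would otherwise skew the first measurement. It only relies on the page object exposing `extract_words()` with the `extra_attrs` keyword, as used above.

```python
import time
from statistics import mean


def time_extract_words(page, extra_attrs, repeats=3, **kwargs):
    """Return the average wall-clock seconds of page.extract_words().

    `page` is any object with an extract_words(**kwargs) method,
    for example a pdfplumber page. Hypothetical helper for illustration.
    """
    # Warm-up call (assumption: the first call may include one-time parsing).
    page.extract_words(extra_attrs=list(extra_attrs), **kwargs)

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        page.extract_words(extra_attrs=list(extra_attrs), **kwargs)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def compare_extra_attrs(page, attr_sets, repeats=3, **kwargs):
    """Time extract_words() once per attribute set; keys are attribute tuples."""
    return {
        tuple(attrs): time_extract_words(page, attrs, repeats, **kwargs)
        for attrs in attr_sets
    }


# Example usage with pdfplumber ("example.pdf" is a placeholder path):
#
#   import pdfplumber
#   with pdfplumber.open("example.pdf") as pdf:
#       timings = compare_extra_attrs(
#           pdf.pages[0],
#           [["fontname", "size"], ["size"]],
#           keep_blank_chars=True,
#       )
#       for attrs, seconds in timings.items():
#           print(attrs, f"{seconds:.3f} s")
```

Averaging over several repeats on the same page keeps the comparison between attribute sets like-for-like; absolute numbers will vary by document and machine.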
jsvine commented 3 years ago

To avoid confusion/duplication, I'm closing this issue in favor of using Discussion #483 to discuss.