yeus closed this issue 8 months ago
Seems like your problem could be solved by passing None to the boxes_flow parameter of LAParams(), as in LAParams(boxes_flow=None); conclusion drawn from #411.
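A minimal sketch of that suggestion, assuming the pages are read through pdfminer.six's high-level extract_pages() and that the installed release accepts None for boxes_flow. The helper name extract_pages_fast is made up for illustration; it is not part of the library.

```python
def extract_pages_fast(pdf_path):
    """Yield one LTPage per page, with boxes_flow set to None as suggested above."""
    # pdfminer.six is imported inside the function so that defining this
    # helper does not by itself require the package to be installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    # All other layout parameters keep their defaults.
    laparams = LAParams(boxes_flow=None)
    yield from extract_pages(pdf_path, laparams=laparams)
```

Iterating over extract_pages_fast("some.pdf") (a placeholder path) should then behave like a plain extract_pages() call, only with the changed layout parameter.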
Hi,
Just want to reply that this seems to work quite well. I haven't removed all the heavy lifting yet, but to give you a better idea, I did some timings:
The following was done with "vanilla" LAParams():
Time taken for page 10: 0.368527889251709 seconds, elements:602
Time taken for page 16: 23.002509593963623 seconds, elements:31153
Time taken for page 17: 14.153702735900879 seconds, elements:31262
Time taken for page 18: 0.3653285503387451 seconds, elements:576
Time taken for page 19: 0.36687445640563965 seconds, elements:596
Then with LAParams(detect_vertical=False):
Time taken for page 10: 0.3694620132446289 seconds, elements:602
Time taken for page 16: 22.085140705108643 seconds, elements:31153
Time taken for page 17: 13.33229398727417 seconds, elements:31262
Time taken for page 18: 0.36687588691711426 seconds, elements:576
Time taken for page 19: 0.359846830368042 seconds, elements:596
Then with LAParams(boxes_flow=None):
Time taken for page 10: 0.3904867172241211 seconds, elements:602
Time taken for page 16: 3.1319127082824707 seconds, elements:31153
Time taken for page 17: 3.055600643157959 seconds, elements:31262
Time taken for page 18: 0.248185396194458 seconds, elements:576
Time taken for page 19: 0.24678397178649902 seconds, elements:596
And with LAParams(detect_vertical=False, boxes_flow=None):
Time taken for page 10: 0.3619980812072754 seconds, elements:602
Time taken for page 16: 3.106546640396118 seconds, elements:31153
Time taken for page 17: 3.0158188343048096 seconds, elements:31262
Time taken for page 18: 0.24485182762145996 seconds, elements:576
Time taken for page 19: 0.2494044303894043 seconds, elements:596
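So the improvement is quite big: with boxes_flow=None, pages 16 and 17 drop from roughly 23 s and 14 s to about 3 s each. detect_vertical=False, on the other hand, seems to have minimal influence on the efficiency.
Any other ideas on how we could speed this up? :)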
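For anyone who wants to reproduce measurements like these, here is a rough sketch of a timing harness. It is not the script that produced the numbers in this comment: the helper names are hypothetical, pages are numbered from 1 in iteration order, and "elements" is assumed to mean every descendant of the page, which may differ from how the counts above were obtained.

```python
import time


def count_elements(container):
    """Count every descendant of a layout container, recursing into nested containers."""
    total = 0
    for child in container:
        total += 1
        # Containers (pages, boxes, lines, figures) are iterable; leaf items are not.
        if hasattr(child, "__iter__") and not isinstance(child, (str, bytes)):
            total += count_elements(child)
    return total


def time_pages(pages):
    """Return a list of (page_number, seconds, element_count), one entry per page.

    `pages` is any iterable yielding one layout per page. With a lazy
    generator, the work for each page happens inside next(), so that call
    is what gets timed; counting the elements is not included.
    """
    results = []
    iterator = iter(pages)
    page_number = 0
    while True:
        start = time.perf_counter()
        try:
            layout = next(iterator)
        except StopIteration:
            break
        elapsed = time.perf_counter() - start
        page_number += 1
        results.append((page_number, elapsed, count_elements(layout)))
    return results


# Usage with pdfminer.six ("example.pdf" is a placeholder path):
#   from pdfminer.high_level import extract_pages
#   from pdfminer.layout import LAParams
#   pages = extract_pages("example.pdf", laparams=LAParams(boxes_flow=None))
#   for number, seconds, count in time_pages(pages):
#       print(f"Time taken for page {number}: {seconds} seconds, elements:{count}")
```

Swapping the LAParams(...) arguments in the commented usage is enough to compare the different configurations on the same file.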
Is there a way to prevent pdfminer.six from executing the layout algorithm, so that one only gets a list of lines, graphics, image elements, etc.? I have several PDFs where the layout algorithm takes a loooooong time, simply because there are so many tables and textboxes distributed all over them. Also see this issue: https://github.com/euske/pdfminer/issues/61
I don't need the layout algorithm, so it would be sufficient for me to simply iterate over the page without textboxes.
How would I do that? Right now I am using this: https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-pages
Is it somehow possible with this instead? https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html
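One possible approach with the composable API, as a sketch rather than a confirmed answer: as far as I can tell, PDFPageAggregator only runs the layout analysis when it is given a laparams object, while extract_pages() falls back to a default LAParams() when none is passed. Building the pipeline by hand with laparams=None should therefore yield pages that contain the raw characters, lines, rectangles, curves, images and figures without grouping them into text lines or boxes. Please check this against your installed version. The function names below are hypothetical.

```python
from collections import Counter


def iter_pages_without_layout(pdf_path):
    """Yield one LTPage per page, built without running the layout analysis."""
    # pdfminer.six is imported inside the function so that defining this
    # helper does not by itself require the package to be installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    with open(pdf_path, "rb") as fp:
        resource_manager = PDFResourceManager()
        # Assumption: with laparams=None the aggregator collects the page
        # elements but skips the grouping into text lines and text boxes.
        device = PDFPageAggregator(resource_manager, laparams=None)
        interpreter = PDFPageInterpreter(resource_manager, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()


def summarize_elements(page):
    """Count the top-level elements of a page by class name."""
    return Counter(type(element).__name__ for element in page)


# Usage ("example.pdf" is a placeholder path):
#   for page in iter_pages_without_layout("example.pdf"):
#       print(summarize_elements(page))
```

If the assumption holds, the summary for each page should list classes such as LTChar and LTLine but no LTTextBoxHorizontal entries.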