jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Different LAParams for different zones #893

Closed: QuentinAndre11 closed this issue 1 year ago

QuentinAndre11 commented 1 year ago

If I understand correctly, pdfplumber extracts whole pages first and then crops or filters the resulting elements. Is there a way to crop before extracting or, even better, to extract different zones of the PDF with different LAParams? I'd like to avoid extracting my PDF multiple times with different parameters.

jsvine commented 1 year ago

Hi @QuentinAndre11, and thanks for the interesting proposal. Unfortunately, I don't think this will be possible (although I'm open to other people suggesting solutions that I may have overlooked). The LAParams are passed to pdfminer.six when pdfplumber requests the layout for a page:

https://github.com/jsvine/pdfplumber/blob/ae676aeacd958e7b1572f35568e612314d611eff/pdfplumber/page.py#L152-L163

I don't believe that pdfminer.six can apply its layout analysis to just one zone (or cropped section) of a page. (Moreover, the cropping/filtering itself actually depends on the results of the layout analysis.)
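
For readers who land here, a minimal sketch of the fallback this implies: run one full layout pass per set of LAParams, then keep only the text boxes that fall inside the zone of interest. It does not avoid the repeated extraction the question hoped to skip. The helper names (`within_zone`, `textboxes_per_zone`), the zone coordinates, and the LAParams values are hypothetical, and the sketch assumes that pdfplumber exposes pdfminer.six's text boxes under the `textboxhorizontal` object type with a `text` key when `laparams` is passed to `pdfplumber.open`.

```python
from typing import Any, Dict, List, Tuple

# (x0, top, x1, bottom), measured from the top-left corner as pdfplumber does
BBox = Tuple[float, float, float, float]


def within_zone(obj: Dict[str, Any], zone: BBox) -> bool:
    """Return True if the object's bounding box lies entirely inside the zone."""
    x0, top, x1, bottom = zone
    return (
        obj["x0"] >= x0
        and obj["x1"] <= x1
        and obj["top"] >= top
        and obj["bottom"] <= bottom
    )


def textboxes_per_zone(
    path: str,
    page_index: int,
    zones: Dict[str, Tuple[BBox, Dict[str, Any]]],
) -> Dict[str, List[str]]:
    """Run one whole-page layout pass per zone, then filter by bounding box.

    `zones` maps a label to (bbox, laparams). The layout analysis still
    covers the full page each time; only the filtering is zone-specific.
    """
    import pdfplumber  # imported lazily so within_zone has no dependencies

    results: Dict[str, List[str]] = {}
    for name, (bbox, laparams) in zones.items():
        with pdfplumber.open(path, laparams=laparams) as pdf:
            page = pdf.pages[page_index]
            # Assumption: layout objects appear under this key when laparams is set
            boxes = page.objects.get("textboxhorizontal", [])
            results[name] = [
                box.get("text", "") for box in boxes if within_zone(box, bbox)
            ]
    return results


# Hypothetical zones for a US Letter page: looser line grouping in the body
EXAMPLE_ZONES = {
    "header": ((0, 0, 612, 100), {"line_margin": 0.2}),
    "body": ((0, 100, 612, 792), {"line_margin": 0.8, "boxes_flow": 0.5}),
}
```

Calling `textboxes_per_zone("report.pdf", 0, EXAMPLE_ZONES)` (with a hypothetical file name) would reopen the file once per zone. Reopening keeps the sketch simple because the `laparams` are fixed when the PDF is opened, as described above.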