jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

How to read paragraph? #654

Closed nianfouyi closed 2 years ago

nianfouyi commented 2 years ago

I want to read paragraph content, but I can't find a way to do it. Is there no way to do this?

jsvine commented 2 years ago

Hi @luozhongxiangsi, and thanks for your interest in this library. A “paragraph” is not a concept defined by the PDF specification, and paragraphs are visually represented in different ways in different PDFs, so there’s no consistent/reliable way to identify them. However, you may be able to achieve some of your goals using page.extract_text(layout=True, …). (See the documentation for .extract_text in the README for more details.)
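As a rough illustration of that suggestion, here is a minimal sketch that extracts layout-preserving text from one page and then splits it on blank lines. The helper names `split_paragraphs` and `read_paragraphs` are hypothetical, not part of pdfplumber, and the blank-line rule is only a heuristic: it assumes the PDF separates paragraphs with vertical space, which many documents do not.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs wherever one or more blank lines occur.

    Heuristic only: assumes paragraphs are separated by blank
    (or whitespace-only) lines in the extracted text.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse line breaks and layout padding into single spaces.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs


def read_paragraphs(pdf_path, page_number=0):
    """Hypothetical helper: extract one page's text and split it heuristically."""
    import pdfplumber  # imported here so split_paragraphs works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # layout=True asks pdfplumber to approximate the page's visual layout,
        # which tends to keep vertical gaps as blank lines.
        text = page.extract_text(layout=True) or ""
    return split_paragraphs(text)
```

Calling `read_paragraphs("example.pdf")` (the file name is a placeholder) would return a list of strings for the first page. If the document marks paragraphs with first-line indentation instead of vertical gaps, this approach will merge them, and a different rule would be needed.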