jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Detecting paragraphs or blank lines inside a table #733

Closed: Cristishor201 closed this issue 1 year ago

Cristishor201 commented 1 year ago

There are two related questions on Stack Overflow:

- "pdfplumber - How to extract table with no horizontal lines?" (this one is mine)
- "Use pdfplumber to extract paragraphs" (a similar question)

I'm reposting the question here. My PDF looks like this: (image: cgsPy)

As you can see, there are no horizontal lines inside the table. I need a parameter, or some other way, to split the data in the second column so that each product stays together with its description lines. For vertical extraction (skipping the other columns), the result should look like this: ['PRODUCT 1\ndescription line 1\ndescription line 2', 'PRODUCT 2\ndescription line 1', 'PRODUCT 3\ndescription line 1\ndescription line 2']

For horizontal extraction, it should look like this: [['1', 'PRODUCT 1\ndescription line 1\ndescription line 2', 'BUC', '1', '35.00', '35.00', '6.65'], ['2', 'PRODUCT 2\ndescription line 1', 'buc', '1', '7.00', '7.00', '1.33'], ['3', 'PRODUCT 3\ndescription line 1\ndescription line 2', 'buc', '1', '31.00', '31.00', '5.89']]

In the image, I drew red rectangles to show where the rows should be split.
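
One possible workaround, sketched here under assumptions rather than as a confirmed pdfplumber feature: if the table is first extracted one text line per row (for example with `page.extract_table({"vertical_strategy": "lines", "horizontal_strategy": "text"})`, which may or may not suit this particular PDF), then rows whose first column is empty can be folded into the row above them. The helper name `merge_continuation_rows`, its `key_col` parameter, and the sample `raw_rows` below are hypothetical; the sample only mimics the table described above.

```python
def merge_continuation_rows(rows, key_col=0):
    """Fold rows with an empty key column into the preceding row.

    Assumption: a row that starts a new record always has a value in
    `key_col` (here the item number), and continuation lines do not.
    """
    merged = []
    for row in rows:
        # Extracted cells may be None; normalise them to stripped strings.
        cells = [(cell or "").strip() for cell in row]
        if cells[key_col] or not merged:
            # A non-empty key (or the very first row) starts a new record.
            merged.append(cells)
            continue
        # Otherwise append every non-empty cell to the previous record.
        previous = merged[-1]
        for i, cell in enumerate(cells):
            if cell:
                previous[i] = previous[i] + "\n" + cell if previous[i] else cell
    return merged


# Hypothetical line-by-line rows, shaped like the table in the question.
raw_rows = [
    ["1", "PRODUCT 1", "BUC", "1", "35.00", "35.00", "6.65"],
    ["", "description line 1", "", "", "", "", ""],
    [None, "description line 2", None, None, None, None, None],
    ["2", "PRODUCT 2", "buc", "1", "7.00", "7.00", "1.33"],
    ["", "description line 1", "", "", "", "", ""],
    ["3", "PRODUCT 3", "buc", "1", "31.00", "31.00", "5.89"],
    ["", "description line 1", "", "", "", "", ""],
    ["", "description line 2", "", "", "", "", ""],
]

products = merge_continuation_rows(raw_rows)
second_column = [row[1] for row in products]
```

With this sample input, `products` has the nested-list shape requested for horizontal extraction, and `second_column` has the shape requested for vertical extraction. The approach relies on the item-number column being filled only on the first line of each product, so it would need adjusting for tables where that does not hold.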

Cristishor201 commented 1 year ago

Possible duplicate of #122