jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Option to extract text of cells with leading and trailing white spaces preserved #277

Open AysegulKarcili opened 4 years ago


Hi! First of all, thanks a lot for this library. It is excellent for extracting tables, and it is very easy and intuitive to use.

I am working with tables that are too long for one page and are split across multiple pages. Sometimes the split falls between two table rows, but sometimes it falls in the middle of a row. In the second case, the multi-line text of that row is divided between two pages. Because the cell text wraps onto multiple lines, the division sometimes falls between tokens and sometimes inside a single token. When I extract the text of each cell with pdfplumber, it currently comes back stripped of leading and trailing white spaces. So when merging the parts, I should insert a white space if the division occurred between tokens, and insert nothing if it occurred in the middle of a single token. However, this is almost impossible to decide from the stripped text, because there are also cases involving numbers (see Examples 2 and 3). Therefore I need the text of each cell as it is, with its leading and trailing white spaces preserved rather than stripped.
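To make the merging problem concrete, here is a minimal sketch in plain Python. It does not call pdfplumber at all; the function name `merge_split_cell` and the sample strings are made up for illustration, and it assumes the two halves of a cell are already available as strings.

```python
def merge_split_cell(top, bottom):
    """Join the two halves of a cell that was split across pages.

    Assumes both halves still carry their original leading and trailing
    white spaces, so plain concatenation restores the text.
    None (an empty cell) is treated as an empty string.
    """
    return (top or "") + (bottom or "")


# With white spaces preserved, the two cases can be told apart:
between_tokens = merge_split_cell("total amount ", "due")  # split between tokens
inside_token = merge_split_cell("12", "345")               # split inside one token

# With stripped text, "12" + "345" and "12" + " 345" look identical,
# so inserting a space (or not) is only a guess:
stripped_guess = " ".join(["12", "345"])

print(between_tokens)
print(inside_token)
print(stripped_guess)
```

With preserved white spaces the merge is a plain concatenation; with stripped text, the caller has to guess the separator, which is exactly what goes wrong for numbers.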

To meet this need, I added the feature to the library and tested it on my task. I will commit the change and open a pull request. Feedback and improvements are welcome.

Three examples of the issue, with explanations, are below.

Example 1: The 1st, 3rd, 6th and 7th columns need to be merged with a white space in between. The 4th and 8th columns need to be merged directly. (Screenshot: Screenshot_20200927_172514)

Example 2: Numbers case: The number in the 4th column should be merged directly, without a white space. (Screenshot: Screenshot_20200927_172921)

Example 3: Numbers case: The number in the 7th column should be merged with a white space in between. (Screenshot: Screenshot_20200927_173202)
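The row-level merge these examples call for could be sketched as below. This is only an illustration in plain Python under the assumption that cell texts arrive with their white spaces intact; `merge_split_row` and the sample rows are hypothetical and are not taken from the screenshots or from pdfplumber.

```python
def merge_split_row(last_row, first_row):
    """Merge the last row of one page with the first row of the next page.

    Cells are joined column by column with no separator, relying on the
    preserved white spaces to decide whether a gap appears.
    None (an empty cell) is treated as an empty string.
    """
    if len(last_row) != len(first_row):
        raise ValueError("rows must have the same number of columns")
    return [(a or "") + (b or "") for a, b in zip(last_row, first_row)]


# Hypothetical sample data: a row cut in half by a page break.
page_1_last_row = ["Annual ", "2020", "12", None]
page_2_first_row = ["report", "", "345", " 7"]

merged = merge_split_row(page_1_last_row, page_2_first_row)
print(merged)
```

Here the first column gains its space from the preserved trailing white space, while the number in the third column is joined directly.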