Is there any way to include blank lines when extracting texts?

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


Is there any way to include blank lines when extracting texts? #516

Closed flycattt closed 3 years ago

flycattt commented 3 years ago

Hi, I am using this fabulous library to extract text from PDFs. In my PDFs, records are separated by blank lines. However, after extraction I only get a single line break between records, as marked in the screenshot. This makes the records hard to parse, because some records start without a pattern string. I looked through the manual but didn't find a solution. I'd much appreciate any help with this!

jsvine commented 3 years ago

Hi! Something like this has been a longstanding request — see, e.g., https://github.com/jsvine/pdfplumber/issues/10 from 2016. I think it's probably time to really try adding this feature! Or at least something useful enough, if not perfect. Thanks for the nudge. In the meantime, there are a few ways you could handle this, though the best approach will depend on your specific PDF. One approach you might try:

1. Use utils.cluster_objects(page.chars, attr="doctop", tolerance=??), where ?? is an integer slightly less than 2x the line height.
2. For each list of chars returned by the step above, run utils.extract_text(chars).
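The idea behind the two steps above can be sketched without pdfplumber itself. The block below is a minimal, standard-library-only stand-in, not pdfplumber's own clustering code: it assumes each char is a dict with "text", "x0" and "doctop" keys (as in page.chars), that the line height is known, and that a vertical gap of roughly two line heights marks a blank line. The names group_into_records, line_tol and gap_factor are hypothetical.

```python
def group_into_records(chars, line_height, line_tol=3.0, gap_factor=1.8):
    """Split char dicts into records wherever the vertical gap between
    consecutive lines is close to two line heights (i.e. a blank line).

    Assumption: gap_factor * line_height plays the role of the
    'slightly less than 2x the line height' tolerance from the thread.
    """
    # Step 1: bucket chars into visual lines by their "doctop" value.
    ordered = sorted(chars, key=lambda c: float(c["doctop"]))
    lines = []  # each entry is [top_of_line, [chars on that line]]
    for c in ordered:
        top = float(c["doctop"])
        if lines and top - lines[-1][0] <= line_tol:
            lines[-1][1].append(c)
        else:
            lines.append([top, [c]])

    # Step 2: start a new record whenever the gap to the previous line is large.
    records = []
    prev_top = None
    for top, line_chars in lines:
        # Read each line left to right; space chars are kept as-is.
        text = "".join(c["text"] for c in sorted(line_chars, key=lambda c: float(c["x0"])))
        if prev_top is None or top - prev_top >= gap_factor * line_height:
            records.append([])
        records[-1].append(text)
        prev_top = top
    return records


# Synthetic chars: two lines 12pt apart, then a line 24pt further down.
demo_chars = [
    {"text": "B", "x0": 5, "doctop": 100},
    {"text": "A", "x0": 0, "doctop": 100},
    {"text": "C", "x0": 0, "doctop": 112},
    {"text": "D", "x0": 0, "doctop": 136},
]
demo_records = group_into_records(demo_chars, line_height=12)
demo_text = "\n\n".join("\n".join(r) for r in demo_records)
```

With real data you would pass page.chars and a line height measured from your own PDF; the right thresholds depend on the document.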

Closing this issue due to the similarity to #10, but feel free to continue the discussion here.

abtpltd commented 11 months ago

doctop_clusters:
{'text': u'Q', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('250.765'), 'x1': Decimal('257.541'), 'size': Decimal('15.379'), 'adv': Decimal('6.776'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('6.776'), 'page_number': 1},
{'text': u'I', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('257.541'), 'x1': Decimal('259.935'), 'size': Decimal('15.379'), 'adv': Decimal('2.394'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('2.394'), 'page_number': 1}

What values of x_tolerance and y_tolerance do I need to use so that blank lines are exported? For example: page.extract_text(x_tolerance=3, y_tolerance=10)

jsvine commented 11 months ago

@flycattt Try using page.extract_text(layout=True); you probably do not need to specify the x_tolerance or y_tolerance parameters.
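As a rough illustration of the suggestion above, the sketch below calls extract_text(layout=True) and then splits the result into records on blank lines, which is what the original question was after. The helper names extract_layout_text and split_on_blank_lines are hypothetical, and the sketch assumes a pdfplumber version recent enough to support layout=True and that blank lines in the PDF come through as empty or whitespace-only lines in the extracted text.

```python
def extract_layout_text(pdf_path, page_index=0):
    """Return one page's text with layout=True (needs a recent pdfplumber)."""
    import pdfplumber  # imported lazily so the helper below works without it

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_text(layout=True)


def split_on_blank_lines(text):
    """Split extracted text into records, each a list of non-blank lines.

    Assumption: a blank line may be empty or contain only whitespace.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.rstrip())
        elif current:
            # A blank line closes the record collected so far.
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


# Stand-in for layout-mode output; real output depends on the PDF.
sample_text = "rec1 line1\nrec1 line2\n\n   \nrec2 line1\n"
sample_records = split_on_blank_lines(sample_text)
```

For a real file you would call split_on_blank_lines(extract_layout_text("your.pdf")), where "your.pdf" is a placeholder path.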

abtpltd commented 11 months ago

I'm using Python 2.7 and pdfplumber version_info = (0, 5, 11). Now it's crashing. Please help.

    page = pdf.pages[0]
    bounding_box = (d['x1'], d['y1'], d['x2'], d['y2'])
    crop_area = page.crop(bounding_box)
    print crop_area.extract_text(layout=True)

jsvine commented 11 months ago

That's a very old version of pdfplumber (and of Python). I suggest updating.