Closed flycattt closed 3 years ago
Hi! Something like this has been a longstanding request; see, e.g., https://github.com/jsvine/pdfplumber/issues/10 from 2016. I think it's probably time to really try adding this feature! Or at least something useful enough, if not perfect. Thanks for the nudge. In the meantime, there are a few ways you could handle this, though the best approach will depend on your specific PDF. One approach you might try:
1. Cluster the characters with utils.cluster_objects(page.chars, attr="doctop", tolerance=??), where ?? is an integer slightly less than 2x the line height. Lines within a record then fall into one cluster, while a blank line starts a new one.
2. Run utils.extract_text(chars) on the characters of each resulting cluster.
Closing this issue due to the similarity to #10, but feel free to continue the discussion here.
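To make the clustering idea concrete, here is a minimal pure-Python sketch. It does not call pdfplumber: cluster_by_doctop and cluster_text are hypothetical stand-ins for utils.cluster_objects and utils.extract_text, and the sample chars carry only the text, x0, and doctop keys. It also assumes that characters on the same line share exactly the same doctop, which real PDFs do not always guarantee.

```python
from itertools import groupby


def cluster_by_doctop(chars, tolerance):
    """Group chars so that each char's doctop is within `tolerance`
    of the previous char's doctop (after sorting by doctop)."""
    ordered = sorted(chars, key=lambda c: c["doctop"])
    clusters = []
    for char in ordered:
        if clusters and char["doctop"] - clusters[-1][-1]["doctop"] <= tolerance:
            clusters[-1].append(char)
        else:
            clusters.append([char])
    return clusters


def cluster_text(cluster):
    """Join one cluster into text: one output line per distinct doctop,
    chars ordered left to right by x0 (simplifying assumption)."""
    lines = []
    for _, line_chars in groupby(cluster, key=lambda c: c["doctop"]):
        ordered = sorted(line_chars, key=lambda c: c["x0"])
        lines.append("".join(c["text"] for c in ordered))
    return "\n".join(lines)


def records_with_blank_lines(chars, tolerance):
    """Emit each cluster as a record, separated by a blank line."""
    clusters = cluster_by_doctop(chars, tolerance)
    return "\n\n".join(cluster_text(c) for c in clusters)


# Hypothetical sample: line height 12, one blank line before "D".
sample_chars = [
    {"text": "A", "x0": 10, "doctop": 100},
    {"text": "B", "x0": 16, "doctop": 100},
    {"text": "C", "x0": 10, "doctop": 112},
    {"text": "D", "x0": 10, "doctop": 136},
]

# Tolerance of 20 is slightly less than 2x the line height (24).
result = records_with_blank_lines(sample_chars, tolerance=20)
print(result)
```

With real pdfplumber chars, the doctop values are Decimal or float, so the tolerance is best chosen by inspecting the gap between consecutive lines in your own PDF.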
doctop_clusters: {'text': u'Q', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('250.765'), 'x1': Decimal('257.541'), 'size': Decimal('15.379'), 'adv': Decimal('6.776'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('6.776'), 'page_number': 1}, {'text': u'I', 'object_type': 'char', 'height': Decimal('12.268'), 'upright': 1, 'y1': Decimal('500.152'), 'y0': Decimal('487.884'), 'x0': Decimal('257.541'), 'x1': Decimal('259.935'), 'size': Decimal('15.379'), 'adv': Decimal('2.394'), 'fontname': 'ABCDEE+Segoe UI', 'doctop': Decimal('291.848'), 'bottom': Decimal('304.116'), 'top': Decimal('291.848'), 'width': Decimal('2.394'), 'page_number': 1}
What values of x_tolerance and y_tolerance do I need to use so that blank lines are kept in the output? I tried page.extract_text(x_tolerance=3, y_tolerance=10).
@flycattt Try using page.extract_text(layout=True); you probably do not need to specify the x_tolerance or y_tolerance parameters.
I'm using Python 2.7 and pdfplumber version_info = (0, 5, 11). Now it's crashing. Please help.
page = pdf.pages[0]
bounding_box = (d['x1'], d['y1'], d['x2'], d['y2'])
crop_area = page.crop(bounding_box)
print crop_area.extract_text(layout=True)
That's a very old version of pdfplumber (and of Python). I suggest updating.
Hi, I am using this fabulous library to extract text from PDFs. In my PDFs, records are separated by blank lines. However, after extraction, I only get a single line break, as marked in the screenshot. This makes the records hard to parse, because some records start without a pattern string. I looked through the manual but didn't find a solution. I would much appreciate it if you could help me with this!