jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Automatically read this blank space as "" #709

Closed yihaoshumi closed 2 years ago

yihaoshumi commented 2 years ago

sample.pdf

import pdfplumber

pdf = pdfplumber.open('sample.pdf')
words = pdf.pages[0].extract_words()

How can I automatically have this blank space read as ""? The cell is empty, but its position in the table is actually occupied.

image

[
{'text': 'text1text2text3', 'x0': 90.0, 'x1': 166.30090000000004, 'top': 78.38544999999999, 'doctop': 78.38544999999999, 'bottom': 88.83544999999992, 'upright': True, 'direction': 1},
{'text': '11', 'x0': 90.0, 'x1': 101.6477, 'top': 93.98545000000001, 'doctop': 93.98545000000001, 'bottom': 104.43544999999995, 'upright': True, 'direction': 1},
{'text': '22', 'x0': 117.359, 'x1': 129.0067, 'top': 93.98545000000001, 'doctop': 93.98545000000001, 'bottom': 104.43544999999995, 'upright': True, 'direction': 1},
{'text': '33', 'x0': 144.84, 'x1': 156.48770000000002, 'top': 93.98545000000001, 'doctop': 93.98545000000001, 'bottom': 104.43544999999995, 'upright': True, 'direction': 1},
{'text': '34', 'x0': 90.0, 'x1': 101.6477, 'top': 109.58545000000004, 'doctop': 109.58545000000004, 'bottom': 120.03544999999997, 'upright': True, 'direction': 1},
{'text': '35', 'x0': 117.359, 'x1': 129.0067, 'top': 109.58545000000004, 'doctop': 109.58545000000004, 'bottom': 120.03544999999997, 'upright': True, 'direction': 1},
{'text': '37', 'x0': 90.0, 'x1': 101.6477, 'top': 125.18545000000006, 'doctop': 125.18545000000006, 'bottom': 135.63545, 'upright': True, 'direction': 1},
{'text': '38', 'x0': 117.359, 'x1': 129.0067, 'top': 125.18545000000006, 'doctop': 125.18545000000006, 'bottom': 135.63545, 'upright': True, 'direction': 1},
{'text': '39', 'x0': 144.84, 'x1': 156.48770000000002, 'top': 125.18545000000006, 'doctop': 125.18545000000006, 'bottom': 135.63545, 'upright': True, 'direction': 1}
]
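
One possible way to get "" for the missing cell is to post-process the word list above rather than change the extraction itself. The sketch below is only an illustration, not a pdfplumber feature: `words_to_grid` and its two tolerance parameters are hypothetical names, and it assumes that every column is left-aligned, so that the distinct `x0` values identify the columns and the `top` values identify the rows. Any row/column slot with no word in it is left as "".

```python
def words_to_grid(words, x_tolerance=3.0, y_tolerance=3.0):
    """Arrange extract_words()-style dicts into rows of cells.

    Hypothetical helper: assumes left-aligned columns, so distinct
    'x0' values mark columns and distinct 'top' values mark rows.
    Slots that contain no word are returned as "".
    """
    # Column anchors: distinct left edges, merged within x_tolerance.
    columns = []
    for x0 in sorted(w["x0"] for w in words):
        if not columns or x0 - columns[-1] > x_tolerance:
            columns.append(x0)

    # Walk the words top-to-bottom, starting a new row whenever 'top'
    # moves more than y_tolerance below the current row's first word.
    rows = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if not rows or w["top"] - rows[-1][0] > y_tolerance:
            rows.append((w["top"], [""] * len(columns)))
        cells = rows[-1][1]
        # Put the word in the column whose anchor is nearest to its x0.
        col = min(range(len(columns)), key=lambda i: abs(columns[i] - w["x0"]))
        # If two words land in the same slot, join them with a space.
        cells[col] = (cells[col] + " " + w["text"]).strip()
    return [cells for _, cells in rows]


# Two rows taken from the output above (only the keys the helper uses).
sample = [
    {"text": "11", "x0": 90.0, "top": 93.98545000000001},
    {"text": "22", "x0": 117.359, "top": 93.98545000000001},
    {"text": "33", "x0": 144.84, "top": 93.98545000000001},
    {"text": "34", "x0": 90.0, "top": 109.58545000000004},
    {"text": "35", "x0": 117.359, "top": 109.58545000000004},
]
grid = words_to_grid(sample)
print(grid)  # [['11', '22', '33'], ['34', '35', '']]
```

With the full list from `extract_words()`, the first row would come out as `['text1text2text3', '', '']`, because that text was extracted as a single word starting at the first column. The tolerances are guesses that happen to suit this sample; a PDF with centered or right-aligned columns would need a different way of choosing the column anchors.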