jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Extracting table with vertical texts give unreadable result #942

Open Dragon2fly opened 11 months ago

Dragon2fly commented 11 months ago

Describe the bug

Table extraction with vertical header text returns unreadable strings, or text in reversed order.

Have you tried repairing the PDF?

Yes. The problem is still there.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open(r"tests\pdf_samples\camelot\agstat.pdf", repair=True)
p0 = pdf.pages[0]
# im = p0.to_image()
# im.debug_tablefinder()
# im.show()
table = p0.extract_table()
for line in table:
    print(line)

PDF file

agstat.pdf

Expected behavior

The vertical text in the red box should be extracted correctly.

image

Actual behavior

It returned unreadable text for the first row:

['Sl.\nNo.', 'District', 'noitalupoP\n31-2102\n)shkal\ndetcejorP\nnI(\nrof', '%88\not )shkal\ntludA\ntnelaviuqE\nnI(', ')yad/tluda/smg004\nnoitpmusnoC\n)sennot\ntnemeriuqer\nhkaL\nlatoT\nnI(\n@(', 'tnemeriuqeR ,sdees )egatsaw )sennot\ngnidulcnI(\nhkaL\n&\nsdeef\nlatoT\nnI(', 'Production (Rice)\n(In Lakh tonnes)', None, None, 'Surplus/Defi cit\n(In Lakh\ntonnes)', None]

And it returned reversed text for the second row:

[None, None, None, None, None, None, 'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']
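
The cells above look corrupted, but each line appears to be the intended text with its characters in reverse order. A minimal sketch to confirm this, using a hypothetical helper `unreverse_cell` (an assumption for illustration, not part of pdfplumber) that reverses every line of a cell independently:

```python
def unreverse_cell(cell):
    """Reverse the characters of each line in a table cell.

    Hypothetical diagnostic helper, not a pdfplumber function.
    Empty cells (None) are passed through unchanged.
    """
    if cell is None:
        return None
    return "\n".join(line[::-1] for line in cell.split("\n"))


# Second row of the table, copied from the output above.
second_row = [None, None, None, None, None, None,
              "firahK", "ibaR", "latoT", "eciR", "yddaP"]

fixed_row = [unreverse_cell(cell) for cell in second_row]
print(fixed_row)
```

This only diagnoses the symptom. It restores the characters within each line but not the order of the lines, so it is not a substitute for fixing the extraction settings.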

Screenshots

The table outline is still detected correctly

image

Environment

cmdlineluser commented 11 months ago

You can try modifying the default text-extraction options, e.g.:

page.extract_table(dict(text_vertical_ttb=False))
[['Sl.\nNo.',
  'District',
  'Population\n2012-13\nlakhs)\nProjected\n(In\nfor',
  '88%\nto lakhs)\nAdult\nEquivalent\n(In',
  '400gms/adult/day)\nConsumption\ntonnes)\nrequirement\nLakh\nTotal\n(In\n(@',
  'Requirement seeds, wastage) tonnes)\n(Including\nLakh\n&\nfeeds\nTotal\n(In',
  'Production (Rice)\n(In Lakh tonnes)',
  None,
  None,
  'Surplus/Defi cit\n(In Lakh\ntonnes)',
  None],
 [None,
  None,
  None,
  None,
  None,
  None,
  'Kharif',
  'Rabi',
  'Total',
  'Rice',
  'Paddy']]
...
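
For reference, the same call can be wrapped so that the setting is reusable and the file is closed afterwards. This is a sketch only: the helper names `bottom_up_table_settings` and `extract_table_bottom_up` are made up for illustration, and the path in the usage comment is the one from the original report.

```python
def bottom_up_table_settings(**overrides):
    """Build table settings that read vertical text bottom-to-top.

    Hypothetical helper; any extra keyword arguments are merged in as
    additional table settings.
    """
    settings = {"text_vertical_ttb": False}
    settings.update(overrides)
    return settings


def extract_table_bottom_up(pdf_path, page_index=0, **overrides):
    """Extract the largest table on one page with bottom-to-top vertical text."""
    # Third-party dependency, imported here so the settings helper
    # above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(bottom_up_table_settings(**overrides))


# Example usage (needs pdfplumber and the sample PDF from the report):
#     table = extract_table_bottom_up(r"tests\pdf_samples\camelot\agstat.pdf")
#     for row in table:
#         print(row)
```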
Dragon2fly commented 11 months ago

Hi @cmdlineluser

Thank you for your suggestion. It worked! But I don't see the text_vertical_ttb parameter mentioned anywhere in README.md. Are you planning to turn this feature on/off automatically?

cmdlineluser commented 11 months ago

They are mentioned in the description of the .extract_words() method.

The parameters horizontal_ltr and vertical_ttb indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words).

With regards to plans, I'm just a fellow pdfplumber user.

That would probably be a question for @jsvine
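
To illustrate what the vertical_ttb flag described above controls, here is a toy model. It is not pdfplumber's implementation: the function `read_vertical_word` and the coordinates are invented for this example, and only the "text" and "top" keys mirror the char dictionaries pdfplumber exposes.

```python
def read_vertical_word(chars, vertical_ttb=True):
    """Join the chars of one vertical word in the requested reading direction.

    Toy model only: `chars` is a list of dicts with "text" and "top" keys,
    where "top" is the distance from the top of the page.
    """
    # Top-to-bottom means ascending "top"; bottom-to-top means descending.
    ordered = sorted(chars, key=lambda c: c["top"], reverse=not vertical_ttb)
    return "".join(c["text"] for c in ordered)


# Made-up coordinates for a word rotated to read bottom-to-top:
# its first letter sits lowest on the page (largest "top").
chars = [
    {"text": "R", "top": 40},
    {"text": "a", "top": 30},
    {"text": "b", "top": 20},
    {"text": "i", "top": 10},
]

print(read_vertical_word(chars, vertical_ttb=True))   # ibaR
print(read_vertical_word(chars, vertical_ttb=False))  # Rabi
```

With the default top-to-bottom assumption, a word that was typeset bottom-to-top comes out reversed, which matches the symptom in the report.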

jsvine commented 11 months ago

Thanks for your help here, @cmdlineluser!

@Dragon2fly, it's helpful to hear your confusion. To know about text_vertical_ttb, you would have had to jump between a few different parts of the README.md file. I'll aim to add better documentation of the text-related methods soon.

> Are you planning to turn this feature on/off automatically?

I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?

Dragon2fly commented 11 months ago

Hi @jsvine,

> I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?

From a user experience perspective, the fewer parameters that need to be configured the better. So I just wonder if there is a way to detect the text orientation and just extract it correctly.

Anyway, although text_vertical_ttb=False did reverse the text correctly, for multi-line vertical text the output still mixes up words from different lines: Population\n2012-13\nlakhs)\nProjected\n(In\nfor

The correct output should be Projected Population\nfor 2012-13\n(In lakhs). I tried use_text_flow=True, but it didn't help either.

Any suggestion?
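
To make the expected grouping concrete: for text rotated to read bottom-to-top, each rotated line occupies its own horizontal position, so words would need to be grouped by x-coordinate first and then read from the bottom up. The sketch below is a toy model under stated assumptions, not pdfplumber's algorithm and not a confirmed fix. The function `join_vertical_lines` is hypothetical, the coordinates are made up, and words are grouped by exact `x0` match, whereas a real PDF would need a tolerance.

```python
from itertools import groupby


def join_vertical_lines(words):
    """Rebuild multi-line bottom-to-top text from word boxes.

    Toy model: each word is a dict with "text", "x0" (left edge) and
    "bottom" (distance of the lower edge from the top of the page).
    Words that share an x0 are treated as one rotated line.
    """
    # Lines run left to right; within a line, the lowest word comes first.
    ordered = sorted(words, key=lambda w: (w["x0"], -w["bottom"]))
    lines = []
    for _, group in groupby(ordered, key=lambda w: w["x0"]):
        lines.append(" ".join(w["text"] for w in group))
    return "\n".join(lines)


# Made-up word boxes for the header cell discussed above.
words = [
    {"text": "Population", "x0": 100, "bottom": 60},
    {"text": "2012-13", "x0": 112, "bottom": 60},
    {"text": "lakhs)", "x0": 124, "bottom": 60},
    {"text": "Projected", "x0": 100, "bottom": 120},
    {"text": "for", "x0": 112, "bottom": 120},
    {"text": "(In", "x0": 124, "bottom": 120},
]

print(join_vertical_lines(words))
```

Grouping by vertical position instead, as one would for horizontal text, puts Population, 2012-13 and lakhs) together, which resembles the merged output quoted above.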

jsvine commented 11 months ago

@Dragon2fly Thank you for clarifying. At the moment, automatic text-direction detection isn't on my roadmap, both because of the likely large number of edge cases and because I prefer to keep extraction "predictable" and its parameters explicit. But I appreciate the suggestion and will keep your use case in mind.

Re. lines merging: Try decreasing the text_y_tolerance setting to 0 (or even a negative number). Does that help?
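
One way to try this suggestion is to build a few candidate settings and compare the results side by side. A minimal sketch, assuming the text_vertical_ttb=False setting from earlier in the thread is kept; the helper name `tolerance_candidates` and the chosen values are illustrative only:

```python
def tolerance_candidates(values=(1, 0, -1)):
    """Return one table-settings dict per text_y_tolerance value to try.

    Hypothetical helper; the default values are arbitrary examples.
    """
    return [
        {"text_vertical_ttb": False, "text_y_tolerance": value}
        for value in values
    ]


for settings in tolerance_candidates():
    print(settings)
    # With pdfplumber and the sample PDF available, each dict could be tried as:
    #     table = page.extract_table(settings)
```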

Dragon2fly commented 11 months ago

Hi @jsvine. Thanks for your suggestion. But setting text_y_tolerance to 0 or -1 didn't help. There should be other ways to solve this problem.

jsvine commented 11 months ago

Thank you @Dragon2fly. Looking into this, there may be a bug in how pdfplumber handles bottom-to-top text. I will investigate and hope to find a fix.

afriedman412 commented 8 months ago

this rhymes with https://github.com/jsvine/pdfplumber/issues/942

going to work on it