Open Dragon2fly opened 11 months ago
You can try modifying the default text extraction options e.g.
page.extract_table(dict(text_vertical_ttb=False))
[['Sl.\nNo.',
'District',
'Population\n2012-13\nlakhs)\nProjected\n(In\nfor',
'88%\nto lakhs)\nAdult\nEquivalent\n(In',
'400gms/adult/day)\nConsumption\ntonnes)\nrequirement\nLakh\nTotal\n(In\n(@',
'Requirement seeds, wastage) tonnes)\n(Including\nLakh\n&\nfeeds\nTotal\n(In',
'Production (Rice)\n(In Lakh tonnes)',
None,
None,
'Surplus/Defi cit\n(In Lakh\ntonnes)',
None],
[None,
None,
None,
None,
None,
None,
'Kharif',
'Rabi',
'Total',
'Rice',
'Paddy']]
...
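For anyone landing here later, a minimal sketch of the full call. It relies on pdfplumber passing any table setting prefixed with `text_` through to text extraction. The helper names `extract_table_bottom_to_top` and `first_table_from_pdf` are hypothetical, and the page index is an assumption about where the table sits in the attached PDF:

```python
def extract_table_bottom_to_top(page):
    """Extract the first table on `page`, reading vertical text bottom-to-top."""
    # Keys prefixed with "text_" are handed to the text extraction step.
    table_settings = {"text_vertical_ttb": False}
    return page.extract_table(table_settings)


def first_table_from_pdf(pdf_path, page_index=0):
    """Open a PDF and extract one table (page_index is an assumption)."""
    import pdfplumber  # imported here so the helper above has no hard dependency

    with pdfplumber.open(pdf_path) as pdf:
        return extract_table_bottom_to_top(pdf.pages[page_index])
```

Calling `first_table_from_pdf("agstat.pdf")` should be equivalent to the one-liner above, assuming the table is on the first page.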
Hi @cmdlineluser
Thank you for your suggestion. It worked!
But I don't see the param text_vertical_ttb mentioned anywhere in the README.md.
Are you planning to turn this feature on/off automatically?
They are mentioned in the description of the .extract_words() method:
The parameters horizontal_ltr and vertical_ttb indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words).
With regards to plans, I'm just a fellow pdfplumber user.
That would probably be a question for @jsvine
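To make the meaning of vertical_ttb concrete, here is a toy illustration in plain Python. It is not pdfplumber's implementation; it only assumes character dicts with `text` and `top` keys, where `top` is the distance from the top of the page. The function name `join_vertical_chars` is hypothetical:

```python
def join_vertical_chars(chars, vertical_ttb=True):
    """Join the characters of one vertical word in reading order.

    Ascending `top` means top-to-bottom; descending means bottom-to-top.
    """
    ordered = sorted(chars, key=lambda c: c["top"], reverse=not vertical_ttb)
    return "".join(c["text"] for c in ordered)


# "Rabi" written bottom-to-top: "R" is lowest on the page (largest `top`).
chars = [
    {"text": "R", "top": 30.0},
    {"text": "a", "top": 20.0},
    {"text": "b", "top": 10.0},
    {"text": "i", "top": 0.0},
]

print(join_vertical_chars(chars, vertical_ttb=True))   # ibaR
print(join_vertical_chars(chars, vertical_ttb=False))  # Rabi
```

With the default top-to-bottom reading, a header written bottom-to-top comes out reversed, which matches the reversed second row reported in this issue.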
Thanks for your help here, @cmdlineluser!
@Dragon2fly, it's helpful to hear your confusion. To know about text_vertical_ttb, you would have had to jump between a few different parts of the README.md file. I'll aim to add better documentation of the text-related methods soon.
Are you planning to turn this feature on/off automatically?
I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?
Hi @jsvine,
I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?
From a user experience perspective, the fewer parameters that need to be configured the better. So I just wonder if there is a way to detect the text orientation and just extract it correctly.
Anyway, even though text_vertical_ttb did help reverse the text correctly, for multi-line vertical text the output still mixes up words from different lines: Population\n2012-13\nlakhs)\nProjected\n(In\nfor
The correct one should be Projected Population\nfor 2012-13\n(In lakhs).
I tried use_text_flow=True, but it didn't help either.
Any suggestions?
@Dragon2fly Thank you for clarifying. At the moment, adding automatic text-direction detection isn't on my roadmap, due to the likely large number of edge-cases, and my preference to keep extraction "predictable" and parameters explicit. But I appreciate the suggestion and will keep your use-case in mind.
Re. lines merging: Try decreasing the text_y_tolerance setting to 0 (or even a negative number). Does that help?
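Since automatic direction detection isn't planned, a caller could approximate it before choosing settings. The sketch below assumes each character dict (as found in `page.chars`) carries a boolean `upright` key; the function name `mostly_vertical` and the 0.5 threshold are hypothetical choices:

```python
def mostly_vertical(chars, threshold=0.5):
    """Return True if more than `threshold` of the visible chars are not upright."""
    # Ignore whitespace so padding does not skew the ratio.
    visible = [c for c in chars if c["text"].strip()]
    if not visible:
        return False
    rotated = sum(1 for c in visible if not c.get("upright", True))
    return rotated / len(visible) > threshold


sample = [
    {"text": "R", "upright": False},
    {"text": "a", "upright": False},
    {"text": " ", "upright": True},
    {"text": "b", "upright": False},
]
print(mostly_vertical(sample))  # True
```

Note that `upright` only says the text is rotated. It does not by itself say whether the text runs top-to-bottom or bottom-to-top, so the choice of text_vertical_ttb still has to be made explicitly.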
Hi @jsvine. Thanks for your suggestion.
But setting text_y_tolerance to 0 or -1 didn't help.
There should be other ways to solve this problem.
Thank you @Dragon2fly. Looking into this, there may be a bug in how pdfplumber handles bottom-to-top text. I will investigate and hope to find a fix.
This rhymes with https://github.com/jsvine/pdfplumber/issues/942
Going to work on it.
Describe the bug
Table extraction with vertical header text returned unreadable strings or text in reversed order.
Have you tried repairing the PDF?
Yes. The problem is still there
Code to reproduce the problem
PDF file
agstat.pdf
Expected behavior
The vertical text in the red box should be extracted correctly.
Actual behavior
It returned unreadable text for the first row:
And returned reversed text of the second row
Screenshots
The table outline is still detected correctly
Environment