Open gnadlr opened 1 year ago
Thanks for opening this issue, @gnadlr; and thanks for your contributions to the related discussion and other recent ones, @cmdlineluser! Some observations:
'aaaa bbbb' and '1111' do not seem to appear in the sample PDF, so it's difficult to diagnose the situation precisely. Are you able to share the PDF that contains those strings? Or, alternatively, restate the problem in terms of the sample PDF that you've shared above?
Your issue here and the sample PDF also helped me to diagnose a bug in the way pdfplumber handles `use_text_flow=True`. Hoping to push a fix for this soon, and hoping it helps you with your PDF.
In the sample PDF attached, there is an interesting quirk: It uses a "clipping path" to make the text that overflows certain cells invisible. If you're curious, it's this part (and another like it) in the raw PDF commands:
You can check out these 2 sample PDFs: 1a.pdf 2a.pdf
Issue with extract_tables(): text mingling. It is very apparent in the 2nd PDF.
Issue with extract_words(): spaces are not recognized properly between the last word of a column and the first word of the next column. It is very apparent in the 2nd PDF (columns overlap) and also noticeable in the 1st PDF (columns do not overlap, but the words of the two columns are very close).
Apologies for any confusion regarding this @jsvine
https://drive.google.com/file/d/1SlQAkQ7W28O6mZXvKx2Lvw__DjdhnX82/view
Using row 1 of page 2 from your updated sample as an example:
page2.search('Desloratadin.*')[0]['text']
'Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13
Gracure PharmIancdeiuatical LtHd ộp 1 lọ 60 ml Chai 4.120 65.000 267.800.000
Công ty TNNH2H Dược phGẩ1m 1A ViệBt NVa Đma khoaB tắỉnch K Bạắnc
K1ạn69/QĐ-SY0T4/3/2022
'
With `use_text_flow=True` it fixes some overlap, e.g. Pharmaceutical:
page2.search('Desloratadin.*', use_text_flow=True)[0]['text']
'Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13
Gracure Pharmaceutical India Ltd Hộp 1 lọ 60 ml Chai 4.120 65.000 267.800.000
Công ty TNHH NamN2 Dược phẩm G1 1A Việt BV Đa khoa KạnBắc tỉnh Kạn Bắc
169/QĐ-SYT04/3/2022
'
`keep_blank_chars=True` seems to fix some more, e.g. Việt Nam (although N2 is merged into it and there is a space):
page2.search('Desloratadin.*', use_text_flow=True, keep_blank_chars=True)[0]['text']
'Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13
Gracure Pharmaceutical Ltd India Hộp 1 lọ 60 ml Chai 4.120 65.000 267.800.000
Công ty TNHH Dược phẩm 1A Việt NamN2 G1 BV Đa khoa tỉnh Bắc KạnBắc Kạn
169/QĐ-SYT04/3/2022
'
I'm not sure if all the text there is now correct? There are still spacing issues, e.g. Việt NamN2
It does appear you can pass `text_use_text_flow`/`text_keep_blank_chars` to the table methods: https://github.com/jsvine/pdfplumber/issues/764#issuecomment-1433107340 (I did not realize this was possible)
pd.DataFrame(page2.extract_table()).iloc[[1], :10]
# 0 1 2 3 4 5 6 7 8 9
# 1 1 Desloratadin Uống 0,5mg/ml x 60 m lDestacure VN-16773-13 VN-16773-13 Gracure Phar mIancdeiuatical L tHd ộp 1 lọ 60 ml
pd.DataFrame(page2.extract_table({"text_use_text_flow": True})).iloc[[1], :10]
# 0 1 2 3 4 5 6 7 8 9
# 1 1 Desloratadin Uống 0,5mg/ml x 60 m lDestacure VN-16773-13 VN-16773-13 Gracure Phar maceutical LIndia td Hộp 1 lọ 60 ml
There does seem to be something else going on though.
@cmdlineluser is on point. Here are some more comparisons so it is easier to see the issue. Continuing with the example (the first 10 columns of row 1 of PDF sample 2), we have the following:
# 0 1 2 3 4 5 6 7 8 9
# 1 1 Desloratadin Uống 0,5mg/ml x 60 ml Destacure VN-16773-13 VN-16773-13 Gracure Pharmaceutical Ltd India Hộp 1 lọ 60 ml
Text is split where there is overlapping; the split text merges with the next column (lDestacure) and mingles across columns (Gracure Pharmaceutical Ltd becomes 3 columns: Gracure Phar | mIancdeiuatical L | tHd)
# 0 1 2 3 4 5 6 7 8 9
# 1 1 Desloratadin Uống 0,5mg/ml x 60 m lDestacure VN-16773-13 VN-16773-13 Gracure Phar mIancdeiuatical L tHd ộp 1 lọ 60 ml
Same split behavior (lDestacure). Text mingling is better but still not correct (Gracure Pharmaceutical Ltd becomes 3 columns: Gracure Phar | maceutical LIndia | td).
# 0 1 2 3 4 5 6 7 8 9
# 1 1 Desloratadin Uống 0,5mg/ml x 60 m lDestacure VN-16773-13 VN-16773-13 Gracure Phar maceutical LIndia td Hộp 1 lọ 60 ml
All text is correct (no mingling), but several spaces are not detected, so it's not possible to parse correctly into tables (spaces are not detected when the text of adjacent columns is too close or when columns overlap)
1Desloratadin Uống 0,5mg/ml x 60 mlDestacure VN-16773-13VN-16773-13 Gracure Pharmaceutical Ltd India Hộp 1 lọ 60 ml
I was able to extract individual characters with their coordinates using extract_text_lines(). The automatic line detection of extract_text_lines() sometimes detects lines incorrectly, so I had to merge all characters into a single list and write another parser to sort them into rows.
Now the only thing left to do is to parse them into columns.
import pdfplumber

# pdf_path and column_separators (the x-coordinates of the table's
# vertical separator lines) are defined elsewhere
def pdfplumber_extractchars():
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            chars = []
            data = []
            text = page.extract_text_lines(use_text_flow=True, keep_blank_chars=True, x_tolerance=1, return_chars=True)
            # Merge all lines into a single character list
            for line in text:
                for char in line['chars']:
                    chars.append(char)
            # Parse into rows
            row = [chars[0]]
            for i in range(1, len(chars)):
                # Detect a row change by coordinates: the previous char sits in
                # the last column while the current char jumps back to the first
                if chars[i]['x0'] < column_separators[1] and chars[i-1]['x0'] > column_separators[-2]:
                    data.append(row)
                    row = [chars[i]]
                else:
                    row.append(chars[i])
            data.append(row)
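One way to sketch that remaining column step (illustrative only; `parse_columns` is a hypothetical helper, and `column_separators` is assumed to be the same list of vertical-rule x-coordinates used above) is to bin each character by its `x0`:

```python
from bisect import bisect_right

def parse_columns(row, column_separators):
    """Assign each char dict to a column based on its x0 coordinate,
    then join the chars of each column into a text string."""
    cells = [[] for _ in range(len(column_separators) - 1)]
    for char in row:
        # Index of the rightmost separator at or left of this char's x0
        idx = bisect_right(column_separators, char["x0"]) - 1
        if 0 <= idx < len(cells):
            cells[idx].append(char["text"])
    return ["".join(cell) for cell in cells]

# Toy example: two columns split at x=100
seps = [0, 100, 200]
row = [
    {"x0": 10, "text": "m"}, {"x0": 16, "text": "l"},
    {"x0": 101, "text": "D"}, {"x0": 108, "text": "e"},
]
print(parse_columns(row, seps))  # ['ml', 'De']
```

Of course, this only helps once the characters carry usable coordinates; it does nothing about the clipped/overlapping characters discussed below.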
- Your issue here and the sample PDF also helped me to diagnose a bug in the way pdfplumber handles `use_text_flow=True`. Hoping to push a fix for this soon, and hoping it helps you with your PDF.
FYI v0.10.0, now available, contains this fix. Hopefully it helps with this issue more broadly. I'll be eager to know what you think.
Thank you very much for the fix.
Space is now correctly detected when the text of 2 columns physically overlaps.
However, space is not detected when the text of 2 columns is very close but does not overlap.
Using extract_words, this comes out as a single word "mlDestacure" (after "m" is "l", but it is hidden and occupies the empty space before "D"). It should have been "ml" and "Destacure" separately.
Because of the above, my current best option is to extract individual characters. But I'm not sure how to fix this based on character coordinates alone, since in this case the coordinates are close and continuous, very similar to any standalone word.
My idea is that pdfplumber could detect which character is "visible" and which is "hidden"; then I could write a parser to split words when this attribute changes from "hidden" to "visible".
If you have a better idea, please let me know.
Hope it makes sense. Thank you.
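For what it's worth, that hidden-to-visible idea could be sketched like this, assuming a hypothetical per-character `is_clipped` flag (pdfplumber does not currently expose one; `split_on_visibility` is an illustrative helper, not library code):

```python
def split_on_visibility(chars):
    """Group consecutive chars into words, starting a new word whenever
    the (hypothetical) is_clipped flag flips from True to False."""
    words, current = [], []
    for char in chars:
        if current and current[-1]["is_clipped"] and not char["is_clipped"]:
            words.append("".join(c["text"] for c in current))
            current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words

# "ml" is clipped (hidden); "Destacure" is visible
chars = (
    [{"text": c, "is_clipped": True} for c in "ml"]
    + [{"text": c, "is_clipped": False} for c in "Destacure"]
)
print(split_on_visibility(chars))  # ['ml', 'Destacure']
```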
I wonder why Gracure Pharmaceutical is "hidden" - yet it seems to be parsed correctly?
Trying all the various tools/libraries for extracting text - they all seem to extract mlDestacure as a single word.
In looking for debugging options, I found the `mutool trace` command, which appears to translate the raw PDF commands into XML:
<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 595.32">
<span font="Times New Roman" wmode="0" bidi="0" trm="6 0 0 6">
<g unicode="0" glyph="zero" x="172.1" y="479.74" adv=".5"/>
<g unicode="," glyph="comma" x="175.1" y="479.74" adv=".25"/>
<g unicode="5" glyph="five" x="176.654" y="479.74" adv=".5"/>
<g unicode="m" glyph="m" x="179.654" y="479.74" adv=".778"/>
<g unicode="g" glyph="g" x="184.09401" y="479.74" adv=".5"/>
<g unicode="/" glyph="slash" x="187.09401" y="479.74" adv=".278"/>
<g unicode="m" glyph="m" x="188.76201" y="479.74" adv=".778"/>
<g unicode="l" glyph="l" x="193.214" y="479.74" adv=".278"/>
<g unicode=" " glyph="space" x="194.654" y="479.74" adv=".25"/>
<g unicode="x" glyph="x" x="196.20801" y="479.74" adv=".5"/>
<g unicode=" " glyph="space" x="199.08802" y="479.74" adv=".25"/>
<g unicode="6" glyph="six" x="200.64202" y="479.74" adv=".5"/>
<g unicode="0" glyph="zero" x="203.64202" y="479.74" adv=".5"/>
<g unicode=" " glyph="space" x="206.64202" y="479.74" adv=".25"/>
<g unicode="m" glyph="m" x="208.19602" y="479.74" adv=".778"/>
<g unicode="l" glyph="l" x="212.63602" y="479.74" adv=".278"/>
</span>
</fill_text>
<pop_clip/>
<end_layer/>
<layer name="P"/>
<clip_path winding="eofill" transform="1 0 0 -1 0 595.32">
<moveto x="51.96" y="59.28"/>
<lineto x="782.62" y="59.28"/>
<lineto x="782.62" y="540.6"/>
<lineto x="51.96" y="540.6"/>
<closepath/>
</clip_path>
<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 595.32">
<span font="Times New Roman" wmode="0" bidi="0" trm="6 0 0 6">
<g unicode="D" glyph="D" x="213.26" y="479.74" adv=".722"/>
<g unicode="e" glyph="e" x="217.592" y="479.74" adv=".444"/>
<g unicode="s" glyph="s" x="220.22" y="479.74" adv=".389"/>
<g unicode="t" glyph="t" x="222.5" y="479.74" adv=".278"/>
<g unicode="a" glyph="a" x="224.294" y="479.74" adv=".444"/>
<g unicode="c" glyph="c" x="226.934" y="479.74" adv=".444"/>
<g unicode="u" glyph="u" x="229.574" y="479.74" adv=".5"/>
<g unicode="r" glyph="r" x="232.574" y="479.74" adv=".333"/>
<g unicode="e" glyph="e" x="234.608" y="479.74" adv=".444"/>
</span>
</fill_text>
<pop_clip/>
<end_layer/>
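If I'm reading the trace correctly (my assumption: `adv` is in text space and is scaled by the `trm` factor of 6 to get page units), we can do the arithmetic on those coordinates. The trailing `l` of `ml` actually ends slightly past the start of the `D`, so there is no positive gap for a word-splitter to find:

```python
# Coordinates copied from the trace above
TRM_SCALE = 6  # from trm="6 0 0 6"

l_x0, l_adv = 212.63602, 0.278  # the trailing "l" of "ml" (clipped span)
d_x0 = 213.26                   # the "D" of "Destacure" (next span)

l_x1 = l_x0 + l_adv * TRM_SCALE  # right edge of the "l"
gap = d_x0 - l_x1                # horizontal gap before the "D"

print(round(l_x1, 3))  # 214.304
print(round(gap, 3))   # -1.044 -> the "l" slightly overruns the "D"
```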
The `<clip_path winding="eofill">` entries appear to be the `W*` commands @jsvine showed us in https://github.com/jsvine/pdfplumber/issues/912#issuecomment-1612127276.
However, mutool text extraction still extracts mlDestacure as a single "word" (unless you use the `preserve-spans` option):
$ mutool convert -O preserve-spans -o 2a.txt Downloads/2a.pdf
$ grep -C 3 Dest 2a.txt
Desloratadin
Uống
0,5mg/ml x 60 ml
Destacure
VN-16773-13
VN-16773-13
Gracure Pharmaceutical Ltd
From what I can find, the clipping commands are currently no-ops in pdfminer: https://github.com/pdfminer/pdfminer.six/issues/414 - I'm not sure if this is something that needs to be supported in order for pdfplumber to be able to handle this?
From what I can find, the clipping commands are currently no-ops in pdfminer: https://github.com/pdfminer/pdfminer.six/issues/414 - I'm not sure if this is something that needs to be supported in order for pdfplumber to be able to handle this?
Thanks for this extra context, @cmdlineluser, and for flagging the pdfminer no-op. Unfortunately, that no-op blocks pdfplumber from making use of clipping paths, so I'm not sure we can do much with this here. I keep a fairly close eye on pdfminer.six releases; if/when a future release includes clipping path information, I'll aim to incorporate it. (Maybe something like `char["is_clipped"]: bool`.)
Trying all the various tools/libraries for extracting text - they all seem to extract mlDestacure as a single word.
I think the issue is that the `l` in `ml` bumps right up against the `D` in `Destacure`, thus providing no indication that they're part of separate words. The separation between the other examples of clipped text and the following column of text can be detected because the clipped text either extends beyond the beginning of the next column or stops a bit short of it.
I wonder why Gracure Pharmaceutical is "hidden" - yet it seems to be parsed correctly?
Hmm, I see `Gracure Pharm [...]` in the PDF as not-hidden. But perhaps I'm misunderstanding?
It's possible I am the one misunderstanding things, or using the wrong terminology @jsvine
Here the `ml` is hidden:
and here the `macutical`:
I was just wondering how come it doesn't extract as PharmacuticalIndia - similar to the mlDestacure case - but perhaps it's because there is an actual space character in the text, even though it is "hidden".
Ah, I see; this is a good motivation for me to write more comprehensive documentation about how word segmentation works in pdfplumber. Until then:

- By default, the characters are first sorted by position, and then each adjacent pair's `x0` and `x1` positions are compared to one another. If `next_char["x0"] > curr_char["x1"] + x_tolerance`, we consider `next_char` to begin a new word; otherwise `next_char` is appended to the current word.
- With `use_text_flow=True`, the rule changes slightly. Rather than scan strictly left-to-right, pdfplumber instead examines the characters in the sequence they appear in the PDF's actual commands. Like before, `next_char["x0"] > curr_char["x1"] + x_tolerance` will trigger a new word. But now so will `next_char["x0"] < curr_char["x0"]`, indicative of the text "backtracking" to a further-left location.

Because Gracure Pharmaceutical extends (substantially, in fact) beyond the beginning of the next chunk of text, it triggers that second condition, as, I think, it should.
But because the `l` in `ml` begins just a bit before the `D` in `Destacure`, neither condition is met.
Note: Technically, both criteria are tested when `use_text_flow=False`; it's just that the pre-sorting by `x0` means that the second condition will never be triggered.
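Those two rules can be restated as a small sketch (my paraphrase for illustration, not pdfplumber's actual implementation; the default `x_tolerance` of 3 is assumed):

```python
X_TOLERANCE = 3  # pdfplumber's default x_tolerance

def begins_new_word(curr_char, next_char):
    """Paraphrase of the two word-break conditions described above."""
    # Condition 1: a horizontal gap wider than x_tolerance
    if next_char["x0"] > curr_char["x1"] + X_TOLERANCE:
        return True
    # Condition 2: the text "backtracks" to a further-left position
    # (with use_text_flow=False, pre-sorting by x0 makes this unreachable)
    if next_char["x0"] < curr_char["x0"]:
        return True
    return False

# "Gracure Pharmaceutical" overruns the next column, so the following
# chunk backtracks leftward -> new word (condition 2)
print(begins_new_word({"x0": 300.0, "x1": 306.0}, {"x0": 250.0, "x1": 256.0}))  # True

# The clipped "l" of "ml" ends just past the "D" of "Destacure":
# no gap, no backtracking -> merged into "mlDestacure"
print(begins_new_word({"x0": 212.6, "x1": 214.3}, {"x0": 213.3, "x1": 217.6}))  # False
```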
This is exactly what I was trying to say (though it should be `next_char["x0"] > curr_char["x1"] + x_tolerance`?).
If detection of clipping is not possible, another idea is to check character spacing: with a certain font and font size, the spacing between 2 specific characters should be consistent (in theory).
For example, in our mlDestacure case, the spacing between "m" and "l" is consistent, while the spacing between "l" and "D" would be slightly different from the regular spacing "l" and "D" would have had they been part of the same word.
However, I haven't figured out the rule of spacing in PDF files (sometimes characters even have negative spacing).
Since these PDF files have a consistent font and font size, if I can figure out the spacing rule, I can write a parser to check individual spacings to see whether two characters are part of the same word.
Note: by "spacing" I mean `next_char["x0"] - curr_char["x1"]`
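That heuristic could be prototyped roughly as follows (purely illustrative; `pair_gaps` and `unusual_pairs` are made-up helpers, the 0.5 cutoff is arbitrary, and the approach only helps when the same character pair occurs several times in the document):

```python
from collections import defaultdict
from statistics import median

def pair_gaps(chars):
    """Collect the x-gap (next x0 minus current x1) observed for each
    adjacent character pair, keyed by the pair of glyphs."""
    gaps = defaultdict(list)
    for curr, nxt in zip(chars, chars[1:]):
        gaps[(curr["text"], nxt["text"])].append(nxt["x0"] - curr["x1"])
    return gaps

def unusual_pairs(chars, threshold=0.5):
    """Flag positions whose gap deviates from that pair's median gap by
    more than `threshold` - candidate word-break locations."""
    gaps = pair_gaps(chars)
    flagged = []
    for i, (curr, nxt) in enumerate(zip(chars, chars[1:])):
        typical = median(gaps[(curr["text"], nxt["text"])])
        if abs((nxt["x0"] - curr["x1"]) - typical) > threshold:
            flagged.append(i + 1)  # index where a word break may belong
    return flagged

# Toy data: the pair ("m", "l") normally has a ~0.2 gap, but the last
# occurrence has a 1.5 gap - an outlier worth flagging
chars = [
    {"text": "m", "x0": 0,    "x1": 4},
    {"text": "l", "x0": 4.2,  "x1": 5},
    {"text": "m", "x0": 6,    "x1": 10},
    {"text": "l", "x0": 10.2, "x1": 11},
    {"text": "m", "x0": 12,   "x1": 16},
    {"text": "l", "x0": 17.5, "x1": 18.3},
]
print(unusual_pairs(chars))  # [5]
```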
This is exactly what I was trying to say (though it should be next_char["x0"] > curr_char["x1"] + x_tolerance?).
Thanks! Updated the comment to fix that.
If detection of clipping is not possible, another idea is to check character spacing: With a certain font and font-size, spacing between 2 specific characters should be consistent (in theory).
Yes, I think the difficulty here is the "(in theory)" part. In practice, I think we'd see a lot of unexpected violations of this theory — enough that it'd create a whole class of edge cases perhaps more common than the thing it's trying to fix.
That said, I'm quite open to being persuaded otherwise with examples and testing!
But because the l in ml begins just a bit before the D in Destacure, neither condition is met.
Ah, I see - the `ml` doesn't actually overlap - so while both examples may look similar visually, they're different.
Thanks for the explanation @jsvine.
I noticed from the `mutool trace` output in https://github.com/jsvine/pdfplumber/issues/912#issuecomment-1645369233 that it knows the text boundaries, as they are each in their own `<layer>` (I'm not sure if that is just a property of this particular PDF?)
From some poking around, it looks like these are the `do_BDC()` and `do_EMC()` commands in pdfminer:
Adding in some debug prints there and in `do_TJ()`:
[LAYER]
[SHOW_TEXT] seq=[b'0,', -9, b'5m', 38, b'g/m', 36, b'l', 38, b' ', -9, b'x', 20, b' ', -9, b'60 ', -9, b'm', 38, b'l']
[LAYER]
[SHOW_TEXT] seq=[b'De', 6, b's', 9, b't', -21, b'a', 4, b'c', 4, b'ur', -6, b'e']
Perhaps you know if it's somehow possible to use this layer information to help with this?
Really interesting, thanks for sharing @cmdlineluser. I think you're right about those layers being created by marked-content commands. As it happens @dhdaines is doing some experimentation with extracting those sections in https://github.com/jsvine/pdfplumber/pull/937.
Forcing a word-split when crossing in/out of a marked content section makes sense; certainly something worth trying out if we're able to merge that info.
Another option (perhaps defaulting to `False`) would be to force word-splits on those `<span>`s, which seem to correspond to `TJ` calls. It seems that would work in this particular example, and would be practically universally-available across PDFs (unlike marked content commands, which only some PDFs implement).
Unfortunately, getting access to `TJ` calls/information would seem to require subclassing `pdfminer.six`; something to consider, but it feels like a last resort. (Of course, there's the more radical option of swapping out `pdfminer.six` for another library, something I've considered over the years, but that's going to require a lot more thinking/planning. Still: open to your opinion on this!)
Ironically, I have a similar problem to this where a space character appears for unknown reasons just above a line of text and causes a word break due to the sorting of characters - in this PDF I get "63 5" instead of "635" at the bottom of the page... the solution is either to use `y_tolerance=1` or `use_text_flow=True`...
Forcing a word-split when crossing in/out of a marked content section makes sense; certainly something worth trying out if we're able to merge that info.
This can be problematic because marked content section boundaries can show up just about anywhere - take this PDF for example, running:
import sys
import pdfplumber

pdf = pdfplumber.open(sys.argv[1])
page = pdf.pages[0]
for word in page.extract_words(extra_attrs=["mcid"]):
    print(word["mcid"], word["text"])
You will see that basically every word has its own MCID, but also many words are split into multiple marked content sections:
90 personnage
92 historique
94 dé
95 c
96 é
97 d
98 é
Unfortunately, getting access to `TJ` calls/information would seem to require subclassing `pdfminer.six`; something to consider, but feels like a last resort.
I already do this in #937 ;-) there is really no other option, particularly since `pdfminer.six` does not appear to be actively maintained: https://github.com/jsvine/pdfplumber/pull/937/files#diff-646d362173010ce6a7ab11c23aba8e777a3943f84c3ac1f39a9e3a47c3ad6719R122
It kind of seems to me like the subset of functionality in `pdfminer.six` that is actually used could simply be incorporated into `pdfplumber`, which would give the opportunity to make it more efficient and fix problems like the bad type annotations...
As for switching to a different library ... there doesn't seem to exist one that has:
Maybe pdf-rs could be interesting in the future ... binding Python to Rust is relatively painless.
You will see that basically every word has its own MCID, but also many words are split into multiple marked content sections:
Actually - sorry for the spam here ... but in this case the MCIDs correspond to inline Span elements in the structure tree, so they should be expected not to force word breaks. See `pdfinfo -struct-text` output:
Span (inline)
"historique"
Span (inline)
" "
Span (inline)
"dé"
Span (inline)
"c"
Span (inline)
"é"
Span (inline)
"d"
Span (inline)
"é"
Span (inline)
" "
So basically, no, we should not put word-breaks at marked content section boundaries unless we know that they are block elements.
Thanks for the notes, @dhdaines. Thoughts/responses below:
This can be problematic because marked content section boundaries can show up just about anywhere - take this PDF for example, running: [...]
Ah, very interesting, thanks. I wouldn't want MCIDs to be incorporated into word-splitting by default, but it might be a nice option to have available.
Unfortunately, getting access to TJ calls/information would seem to require subclassing pdfminer.six; something to consider, but feels like a last resort.
I already do this in https://github.com/jsvine/pdfplumber/pull/937 ;-) there is really no other option, particularly since pdfminer.six does not appear to be actively maintained: https://github.com/jsvine/pdfplumber/pull/937/files#diff-646d362173010ce6a7ab11c23aba8e777a3943f84c3ac1f39a9e3a47c3ad6719R122
Although it's true that `pdfminer.six` hasn't had a commit in nine months, I'm not quite ready to give up on it. Activity on the project has been sporadic in the past, followed by spurts of improvements. I worry that monkey-patching puts us down a path of substantially greater development complexity, and may make it more difficult to incorporate pdfminer.six's future improvements (if they occur).
It kind of seems to me like the subset of functionality in pdfminer.six that is actually used could simply be incorporated into pdfplumber
Although it may not seem so at first, `pdfminer.six` does a lot of heavy lifting, handling many of the frustrating edge-cases, nooks, and crannies of the PDF spec. To take a random-ish example, see https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/cmapdb.py
As for switching to a different library ... there doesn't seem to exist one that has: [...]
`pypdfium2` seems the closest to me at this point. I haven't quite figured out how to get low-level access to individual char/path/etc. objects, but it seems like it should be possible. What do you think?
Although it's true that `pdfminer.six` hasn't had a commit in nine months, I'm not quite ready to give up on it. Activity on the project has been sporadic in the past, followed by spurts of improvements. I worry that monkey-patching puts us down a path of substantially greater development complexity, and may make it more difficult to incorporate pdfminer.six's future improvements (if they occur).
Well... it's only sort of monkey-patching, since the `pdfminer.six` API is designed around inheritance (which is really a bad idea in my opinion, but it is what it is). I can at least submit a pull request to support MCIDs in the `PDFPageAggregator`, or better yet just add a method to access the last object created by a `PDFLayoutAnalyzer`.
`pypdfium2` seems the closest to me at this point. I haven't quite figured out how to get low-level access to individual char/path/etc. objects, but it seems like it should be possible. What do you think?
Ah, yes, indeed. The Python bindings won't let you do this, but it is easy to call the underlying C API, which, at least for text, seems to give you everything you need to get individual characters and all their attributes:
https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h
You can see how to do this in the pypdfium2 documentation as well as in my code to read the structure tree.
I shouldn't let my allergy to Google-origin software cloud my judgement here :). And anyway, PDFium wasn't originally created by Google and doesn't seem to have been infected by their software engineering practices and tools (monorepo, bazel, abseil, and that whole bestiary).
This is a continuation of a discussion posted here; please check it for more info.
Describe the bug
When the PDF has overlapping columns (i.e. the columns do not wrap text), all extraction methods (extract_tables, extract_text, extract_words) give incorrect results.
Original text
'aaaa b|bbb' and '1111' (the | is the separator line between the columns)
Expected behavior
'aaaa bbbb' and '1111'
Actual behavior
'aaaa b' and 'b1b1b11' when using extract_tables() or extract_text()
'aaaa' and 'bbbb1111' when using extract_words(use_text_flow=True)
Sample pdf
https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf