Closed nikitar closed 1 year ago
Additionally, if you retrieve whole page text with textpage.get_text_range(0, -1, 'strict')
, it will have a null char at the end. That does not happen for other pages, so I assume it's the same issue.
I'm afraid I don't know any more than you do here.
Well, this currently is the code for get_text_range()
:
def get_text_range(self, index=0, count=-1, errors="ignore"):
if count == -1:
count = self.count_chars() - index
n_bytes = count * 2
buffer = ctypes.create_string_buffer(n_bytes+2)
buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
pdfium_c.FPDFText_GetText(self, index, count, buffer_ptr)
return buffer.raw[:n_bytes].decode("utf-16-le", errors=errors)
What catches the eye is that we don't handle the return of FPDFText_GetText()
, but rely entirely on the count * 2
.
(Unlike many other APIs, FPDFText_GetText()
can't be called with a NULL
buffer to get the size first.)
However, I guess we should use the exact return from FPDFText_GetText()
when slicing the buffer (and probably add some check, too - CC #208). The trailing null char looks a bit odd, like the actual number of bytes written to the buffer might be 2 less than calculated.
However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer
I pushed commit a9c64851eeec36ba0c352ecf776bbf1a92ab0a85 which (I think) should fix the trailing null char part of the issue in accordance with the hypothesis above. (Sorry for the accidental close, GH doesn't understand "Partially fix" correctly.)
Continuing on this, I made the following observation:
>>> import pypdfium2.raw as pdfium_c
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-2)
3398
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-1)
-1
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400)
3399
It looks like you've found the sample for https://groups.google.com/g/pdfium/c/LNkXslbSRPY 😅
Putting the clues together, I think we need to use FPDFText_GetTextIndexFromCharIndex()
to get the exact required buffer size to fix this correctly, not just working around a too large buffer. I will push a commit shortly.
Thank you for reporting this, that was very helpful!
However, unfortunately that commit was not correct. I'll continue to investigate:
>>> textpage.get_text_range(3400-2, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../pypdfium2/_helpers/textpage.py", line 65, in get_text_range
assert in_size == out_size
^^^^^^^^^^^^^^^^^^^
AssertionError
Thank you for looking into this! I've created a ticket with pdfium, in case they might have more information. (I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying)
I think I've finally got get_text_range()
correct with commit 535398d, but this was fairly complicated.
(Commit 71bdc71 was wrong and is now reverted.)
Now I get this, which should be correct:
>>> textpage.get_text_range(3398, 1)
'z'
>>> textpage.get_text_range(3399, 1)
''
>>> textpage.get_text_range(3400, 1)
'\r'
>>> textpage.get_text_range(3398, 2)
'z'
>>> textpage.get_text_range(3399, 2)
'\r'
>>> textpage.get_text_range(3398, 3)
'z\r'
Char 3399
is "passive", i.e. it exists in the internal character list but is excluded from the FPDFText_GetText()
output.
In this light, you simply cannot expect calls like get_text_range(index-1, 2)[1]
to work in the general case, e.g. get_text_range(3399, 2)[1]
on the given PDF would now raise an index error, which is correct.
That's just how the API is, FPDFText_GetText()
may exclude or insert characters, so the returned text's length does not have to match the given count, even if it may for most PDFs.
I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying
Hmm, I suppose it's just two different representations / API layers. One is the internal character list as stored in the PDF, the other the "polished" output by FPDFText_GetText()
. It may be confusing for a caller, but I don't think it's technically wrong. Though, I agree it might be better to avoid the distinction in the first place if it could be helped.
That said, even if we accept this design as intentional, here some concerns with the present situation:
Incomplete and non-obvious documentation (even if one can kind of see what is meant in retrospect)
FPDFText_GetText()
behaving inconsistently when it comes to passive chars:
range(index, index+count)
(OK).count
chars from that point onward, i.e. it may read beyond the scope range(index, index+count)
(ODD).pypdfium2 now accounts for this by adjusting index and count to exclude leading/trailing passive chars in the first place, so we are independent of how FPDFText_GetText()
handles them.
That was a bit troublesome to find out.
FPDFText_GetText()
not allowing to be called with a NULL buffer first (unlike other text APIs), which means the caller has to install some fairly complicated logic to get the allocation right. A naive caller just constructing the buffer from char count without FPDFText_GetTextIndexFromCharIndex()
might even threaten to run into OOB write in case of newly inserted characters.
I believe this is fixed for what the pypdfium2 side is concerned. Any further considerations are up to pdfium (https://bugs.chromium.org/p/pdfium/issues/detail?id=2079), so I'd close this issue, OK? Feel free to reopen if you think there's something left to do in pypdfium2.
Essentially
get_text_range(index, 1) != get_text_range(index - 1, 2)[1]
. It's quite rare, but because the shift propagates throughout the rest of the page it's quite annoying. It means you cannot use api liketextpage.get_charbox
if you usedget_text_range(0, -1)
before, because indexes will be out of sync for the rest of the page.I suspect this is a PDFium issue, but I don't have a C++ setup to test it directly, and it may be some crazy encoding thing, so though I'd report here first.
Full code, using the attached document.
Barclays-PLC-Annual-Report-2022.pdf