Inconsistent textpage.get_text_range results - Githubissues

pypdfium2-team / pypdfium2

Python bindings to PDFium

https://pypdfium2.readthedocs.io/

419 stars 17 forks source link

Inconsistent textpage.get_text_range results #261

Closed nikitar closed 1 year ago

nikitar commented 1 year ago

Essentially get_text_range(index, 1) != get_text_range(index - 1, 2)[1]. It's quite rare, but because the shift propagates throughout the rest of the page it's quite annoying. It means you cannot use api like textpage.get_charbox if you used get_text_range(0, -1) before, because indexes will be out of sync for the rest of the page.

I suspect this is a PDFium issue, but I don't have a C++ setup to test it directly, and it may be some crazy encoding thing, so though I'd report here first.

Full code, using the attached document.

PDF_PATH = .....

import pypdfium2
from contextlib import closing

with closing(pypdfium2.PdfDocument(PDF_PATH)) as pdf:
    # page index is 159 (0-based), but it's named 158 in pdf viewers because there are two named pages in the beginning
    page = pdf[159]
    textpage = page.get_textpage()

    char_index = 3400
    char1 = textpage.get_text_range(char_index, 1, 'strict')
    char2 = textpage.get_text_range(char_index - 1, 2, 'strict')[1]
    assert char1 == char2

Barclays-PLC-Annual-Report-2022.pdf

nikitar commented 1 year ago

Additionally, if you retrieve whole page text with textpage.get_text_range(0, -1, 'strict'), it will have a null char at the end. That does not happen for other pages, so I assume it's the same issue.

Screenshot 2023-09-21 at 2 03 21 pm

mara004 commented 1 year ago

I'm afraid I don't know any more than you do here.

Well, this currently is the code for get_text_range():

def get_text_range(self, index=0, count=-1, errors="ignore"):

    if count == -1:
        count = self.count_chars() - index

    n_bytes = count * 2
    buffer = ctypes.create_string_buffer(n_bytes+2)
    buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))

    pdfium_c.FPDFText_GetText(self, index, count, buffer_ptr)
    return buffer.raw[:n_bytes].decode("utf-16-le", errors=errors)

What catches the eye is that we don't handle the return of FPDFText_GetText(), but rely entirely on the count * 2. (Unlike many other APIs, FPDFText_GetText() can't be called with a NULL buffer to get the size first.) However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer (and probably add some check, too - CC #208). The trailing null char looks a bit odd, like the actual number of bytes written to the buffer might be 2 less than calculated.

mara004 commented 1 year ago

However, I guess we should use the exact return from FPDFText_GetText() when slicing the buffer

I pushed commit a9c64851eeec36ba0c352ecf776bbf1a92ab0a85 which (I think) should fix the trailing null char part of the issue in accordance with the hypothesis above. (Sorry for the accidental close, GH doesn't understand "Partially fix" correctly.)

mara004 commented 1 year ago

Continuing on this, I made the following observation:

>>> import pypdfium2.raw as pdfium_c
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-2)
3398
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400-1)
-1
>>> pdfium_c.FPDFText_GetTextIndexFromCharIndex(textpage, 3400)
3399

It looks like you've found the sample for https://groups.google.com/g/pdfium/c/LNkXslbSRPY 😅

mara004 commented 1 year ago

Putting the clues together, I think we need to use FPDFText_GetTextIndexFromCharIndex() to get the exact required buffer size to fix this correctly, not just working around a too large buffer. I will push a commit shortly.

Thank you for reporting this, that was very helpful!

mara004 commented 1 year ago

However, unfortunately that commit was not correct. I'll continue to investigate:

>>> textpage.get_text_range(3400-2, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pypdfium2/_helpers/textpage.py", line 65, in get_text_range
    assert in_size == out_size
           ^^^^^^^^^^^^^^^^^^^
AssertionError

nikitar commented 1 year ago

Thank you for looking into this! I've created a ticket with pdfium, in case they might have more information. (I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying)

mara004 commented 1 year ago

I think I've finally got get_text_range() correct with commit 535398d, but this was fairly complicated. (Commit 71bdc71 was wrong and is now reverted.)

Now I get this, which should be correct:

>>> textpage.get_text_range(3398, 1)
'z'
>>> textpage.get_text_range(3399, 1)
''
>>> textpage.get_text_range(3400, 1)
'\r'
>>> textpage.get_text_range(3398, 2)
'z'
>>> textpage.get_text_range(3399, 2)
'\r'
>>> textpage.get_text_range(3398, 3)
'z\r'

Char 3399 is "passive", i.e. it exists in the internal character list but is excluded from the FPDFText_GetText() output. In this light, you simply cannot expect calls like get_text_range(index-1, 2)[1] to work in the general case, e.g. get_text_range(3399, 2)[1] on the given PDF would now raise an index error, which is correct. That's just how the API is, FPDFText_GetText() may exclude or insert characters, so the returned text's length does not have to match the given count, even if it may for most PDFs.

mara004 commented 1 year ago

I'm hoping they'll still classify it as a bug, as constantly converting from one indexing scheme to another is bound to be annoying

Hmm, I suppose it's just two different representations / API layers. One is the internal character list as stored in the PDF, the other the "polished" output by FPDFText_GetText(). It may be confusing for a caller, but I don't think it's technically wrong. Though, I agree it might be better to avoid the distinction in the first place if it could be helped.

That said, even if we accept this design as intentional, here some concerns with the present situation:

Incomplete and non-obvious documentation (even if one can kind of see what is meant in retrospect)
FPDFText_GetText() behaving inconsistently when it comes to passive chars:
- Trailing passive chars are counted, meaning the function stops with the last active char in range(index, index+count) (OK).
- Leading passive chars, however, are not counted, meaning pdfium just increments the start index to the first active char and consumes count chars from that point onward, i.e. it may read beyond the scope range(index, index+count) (ODD).
pypdfium2 now accounts for this by adjusting index and count to exclude leading/trailing passive chars in the first place, so we are independent of how FPDFText_GetText() handles them. That was a bit troublesome to find out.
FPDFText_GetText() not allowing to be called with a NULL buffer first (unlike other text APIs), which means the caller has to install some fairly complicated logic to get the allocation right. A naive caller just constructing the buffer from char count without FPDFText_GetTextIndexFromCharIndex() might even threaten to run into OOB write in case of newly inserted characters.

mara004 commented 1 year ago

I believe this is fixed for what the pypdfium2 side is concerned. Any further considerations are up to pdfium (https://bugs.chromium.org/p/pdfium/issues/detail?id=2079), so I'd close this issue, OK? Feel free to reopen if you think there's something left to do in pypdfium2.