For some documents, many words get lumped together by get_text()

BrainAnnex commented 1 year ago

Describe the bug

I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.

For most PDF files, pymupdf (version 1.23.5) does a fine job... but for some files (such as the one enclosed, "Gravity.pdf"), a lot of words emerge glued together.

To Reproduce

The file in question (but NOT the only one!) is: Gravity.pdf

pip install pymupdf
import fitz
print(fitz.__doc__)    # Says  "PyMuPDF 1.23.5: Python bindings for the MuPDF 1.23.4 library"
pdf_name = "Gravity.pdf"
doc = fitz.open(pdf_name)
doc.metadata    # Shows:   'format': 'PDF 1.6'
page = doc.load_page(11)
page.get_text(flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP | fitz.TEXT_DEHYPHENATE)

Output

It contains several words fused together, such as in this portion: and,\nsureenough,bothfellwiththesameaccelerationandreachedthe\nMoon’s surface together.2

Screenshots

Full text of the parse: error

This is how it looks in the PDF: source

Your configuration

I've tried it both on my local computer AND on Google Colab. The problem is the same!

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) on Colab gives:

PyMuPDF 1.23.5: Python bindings for the MuPDF 1.23.4 library.
Version date: 2023-10-11 00:00:01.
Built for Python 3.10 on linux (64-bit).

On my local computer, it gives:

 win32 

PyMuPDF 1.23.5: Python bindings for the MuPDF 1.23.4 library.
Version date: 2023-10-11 00:00:01.
Built for Python 3.8 on win32 (64-bit).

PyMuPDF version, installation method (wheel or generated from source).

On Colab, I issue !pip install pymupdf , and it says:

Collecting pymupdf
  Downloading PyMuPDF-1.23.5-cp310-none-manylinux2014_x86_64.whl (4.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.3/4.3 MB 12.5 MB/s eta 0:00:00
Collecting PyMuPDFb==1.23.5 (from pymupdf)
  Downloading PyMuPDFb-1.23.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30.6/30.6 MB 43.3 MB/s eta 0:00:00
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.23.5 pymupdf-1.23.5

On my local computer, I let PyCharm deal with it. (I think it does a pip install)

Additional context

Words fused together occur A LOT when parsing the attached PDF.

I suspect you'll say that this file is malformed. Maybe it is... but another software library, pypdf , parses it just fine.

I have noticed that the lost spaces are far more prevalent in extractions by pypdf, compared to PyMuPDF - BUT for some files (such as the one I'm reporting here) it's the opposite, and pypdf does far better.

Empirically, I've noticed an intriguing complementarity between pypdf and PyMuPDF: for files where one messes up badly, the other does well, and vice versa. Maybe they use different thresholds for detecting blank spaces in sentences? Maybe there is some insight to gain from this?
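
One rough way to spot affected pages automatically, so that a second extractor can be tried on them, is to look for implausibly long tokens. This is only a heuristic sketch: the function name `suspicious_tokens` and the 25-character cutoff are made up for illustration, and long URLs or technical terms would trigger false positives.

```python
def suspicious_tokens(text, max_len=25):
    """Return whitespace-separated tokens longer than max_len characters.

    Hypothetical helper: max_len=25 is an arbitrary cutoff, not a value
    taken from PyMuPDF or pypdf.
    """
    return [tok for tok in text.split() if len(tok) > max_len]


sample = (
    "and,\nsureenough,bothfellwiththesameaccelerationandreachedthe\n"
    "Moon's surface together.2"
)
print(suspicious_tokens(sample))
```

A page whose extraction yields many such tokens is a candidate for re-extraction with another library or another threshold.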

Thanks!

JorjMcKie commented 1 year ago

There is an algorithm in MuPDF, which generates spaces between characters where it seems appropriate - based on some criteria like font, font size, character width etc. In these cases, the threshold to start a new word has not been reached - and you can visually confirm yourself, that the words in your example are indeed positioned very closely together.

In any case it is a MuPDF issue, and I have submitted a bug report at its issue tracker here.

BrainAnnex commented 1 year ago

Thanks, @JorjMcKie !
It'd be nice to have a user-settable threshold, for situations (not super-common, but not exactly rare, either - in my tests) when the words are not spaced enough for the algorithm to make the right choice.

Does such a setting exist?

JorjMcKie commented 1 year ago

Thanks for the suggestion, but no, there is no such parameter yet. How about suggesting this to the MuPDF developers directly in this public Discord channel? As with the PyMuPDF channel, there are always nice people around, open to discussing anything about MuPDF. They may also have ideas that help you.

Are you aware that you can develop a workaround yourself while waiting for a better solution? Just extract by character with page.get_text("rawdict") and check the inter-character distances ...
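
As a minimal sketch of that idea, the helper below computes the horizontal gaps between consecutive characters of one line. It assumes the character layout of "rawdict" output, where each character is a dict with the keys "c" and "bbox" = (x0, y0, x1, y1). The name `char_gaps` and the sample data are made up so that the snippet runs without a PDF.

```python
def char_gaps(chars):
    """Gap between each character's right edge and the next one's left edge."""
    return [
        nxt["bbox"][0] - cur["bbox"][2]
        for cur, nxt in zip(chars, chars[1:])
    ]


# Synthetic line: "a" and "b" touch, "c" starts 2 units further right.
line = [
    {"c": "a", "bbox": (0.0, 0.0, 5.0, 10.0)},
    {"c": "b", "bbox": (5.0, 0.0, 10.0, 10.0)},
    {"c": "c", "bbox": (12.0, 0.0, 17.0, 10.0)},
]
print(char_gaps(line))
```

With real data, `chars` would be the concatenated "chars" lists of all spans in one line of the "rawdict" output.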

JorjMcKie commented 1 year ago

For consideration, here is a test script that pieces together an alternative plain text output based on per-character extraction. This time, we trigger a word break whenever the inter-character distance is at least 1.5. The last line prints the number of spaces detected or generated by MuPDF, and the average width of these "natural" spaces:

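import fitz

doc = fitz.open("Gravity.pdf")
page = doc[11]
space_count = 0  # number of space characters delivered by MuPDF
space_w = 0  # accumulated width of those spaces
for b in page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]:
    for l in b["lines"]:
        text = ""
        chars = []  # all characters of the line, across its spans
        for s in l["spans"]:
            chars.extend(s["chars"])
        char_count = len(chars)
        if char_count == 0:  # skip lines without characters
            continue
        if char_count == 1:
            print(chars[0]["c"])
            continue
        for i in range(1, char_count):
            c0 = chars[i - 1]
            r0 = fitz.Rect(c0["bbox"])
            c1 = chars[i]
            r1 = fitz.Rect(c1["bbox"])
            text += c0["c"]
            if r1.x0 - r0.x1 >= 1.5:  # gap wide enough: start a new word
                text += " "
            if c1["c"] == " ":
                space_count += 1
                space_w += r1.width
        print(text + c1["c"])  # append the last character of the line
print()
if space_count:  # avoid dividing by zero on pages without spaces
    print(f"space count {space_count}, avg width = {space_w/space_count}")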

The output is as the reader of the page would expect:

Can You Feel the Force?
3
front of a video camera (Figure 0.1).1 In the absence of any atmo-
sphere, the hammer and feather fell without any air resistance;
the only force acting upon them was the Moon’s gravity and,
sure enough, both fell with the same acceleration and reached the
Moon’s surface together.2
What would it feel like to fall towards the Moon’s surface along
with the feather and the hammer? We would fall at the same rate
as these other objects and hit the ground alongside both. Even
more importantly, all the parts of our body would be acceler-
ated in exactly the same way. Our head would fall with the same
acceleration as our kneecaps. Our feet would fall with the same
acceleration as our boots, consequently we would not feel any-
thing at all—we would be weightless. There is nothing special
about gravity on the Moon. What is special about the Moon is that
it is airless and, as there is no air resistance, the only force acting
on us is gravity.
Figure 0.1 Apollo 15 astronaut David Scott is standing on the Moon’s
surface holding a hammer in his right hand and a feather in his left.

space count 172, avg width = 2.584325213764989
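
The fixed value 1.5 works for this page, but other documents may need a different one. One possible way to choose it per line or per page is to sort the observed gaps and cut at the widest jump between neighbouring values. This is offered purely as an assumption for experimentation, not as something MuPDF does, and the function name `guess_gap_threshold` is hypothetical.

```python
def guess_gap_threshold(gaps):
    """Return the midpoint of the widest jump between sorted gap values.

    Returns None when fewer than two gaps are available.
    """
    ordered = sorted(gaps)
    if len(ordered) < 2:
        return None
    # Pair each value with its successor and keep the pair furthest apart.
    low, high = max(zip(ordered, ordered[1:]), key=lambda p: p[1] - p[0])
    return (low + high) / 2


# Made-up gaps: four inside words (near 0) and two between words (near 2).
gaps = [0.0, 0.1, 0.2, 1.8, 2.0, 0.1]
print(guess_gap_threshold(gaps))
```

If a line contains no word breaks at all, the widest jump is just noise, so a sanity floor on the result would be needed in practice.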

BrainAnnex commented 1 year ago

Thanks @JorjMcKie ! I'm expanding the open-source project BrainAnnex.org to also provide a full-text indexing/search feature for documents (incl. PDF) managed by a "Knowledge and multimedia content management system" that employs the power of graph databases... I elaborate in this article.

I happen to have a substantial number of PDF books, scientific papers and other documents at my disposal... and I'm using some of them to test the system.

The extraction of individual words is the key element for this process.

Empirically, I found that PyMuPDF is vastly better than pdfplumber , and typically substantially better than pypdf - though, as I mentioned, there is a smallish but non-trivial number of cases where PyMuPDF errs a lot (and, interestingly pypdf does well in those cases... maybe some complementary aspect in their word-detection algorithms??)

I'm definitely impressed by your efforts with PyMuPDF , and the efforts of the MuPDF team!

I will experiment with the "rawdict" algorithm that you proposed - thanks! - and will report on the results. I understand that word detection in PDFs is something of an art form!

JorjMcKie commented 1 year ago

@BrainAnnex - thank you very much for your feedback! We are glad you find PyMuPDF useful.

maybe some complementary aspect in their word-detection algorithms??)

No, this is as simple as a lower threshold value for inter-character distances. In your example, word separation matches the reader's perception if the threshold is a gap to the next character of at least 25% of the current character's width. This means that the following code snippet will produce a satisfactory output:

import fitz

doc = fitz.open("Gravity.pdf")
page = doc[11]

for b in page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]:
    for l in b["lines"]:
        text = ""
        chars = []  # all characters of the line, across its spans
        for s in l["spans"]:
            chars.extend(s["chars"])
        char_count = len(chars)
        if char_count == 0:  # skip lines without characters
            continue
        if char_count == 1:
            print(chars[0]["c"])
            continue
        for i in range(1, char_count):
            c0 = chars[i - 1]
            r0 = fitz.Rect(c0["bbox"])
            c1 = chars[i]
            r1 = fitz.Rect(c1["bbox"])
            text += c0["c"]
            if (c0["c"] != " " and c1["c"] != " "
                and r1.x0 - r0.x1 >= r0.width * 0.25):
                # neither character is a space and the gap is at least
                # 25% of the current character's width: start a new word
                text += " "
        print(text + c1["c"])  # append the last character of the line
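
For experimenting with other thresholds, the same rule can be wrapped in a function whose ratio is a parameter. This is a sketch, not part of PyMuPDF: the name `join_chars` and its `ratio` argument are made up, and the input is assumed to be one line's characters in "rawdict" form (dicts with "c" and "bbox"). The sample data is synthetic so the snippet runs without a PDF.

```python
def join_chars(chars, ratio=0.25):
    """Rebuild one line of text, adding a space where the gap is wide enough.

    chars: dicts with "c" and "bbox" = (x0, y0, x1, y1), as in "rawdict".
    ratio: minimum gap, as a fraction of the left character's width.
    """
    if not chars:
        return ""
    text = ""
    for cur, nxt in zip(chars, chars[1:]):
        text += cur["c"]
        x0, _, x1, _ = cur["bbox"]
        gap = nxt["bbox"][0] - x1
        # Only add a space between two non-space characters.
        if cur["c"] != " " and nxt["c"] != " " and gap >= (x1 - x0) * ratio:
            text += " "
    return text + chars[-1]["c"]


line = [
    {"c": "a", "bbox": (0.0, 0.0, 5.0, 10.0)},
    {"c": "b", "bbox": (5.0, 0.0, 10.0, 10.0)},
    {"c": "c", "bbox": (12.0, 0.0, 17.0, 10.0)},
]
print(join_chars(line))             # default ratio
print(join_chars(line, ratio=0.5))  # stricter: gap must be half a character wide
```

Lowering `ratio` splits more aggressively and raising it keeps more characters together, which makes it easy to compare settings on documents that behave differently.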

julian-smith-artifex-com commented 12 months ago

There's an experimental alternative available in the recently-released "rebased" implementation of PyMuPDF-1.23.6, making direct use of MuPDF's Extract facility via the OutputType_DOCX device and the new space-guess setting.

Here's some example code that uses both Page.get_text() and OutputType_DOCX.

    import os

    import fitz_new as fitz

    path = os.path.relpath( 'Gravity.pdf')
    document = fitz.open(path)
    page = document.load_page(11)

    # Use Page.get_text().
    text = page.get_text(flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP | fitz.TEXT_DEHYPHENATE)
    n = text.count(' ')
    print( f'Text from page.get_text(): {n=}\n{text}')

    # Use MuPDF's Extract.
    buffer_ = fitz.mupdf.FzBuffer(1)
    out = fitz.mupdf.FzOutput( buffer_)
    space_guess = 0.3   # Expected width of spaces, relative to adjoining characters.
    writer = fitz.mupdf.FzDocumentWriter(
            out,
            f'text,space-guess={space_guess}',
            fitz.mupdf.FzDocumentWriter.OutputType_DOCX,
            )
    device = fitz.mupdf.fz_begin_page(writer, fitz.mupdf.fz_bound_page(page))
    fitz.mupdf.fz_run_page(page, device, fitz.mupdf.FzMatrix(), fitz.mupdf.FzCookie())
    fitz.mupdf.fz_end_page(writer)
    fitz.mupdf.fz_close_document_writer(writer)
    text = buffer_.fz_buffer_extract()
    text = text.decode('utf8')
    n = text.count(' ')
    print(f'Text from FzDocumentWriter.OutputType_DOCX: {n=}\n{text}')

For me, Page.get_text() gives text with 172 spaces containing "sureenough,bothfellwiththesameaccelerationandreachedthe", while OutputType_DOCX with space_guess = 0.3 gives text with 197 spaces without (I think) any incorrectly joined words.

Depending on how well this works in other cases, you might still be better off using @JorjMcKie's "rawdict" approach.
