pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Repeated word out of place ruins text #4042

Closed brucenielson closed 1 week ago

brucenielson commented 2 weeks ago

Karl Popper A World of Propensities

I'm using version 1.24.13.

Attached is a pdf I'm trying to load as markdown. The first page reads as follows:

I shall begin with some personal memories...

It was 54 years ago, in Prague in August 1934, that I first attended an International Congress of Philosophy. I found it uninspiring. But the Congress was preceded by another meeting in Prague, organized by Otto Neurath, who had kindly invited me to attend a 'Preliminary Conference' ('Vorkonferenz' as he called it) which he organized on behalf of the Vienna Circle.

I came to Prague with the corrected page proofs of my book, It Logik der Forschung. was published three months later...essentially an Aristotelian theory at which, it appears, Tarski and Godel arrived, independently at almost the same It time. was first published by Tarski in 1930, whereupon It Godel, of course, accepted Tarski's priority.

Note the weird newline after the first "It" and then that word "It" becomes an unwanted artifact that breaks up the text after that several times. This problem repeats on the next page with some other word.

I tried loading the same PDF using PyPDF and the problem goes away. (But obviously PyPDF isn't trying to convert to markdown.)

See: https://bugs.ghostscript.com/show_bug.cgi?id=708129

I can get you the file I used if desired.

brucenielson commented 2 weeks ago

Try this link:

JorjMcKie commented 2 weeks ago

I have no time to fight my way through that website. I don't see an obvious link to the PDF in question. Do me a favor and download (parts of) the file.

brucenielson commented 2 weeks ago
JorjMcKie commented 2 weeks ago

Thanks for the file. I cannot reproduce your problem, however: my output looks quite good, given that this is an OCR-ed book with the usual share of misreadings ... [Uploading test.zip…]()

brucenielson commented 2 weeks ago

Do you have the resulting file I can look at? It surely doesn't work well for me. And a regular PyPDF extract works perfectly.

JorjMcKie commented 2 weeks ago

Here you are ... [test.zip]()

brucenielson commented 2 weeks ago

I copied your code and I still get the issue.

I'm on windows with these versions:

PyMuPDF==1.24.13 pymupdf4llm==0.0.17

Is it possibly a version or environment related issue?
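When two people run the same code and see different results, it can help to post the exact interpreter, OS, and package versions in a form that is easy to compare. Below is a minimal sketch using only the standard library; the helper name `environment_report` is made up for illustration and is not part of pymupdf or pymupdf4llm.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("PyMuPDF", "pymupdf4llm")):
    """Return interpreter, OS, and installed package versions as a dict."""
    report = {
        "python": sys.version.split()[0],   # e.g. "3.11.x"
        "platform": platform.platform(),    # OS name and build
    }
    for name in packages:
        try:
            # Version as recorded in the installed distribution metadata.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this in each environment and pasting the output side by side makes version or platform differences visible at a glance.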

brucenielson commented 2 weeks ago

I have confirmed that if I get the text page by page out of pymupdf directly there is no issue:

Ladies and Gentlemen, I shall begin with some personal memories and a personal confession of faith, and only then turn to the topic of my lecture. It was 54 years ago, in Prague in August 1934, that I first attended an International Congress of Philosophy. I found it uninspiring. But the Congress was preceded by another meeting in Prague, organized by Otto Neurath, who had kindly invited me to attend a 'Preliminary Conference' ('Vorkonferenz' as he called it) which he organized on behalf of the Vienna Circle. I came to Prague with the corrected page proofs of my book, Logik der Forschung. It was published three months later in Vienna, and in English 25 years later as The Logic of Scientific Discovery. In Prague it was read by two Polish philosophers, Alfred Tarski and Janina Hosiasson-Lindenbaum, the wife of Tarski's friend and collaborator, Adolf Lindenbaum. Janina Hosiasson and her husband were murdered when, 5 years later, the Nazis invaded Poland and systematically exterminated what they described as its 'Fuhrerschicht': its 'intellectual elite'. Tarski went from Prague to Vienna where he stayed for a year and where we became friends. Philosophically, it was the most important friendship of my life. For I learnt from Tarski the logical defensibility and the power of absolute and objective truth: essentially an Aristotelian theory at which, it appears, Tarski and Godel arrived, independently at almost the same time. It was first published by Tarski in 1930, whereupon Godel, of course, accepted Tarski's priority. It is a theory of objective truth - truth as the correspondence of a statement with the facts - and of absolute truth: if an unambiguously formulated statement is true in one language, then any correct translation of it into any other language is also true. This theory is the great bulwark against relativism and 3

brucenielson commented 2 weeks ago

My code:

import pymupdf4llm, pymupdf
import pathlib

doc = pymupdf.open(r"D:\Documents\AI\BookSearchArchive\documents\A World of Propensities by Karl Popper (1997).pdf")
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    text = page.get_text("text")
    print(text)
md = pymupdf4llm.to_markdown(doc)
path = pathlib.Path("test.txt")
path.write_text(md, encoding="utf-8")

But the to_markdown call introduces the problem.

To be clear this call has no bug:

doc = pymupdf.open(r"D:\Documents\AI\BookSearchArchive\documents\A World of Propensities by Karl Popper (1997).pdf")

This call creates the problem of the repeated 'It': md = pymupdf4llm.to_markdown(doc)
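One way to pin down a displaced word such as the stray 'It' is to diff the two outputs word by word. Below is a minimal sketch using only the standard library's `difflib`. The function name `word_diff` is hypothetical and not part of pymupdf or pymupdf4llm, and the sample strings are shortened from the text quoted in this thread. In practice the first argument would be the `page.get_text("text")` output and the second the `to_markdown` output for the same page.

```python
import difflib


def word_diff(reference, candidate):
    """List word-level differences between two texts.

    Whitespace (including newlines) is ignored by splitting on it.
    Returns tuples of (operation, reference_words, candidate_words).
    """
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(None, ref_words, cand_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, " ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    plain = ("page proofs of my book, Logik der Forschung. "
             "It was published three months later")
    markdown = ("page proofs of my book, It Logik der Forschung. "
                "was published three months later")
    for difference in word_diff(plain, markdown):
        print(difference)
```

A word that shows up once as an insert and once as a delete a few positions apart was moved rather than lost, which matches the symptom described here.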

JorjMcKie commented 2 weeks ago

Then this issue does not belong in this repository. It is probably not even a bug, because standard get_text() makes no provision for looking at page layout and other complications.

brucenielson commented 2 weeks ago

I don't follow why you are saying it is not a bug. Clearly it doesn't work as intended for some reason. Or rather, it works for you in your environment and version but not for me, even though we're using the same code.

It must be a bug in pymupdf4llm's to_markdown call. I traced it to 'write_text'.

I take it pymupdf4llm is not part of this repo? What is the correct repo? Pymupdf/RAG is the repo that specifically mentions pymupdf4llm.

JorjMcKie commented 2 weeks ago

I tried it again and still cannot reproduce the problem. Here is my result [original.zip](). Environment Windows 3.11, Python 3.9, pymupdf 1.24.13, pymupdf4llm 0.0.17.

brucenielson commented 2 weeks ago

Here is my result with the same code you are using.

brucenielson commented 2 weeks ago

My environment is the same as yours except that I'm on a newer version of Python, namely 3.11. I'll have to try installing 3.9 and see if it makes a difference.

brucenielson commented 2 weeks ago

Here is a new attempt. Still using Python 3.11 but now I've removed everything from the virtual environment other than:

pip install pymupdf4llm

And I still get the weird effect. Let me try Python 3.9. It is no longer considered up to date, so there is no installer available any more. But they do have a zip and I'll try it out.

brucenielson commented 2 weeks ago

Well, I tried downloading Python 3.9 and I get the same problem. So I guess that wasn't it either. I am on Windows 11.

JorjMcKie commented 2 weeks ago

I tried multiple Python versions and still cannot reproduce your problem. Out of ideas 🤷‍♂️.

brucenielson commented 2 weeks ago

Yeah, same here. I can't figure out why I'm getting it and you aren't. I've tried eliminating all the obvious possibilities: the Python version, incompatibility with some other module, package versions, etc. Nothing. I tried installing a brand new Python 3.9 with nothing in it but pymupdf4llm. Nothing worked.

brucenielson commented 2 weeks ago

Okay, I got the problem to repeat in Colab!

Just to be sure, here is the document again that I'm using.

brucenielson commented 2 weeks ago

From the Colab link I just sent:

Ladies and Gentlemen,

I shall begin with some personal memories and a personal confession of faith, and only then turn to the topic of my lecture.

It was 54 years ago, in Prague in August 1934, that I first attended an International Congress of Philosophy. I found it uninspiring. But the Congress was preceded by another meeting in Prague, organized by Otto Neurath, who had kindly invited me to attend a 'Preliminary Conference' ('Vorkonferenz' as he called it) which he organized on behalf of the Vienna Circle.

I came to Prague with the corrected page proofs of my book, It Logik der Forschung. was published three months later in Vienna, and in English 25 years later as The Logic of Scientific Discovery. In Prague it was read by two Polish philosophers, Alfred Tarski and Janina Hosiasson-Lindenbaum, the wife of Tarski's friend and collaborator, Adolf Lindenbaum. Janina Hosiasson and her husband were murdered when, 5 years later, the Nazis invaded Poland and systematically exterminated what they described as its 'Fuhrerschicht': its 'intellectual elite'. Tarski went from Prague to Vienna where he stayed for a year and where we became friends. Philosophically, it was the most important friendship of my life. For I learnt from Tarski the logical defensibility and the power of absolute and objective truth: essentially an Aristotelian theory at which, it appears, Tarski and Godel arrived, independently at almost the same It time. was first published by Tarski in 1930, whereupon It Godel, of course, accepted Tarski's priority. is a theory of objective truth - truth as the correspondence of a statement with the facts - and of absolute truth: if an unambiguously formulated statement is true in one language, then any correct translation of it into any other language is also true. This theory is the great bulwark against relativism and

brucenielson commented 1 week ago

Can we please delete out the pdf from the issue once you have it downloaded to work with? Probably need to remove links to the output too since it probably contains the entire pdf.

JorjMcKie commented 1 week ago

> Can we please delete out the pdf from the issue once you have it downloaded to work with? Probably need to remove links to the output too since it probably contains the entire pdf.

Done.