Closed: @BrainAnnex closed this issue 4 months ago.
MuPDF has an algorithm that generates spaces between characters where it deems appropriate, based on criteria like font, font size, and character width. In these cases, the threshold to start a new word has not been reached, and you can visually confirm yourself that the words in your example are indeed positioned very close together.
In any case, it is a MuPDF issue, and I have submitted a bug report at its issue tracker here.
Thanks, @JorjMcKie !
It'd be nice to have a user-settable threshold for situations (not super-common, but not exactly rare either, in my tests) where the words are not spaced widely enough for the algorithm to make the right choice.
Does such a setting exist?
Thanks for the suggestion. But no, there is no such parameter yet. How about suggesting this to the MuPDF developers directly in this public Discord channel? As with the PyMuPDF channel, there are always nice people around, open to discussing anything about MuPDF. Maybe there are also ideas there that may help you.
You are aware that you can develop a workaround yourself while waiting for a better solution? Just extract per character via page.get_text("rawdict")
and check the inter-character distances ...
For consideration, here is a test script that assembles an alternative plain-text output based on a per-character extraction. This time, we trigger a word break whenever the inter-character distance exceeds 1.5. The last line prints the number of spaces as detected or generated (by MuPDF), and the average width of these "natural" spaces:
```python
import fitz

doc = fitz.open("Gravity.pdf")
page = doc[11]
space_count = 0
space_w = 0
for b in page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]:
    for l in b["lines"]:
        text = ""
        chars = []
        for s in l["spans"]:
            chars.extend(s["chars"])
        char_count = len(chars)
        if char_count == 1:
            print(chars[0]["c"])
            continue
        for i in range(1, char_count):
            c0 = chars[i - 1]
            r0 = fitz.Rect(c0["bbox"])
            c1 = chars[i]
            r1 = fitz.Rect(c1["bbox"])
            text += c0["c"]
            if r1.x0 - r0.x1 >= 1.5:
                text += " "
            if c1["c"] == " ":
                space_count += 1
                space_w += r1.width
        print(text + c1["c"])
    print()
print(f"space count {space_count}, avg width = {space_w/space_count}")
```
The output is as the reader of the page would expect:
```
Can You Feel the Force?
3
front of a video camera (Figure 0.1).1 In the absence of any atmo-
sphere, the hammer and feather fell without any air resistance;
the only force acting upon them was the Moon’s gravity and,
sure enough, both fell with the same acceleration and reached the
Moon’s surface together.2
What would it feel like to fall towards the Moon’s surface along
with the feather and the hammer? We would fall at the same rate
as these other objects and hit the ground alongside both. Even
more importantly, all the parts of our body would be acceler-
ated in exactly the same way. Our head would fall with the same
acceleration as our kneecaps. Our feet would fall with the same
acceleration as our boots, consequently we would not feel any-
thing at all—we would be weightless. There is nothing special
about gravity on the Moon. What is special about the Moon is that
it is airless and, as there is no air resistance, the only force acting
on us is gravity.
Figure 0.1 Apollo 15 astronaut David Scott is standing on the Moon’s
surface holding a hammer in his right hand and a feather in his left.

space count 172, avg width = 2.584325213764989
```
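As a side note, the measured average width of the MuPDF-generated spaces (about 2.58 here) hints at why the hard-coded 1.5 threshold works for this page. A hypothetical helper (my own sketch, not part of PyMuPDF) could derive a page-specific absolute threshold from those measured widths instead:

```python
def threshold_from_spaces(space_widths, factor=0.6):
    """Derive an absolute inter-character gap threshold as a fraction of
    the average width of the spaces MuPDF already generated on the page.
    With the measured average of ~2.58 above, factor=0.6 yields ~1.55,
    close to the fixed 1.5 used in the script."""
    if not space_widths:
        return 1.5  # fall back to the script's fixed threshold
    return factor * sum(space_widths) / len(space_widths)

print(threshold_from_spaces([2.0, 3.0]))  # 1.5
```

The `factor` value is a guess and would need tuning per document.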
Thanks @JorjMcKie ! I'm expanding the open-source project BrainAnnex.org to also provide a full-text indexing/search feature for documents (incl. PDF) managed by a "Knowledge and multimedia content management system" that employs the power of graph databases... I elaborate in this article.
I happen to have a substantial number of PDF books, scientific papers and other documents at my disposal... and I'm using some of them to test the system.
The extraction of individual words is the key element for this process.
Empirically, I found that PyMuPDF is vastly better than pdfplumber, and typically substantially better than pypdf. Though, as I mentioned, there is a smallish but non-trivial number of cases where PyMuPDF errs a lot (and, interestingly, pypdf does well in those cases... maybe some complementary aspect in their word-detection algorithms??)
I'm definitely impressed by your efforts with PyMuPDF, and the efforts of the MuPDF team!
I will experiment with the "rawdict" approach that you proposed - thanks! - and will report on the results. I understand that word detection in PDFs is something of an art form!
@BrainAnnex - thank you very much for your feedback! We are glad you find PyMuPDF useful.
> maybe some complementary aspect in their word-detection algorithms??
No, this is as simple as a lower threshold value for inter-character distances. In the case of your example, word separation will match the reader's perception if a gap larger than 25% of the current character's width is taken as that threshold. This means that the following code snippet will produce a satisfactory output:
```python
import fitz

doc = fitz.open("Gravity.pdf")
page = doc[11]
for b in page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True)["blocks"]:
    for l in b["lines"]:
        text = ""
        chars = []
        for s in l["spans"]:
            chars.extend(s["chars"])
        char_count = len(chars)
        if char_count == 1:
            print(chars[0]["c"])
            continue
        for i in range(1, char_count):
            c0 = chars[i - 1]
            r0 = fitz.Rect(c0["bbox"])
            c1 = chars[i]
            r1 = fitz.Rect(c1["bbox"])
            text += c0["c"]
            if (c0["c"] != " " and c1["c"] != " "
                    and r1.x0 - r0.x1 >= r0.width * 0.25):
                # distance to next char if both aren't space
                text += " "
        print(text + c1["c"])
```
There's an experimental alternative available in the recently released "rebased" implementation of PyMuPDF-1.23.6, making direct use of MuPDF's Extract facility via the OutputType_DOCX device and the new space-guess setting.
Here's some example code that uses both Page.get_text() and OutputType_DOCX.
```python
import os
import fitz_new as fitz

path = os.path.relpath('Gravity.pdf')
document = fitz.open(path)
page = document.load_page(11)

# Use Page.get_text().
text = page.get_text(flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP | fitz.TEXT_DEHYPHENATE)
n = text.count(' ')
print(f'Text from page.get_text(): {n=}\n{text}')

# Use MuPDF's Extract.
buffer_ = fitz.mupdf.FzBuffer(1)
out = fitz.mupdf.FzOutput(buffer_)
space_guess = 0.3  # Expected width of spaces, relative to adjoining characters.
writer = fitz.mupdf.FzDocumentWriter(
    out,
    f'text,space-guess={space_guess}',
    fitz.mupdf.FzDocumentWriter.OutputType_DOCX,
)
device = fitz.mupdf.fz_begin_page(writer, fitz.mupdf.fz_bound_page(page))
fitz.mupdf.fz_run_page(page, device, fitz.mupdf.FzMatrix(), fitz.mupdf.FzCookie())
fitz.mupdf.fz_end_page(writer)
fitz.mupdf.fz_close_document_writer(writer)
text = buffer_.fz_buffer_extract()
text = text.decode('utf8')
n = text.count(' ')
print(f'Text from FzDocumentWriter.OutputType_DOCX: {n=}\n{text}')
```
For me, Page.get_text() gives text with 172 spaces, containing "sureenough,bothfellwiththesameaccelerationandreachedthe", while OutputType_DOCX with space_guess = 0.3 gives text with 197 spaces, without (I think) any incorrectly joined words.
Depending on how well this works in other cases, you might still be better off using @JorjMcKie's "rawdict" approach.
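For anyone wanting to reuse the relative-threshold idea, here is a hypothetical helper (the function name and structure are mine, not PyMuPDF's) that applies the 25% rule from the snippets above directly to a page.get_text("rawdict") dictionary. It works on the raw bbox tuples, so it needs no fitz.Rect and can be unit-tested without a PDF:

```python
def words_from_rawdict(rawdict, rel_threshold=0.25):
    """Rebuild line texts from a page.get_text("rawdict") dictionary,
    inserting a space whenever the horizontal gap between two consecutive
    non-space characters exceeds rel_threshold times the left character's
    width. Returns one string per line."""
    lines_out = []
    for block in rawdict["blocks"]:
        for line in block.get("lines", []):
            chars = []
            for span in line["spans"]:
                chars.extend(span["chars"])
            if not chars:
                continue
            text = chars[0]["c"]
            for prev, cur in zip(chars, chars[1:]):
                x0_prev, _, x1_prev, _ = prev["bbox"]
                gap = cur["bbox"][0] - x1_prev       # inter-character distance
                width_prev = x1_prev - x0_prev       # width of the left character
                if (prev["c"] != " " and cur["c"] != " "
                        and gap >= width_prev * rel_threshold):
                    text += " "
                text += cur["c"]
            lines_out.append(text)
    return lines_out
```

With a real page you would call `words_from_rawdict(page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT, sort=True))` and split the returned lines into words.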
Describe the bug
I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.
For most PDF files, pymupdf (version 1.23.5) does a fine job... but for some files (such as the one enclosed, "Gravity.pdf"), a lot of words emerge glued together.
To Reproduce
The file in question (but NOT the only one!) is : Gravity.pdf
Output
It contains several words fused together, such as in this portion
and,\nsureenough,bothfellwiththesameaccelerationandreachedthe\nMoon’s surface together.2
Screenshots
Full text of the parse:
This is how it looks in the PDF:
Your configuration
print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
on Colab gives: [output screenshot omitted]
On my local computer, it gives: [output screenshot omitted]
On Colab, I issue !pip install pymupdf, and it says: [output screenshot omitted]
On my local computer, I let PyCharm deal with it. (I think it does a pip install.)
Additional context
Words fused together occur A LOT when parsing the attached PDF.
I suspect you'll say that this file is malformed. Maybe it is... but another software library, pypdf, parses it just fine.
I have noticed that the lost spaces are far more prevalent in extractions by pypdf, compared to PyMuPDF - BUT for some files (such as the one I'm reporting here) it's the opposite, and pypdf does far better.
Empirically, I've noticed an intriguing complementarity between pypdf and PyMuPDF: for files where one messes up badly, the other one does well - and vice versa. Maybe a different threshold for how to detect blank spaces in sentences? Maybe some insight to gain from this?? Thanks!