pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

unexpected output from getText() #501

Closed zoj613 closed 4 years ago

zoj613 commented 4 years ago

Please provide all mandatory information!

Describe the bug (mandatory)

Get a string of unrecognisable characters when calling getText() on a page of a pdf. this is the string I get: ����������� �� ����������\n\n\n���� �������� � ������� �� ���� ���� �������� �� ���� �� ��� ���������� ������� �� ���� ������ �����\n\n\n���� ������\n\n\n� ���� ��� ������\n\n\n������ ���� �����������\n\n\n� ������� �����\n\n\n������� ������������\n\n\n� �� ������\n\n\n���������� ������\n\n\n� ���������\n\n\n������ ������� ������� �������\n\n\n� ��� ������\n\n\n�������� ���� ��� ������\n\n\n� ����������\n\n\n����������� ����\n\n\n� ����������\n\n\n��������� ����\n\n\n� ����������\n\n\n������ �������\n\n\n������\n\n\n�������\n\n\n����� ����\n\n\n�����\n\n\n���\n\n\n���� �� ����� ��\n\n\n�� ������\n\n\n���� ������ ���\n\n\n������\n\n\n�����\n\n\n��� ������\n\n\n������ �����\n\n\n����\n\n\n�\n\n\n�������������\n\n\n����������\n\n\n������� �� ����������\n\n\n������\n\n\n�������\n\n\n����� ����\n\n\n�����\n\n\n���\n\n\n���� �� ����� �� ��\n\n\n������\n\n\n���� ������ ���\n\n\n������\n\n\n���� ��\n\n\n����������\n\n\n���������\n\n\n��� ������\n\n\n����\n\n\n���\n\n\n�\n\n\n�������������\n\n\n����������\n\n\n����������\n\n\n������\n\n\n��� ������\n\n\n������� �����\n\n\n���\n\n\n�\n\n\n�������������\n\n\n����������\n\n\n����������\n\n\n�����\n\n\n��� ������\n\n\n������ �����\n\n\n���\n\n\n�\n\n\n�������������\n\n\n����������\n\n\n����������\n\n\n�����\n\n\n��� ������\n\n\n���� �������\n\n\n���\n\n\n�\n\n\n�������������\n\n\n����������\n\n\n����������\n\n\n������� �������\n\n\n������\n\n\n�������\n\n\n����� ����\n\n\n�����������\n\n\n��������� ����\n\n\n��������� ��\n\n\n���������\n\n\n��� ������\n\n\n����\n\n\n��� ����������\n\n\n������\n\n\n��� ������\n\n\n������� �����\n\n\n��� ����������\n\n\n�����\n\n\n��� ������\n\n\n������ �����\n\n\n��� ����������\n\n\n�����\n\n\n��� ������\n\n\n������ �����\n\n\n��� ����������\n\n\n�����\n\n\n��� ������\n\n\n���� �������\n\n\n��� ����������\n\n\n����������� �������\n\n\n������\n\n\n�������\n\n\n����� ����\n\n\n����������� 
�������\n\n\n��������� ����\n\n\n���������\n\n\n��� ������\n\n\n����\n\n\n��� ����������\n\n\n������\n\n\n��� ������\n\n\n������� �����\n\n\n��� ����������\n\n\n�����\n\n\n��� ������\n\n\n������ �����\n\n\n��� ����������\n\n\n�����\n\n\n��� ������\n\n\n������ �����\n\n\n��� ����������\n\n\n�����\n\n\n��� ������\n\n\n���� �������\n\n\n��� ����������\n\n\n����������� �� ����������� �������\n\n\n� � ��� ������ ��� ��� ���� �� � ������� ������ ��� � �� � ������\n\n\n� � ��� ������ ��� ��� ���� �� � ������� ������ ��� � �� �� ������\n\n\n� � ��� ������ ��� ��� ���� �� � ������� ������ ��� �� �� �� ������\n\n\n� � ��� ������ ��� ��� ���� �� � ������� ������ ��� �� � ������\n

To Reproduce (mandatory)

for i, page in enumerate(doc):
    txt = page.getText()

Expected behavior (optional)

Normal human-readable text.

Your configuration (mandatory)

3.7.3 (default, Apr 8 2020, 16:07:18) [GCC 6.5.0 20181026] linux

PyMuPDF 1.16.18: Python bindings for the MuPDF 1.16.0 library. Version date: 2020-04-22 14:00:00. Built for Python 3.7 on linux (64-bit).

Additional context (optional)

I am trying to extract text from PDF document pages. Most pages work as expected, except when I get this kind of output.

JorjMcKie commented 4 years ago

This works as expected! The page has Unicode characters. When you try this code under e.g. IDLE you will see it. Or save the output of getText() as binary, like this:

with open("output.txt", "wb") as outtext:
    outtext.write(page.getText().encode("utf8"))

With an adult text editor you should also see the Chinese characters (or whatever they are).
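
One way to tell a display problem from a genuine extraction failure is to inspect the code points instead of the rendered glyphs. The following is a minimal sketch, assuming the unreadable characters are U+FFFD (the Unicode replacement character); the helper name `unmapped_ratio` is made up for this illustration and is not part of PyMuPDF.

```python
REPLACEMENT = "\ufffd"  # Unicode REPLACEMENT CHARACTER


def unmapped_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT for ch in visible) / len(visible)


# Genuine Chinese text contains no U+FFFD at all:
sample_ok = "中文 text"
# Text whose glyphs could not be mapped to Unicode looks like this:
sample_bad = "\ufffd\ufffd\ufffd \ufffd\ufffd"

print(unmapped_ratio(sample_ok))   # 0.0
print(unmapped_ratio(sample_bad))  # 1.0
```

A ratio close to 1.0 suggests that the problem lies in the extraction itself, in which case no editor or encoding choice will make the text readable.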

zoj613 commented 4 years ago
.encode("utf8")

Your suggestion still doesn't work, no matter which IDE or text editor I use. Besides, the characters you call Chinese are supposed to be English words. I even tried PyPDF and got the correct English output instead of the Unicode characters shown in the original post. That is why I say I'm getting unexpected output: even with MuPDF, other PDFs of the same kind produce correct output.

JorjMcKie commented 4 years ago

Then I need that problem PDF.

JorjMcKie commented 4 years ago

Please send me the failing PDF.

zoj613 commented 4 years ago

Unfortunately, it is a private email attachment. I'm not sure I have permission to share the document publicly so that you can see it.

JorjMcKie commented 4 years ago

Well, then let's close the issue until you find a way to share data for reproducing the error.

zoj613 commented 4 years ago

Sure, I will reopen once I get permission to share it, or find another document that reproduces the result.

lingzhang00 commented 4 years ago

@JorjMcKie Hi, great job, your tool works fine except for one file. We still have the same problem after trying your solution. The failing file is attached; it is in Chinese. Could you please check it for us? Many thanks. 11-1.pdf

JorjMcKie commented 4 years ago

@lingzhang00 Hi, I checked the file using several viewers and also extracted the text using the mutool draw/convert utility from Artifex/MuPDF. Bottom line: none of them worked. This PDF uses a non-standard (i.e. non-UTF8) encoding, which prevents successful text extraction. You can check this yourself by copying and pasting text from it into e.g. Word: you will get unreadable garbage. I know it is irritating that the visual appearance, on the other hand, looks fine. But these are two different things: the embedded font program delivers painting instructions to the PDF viewer, which then produces a nice-looking image. The only way to get the text out of this is to make an image from the page and then OCR it ...
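
The difference between painting and text can be illustrated with a toy model. This is not how MuPDF stores fonts: the two dictionaries below are hypothetical stand-ins for a font's glyph outlines and for its missing code-to-Unicode mapping.

```python
# Toy model: a font maps character codes to glyph outlines (for painting)
# and, separately, to Unicode values (for text extraction).
glyph_outlines = {1: "<outline of 'H'>", 2: "<outline of 'i'>"}  # present
to_unicode = {}  # missing or unusable, as in the problem PDFs


def paint(codes):
    """What a viewer does: look up outlines; Unicode is never needed."""
    return [glyph_outlines[c] for c in codes]


def extract_text(codes):
    """What text extraction does: it needs the Unicode mapping."""
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


page_codes = [1, 2]
print(paint(page_codes))          # both outlines are found, the page looks fine
print(extract_text(page_codes))   # two U+FFFD characters, nothing readable
```

In this model the page renders correctly while extraction yields only replacement characters, which matches the symptom described in this thread.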

lingzhang00 commented 4 years ago

@JorjMcKie Many thanks for your quick response. Your comments confirm our idea. Anyway, this project runs very well, and it works in most cases for us. Good job.

jeanmonet commented 3 years ago

@JorjMcKie Hi, I've run into this problem as well. Text extraction for the following PDF outputs gibberish using PyMuPDF: font_pb.pdf

However I can achieve text extraction successfully using tika on this same document. I'm also able to successfully copy & paste text using Edge's PDF viewer.

So text extraction remains possible here. Is there a way to achieve this result using PyMuPDF? If not, can the manner in which tika does it be implemented in PyMuPDF in the future?

JorjMcKie commented 3 years ago

@jeanmonet - this is not a missing feature of PyMuPDF, but MuPDF: if you use their cli tool: mutool draw -o test.txt font_pb.pdf you get nonsense output too. When trying to convert / rebuild the PDF, errors / warnings exhibit some clue:

mutool -o test.pdf font_pb.pdf
warning: cannot create ToUnicode mapping for PDCXXT+AllianzNeoSemiBold
warning: cannot create ToUnicode mapping for HWEPHL+AllianzNeoLight

and the text extraction output contains nonsense again. When using a tool from XPDF, pdftotext, similar error messages are issued and nonsense output is produced as well:

pdftotext font_pb.pdf
Syntax Error: Unknown character collection 'Actuate-Identity'
Syntax Error: Unknown character collection 'Actuate-Identity'
Syntax Error: Unknown character collection 'Actuate-Identity'
Syntax Error: Unknown character collection 'Actuate-Identity'

So I suggest you submit a bug / feature request directly to MuPDF's issue maintenance system: https://bugs.ghostscript.com/enter_bug.cgi.
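
When many files are affected, the font names can be pulled out of such messages automatically. The sketch below only parses the MuPDF warning text quoted above; the function name is hypothetical, and capturing the tool's stderr is left out.

```python
import re

# Matches the MuPDF warning format quoted above.
TOUNICODE_WARNING = re.compile(r"cannot create ToUnicode mapping for (\S+)")


def fonts_without_tounicode(log: str) -> list:
    """Return font names from ToUnicode warnings, in order, without repeats."""
    seen = []
    for name in TOUNICODE_WARNING.findall(log):
        if name not in seen:
            seen.append(name)
    return seen


log = """\
warning: cannot create ToUnicode mapping for PDCXXT+AllianzNeoSemiBold
warning: cannot create ToUnicode mapping for HWEPHL+AllianzNeoLight
"""
print(fonts_without_tounicode(log))
# ['PDCXXT+AllianzNeoSemiBold', 'HWEPHL+AllianzNeoLight']
```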

JorjMcKie commented 3 years ago

Other tools' output:

jeanmonet commented 3 years ago

Thank you for your quick feedback. I'll propose this problem as you suggest on MuPDF's issue tracker. Indeed I've also tried some other tools including pdftotext (poppler on Ubuntu) that didn't handle the text either, but found it interesting that Edge's PDF viewer (didn't try Adobe) could decode/extract the text accurately, and that tika does the job as well, so there must be some way to decode this one. Incidentally, the PDF proposed above (11-1.pdf) did not work well (gibberish output) with Edge or tika.

JorjMcKie commented 3 years ago

It is a different discussion, but depending on urgency, you might be interested to look at https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract1.py which "interactively" invokes Tesseract for every text span that contains characters that cannot be interpreted. Of course a brute force approach, which is not fast but it does work in your case. Here is a log:

python tesseract1.py
before: '�����'
 after: 'Frais'
before: '��������������������������������������������������������������'
 after: 'Les frais et commissions acauittés servent à couvrir Les coûts'
before: '������������������������������������������������'
 after: 'd'exploitation de l'OPCVM y compris Les coûts de'
before: '�������������������������������������������������������������������'
 after: 'commercialisation et de distribution des parts, ces frais réduisent'
before: '����������������������������������������������'
 after: 'la croissance potentielle des investissements.'
before: '������������������������������������������������������'
 after: 'Frais ponctuels prélevés avant ou après investissement'
before: '��������������'
 after: 'Frais d'entrée'
before: '������'
 after: '0,00 %'
before: '���������������'
 after: 'Frais de sortie'
before: '������'
 after: '0,00 %'
before: '��������������������������������������������������������������������'
 after: 'Le pourcentage indiqué est le maximum pouvant être prélevé sur votre'
before: '��������������������������������������������������������������������������'
 after: 'capital. Dans certains cas l'investisseur peut payer moins. L'investisseur'
before: '�������������������������������������������������������������������������'
 after: 'peut obtenir de son conseiller où de son distributeur Le montant effectif'
before: '��������������������������������'
 after: 'des frais d'entrée et de sortie.'
before: '����������������������������������������'
 after: 'Frais prélevés par l'OPCVM sur une année'
before: '��������������'
 after: 'Frais courants'
before: '������'
 after: '1,83 %'
before: '�������������������������������������������������������'
 after: 'Frais prélevés par l'OPCVM dans certaines circonstances'
before: '�������������'
 after: 'Commission de'
before: '�����������'
 after: 'performance'
before: '�����'
 after: 'néant'
before: '��������������������������������������������������������'
 after: 'Le pourcentage des frais courants communiqué ici est une'
before: '�����������'
 after: 'estimation.'
before: '���������������������������������������������������������'
 after: 'Les frais courants ne comprennent pas: les commissions de'
before: '����������������������������������������������������������������'
 after: 'surperformance et Les frais d'intermédiation excepté dans le cas'
before: '�������������������������������������������������������������'
 after: 'de frais d'entrée et/ou de sortie payés par l'OPCVM lorsqu'il'
before: '�������������������������������������������������������'
 after: 'achète ou vend des parts d'un autre véhicule de gestion'
before: '�����������'
 after: 'collective.'
before: '������������������������������������������������������������������'
 after: 'Pour plus d'informations sur les frais, veuillez-vous référer à la'
before: '������������������������������������������������������������'
 after: 'section « Frais et commissions » du prospectus de cet OPCVM,'
before: '���������������������������������������������������������'
 after: 'disponible sur le site internet https://fr.allianzgi.com.'
before: '��������������������'
 after: 'Performances passées'
before: '�������������������������������������������������������������������������������������������������������������������������������'
 after: 'Nous ne disposons pas des données de performances d'un exercice entier. Nous ne pouvons donc pas vous donner d'indication utile'
before: '�����������������������������'
 after: 'sur Les performances passées.'
before: '����������������������'
 after: 'Informations pratiques'
before: '���������������������������������������������������������'
 after: 'Dépositaire: State Street Bank International GmbH - Paris'
before: '������'
 after: 'Branch'
before: '�������������������������������������������������������������'
 after: 'Vous pouvez obtenir gratuitement Les documents d'informations'
before: '����������������������������������������������������'
 after: 'clés pour l'investisseurs des autres parts, copie du'
before: '�������������������������������������������������������������'
 after: 'prospectus/rapport annuel/document semestriel en français sur'
before: '��������������������������������������������������������'
 after: 'simple demande adressée à Allianz Global Investors GmbH,'
before: '�������������������������������������������������������������'
 after: 'Bockenheimer Landstrasse 42-44, D-60323 Francfort sur le Main'
before: '����������������������������������������������������������������'
 after: '— Allemagne ou à Allianz Global Investors, Succursale Française,'
before: '������������������������������������������������������������'
 after: '3 Boulevard des Italiens 75113 Paris Cedex 02 ou sur le site'
before: '����������������������������������'
 after: 'internet https://fr.allianzgi.com.'
before: '�����������������������������������������������������������������'
 after: 'La valeur liquidative ainsi que d'autres informations relatives à'
before: '�����������������������������������������������������������'
 after: 'l'OPCVM sont disponibles auprès: d'Allianz Global Investors'
before: '�����������������������������������������������������������'
 after: 'GmbH, Bockenheimer Landstrasse 42-44, D-60323 Francfort sur'
before: '����������������������������������������������������������'
 after: 'le Main — Allemagne ou auprès: d'Allianz Global Investors,'
before: '����������������������������������������������������������'
 after: 'Succursale Française, 3 Boulevard des Italiens 75113 Paris'
before: '����������������������������������������������������������'
 after: 'Cedex 02 ou sur le site internet https://fr.allianzgi.com.'
before: '������������������������������������������������������������'
 after: 'Des informations relatives à la politique de rémunération en'
before: '�������������������������������������������������������������'
 after: 'vigueur, y compris une description des méthodes de calcul des'
before: '����������������������������������������������������������'
 after: 'rémunérations et gratifications de certaines catégories de'
before: '������������������������������������������������������������'
 after: 'salariés ainsi que l'indication des personnes chargées de la'
before: '�����������������������������������������������������������������'
 after: 'répartition sont disponibles sur https://regulatory.allianzgi.com'
before: '�����������������������������������������������'
 after: 'et sur demande et sans frais en version papier.'
before: '����������������������������������������������������������������'
 after: 'L'OPCVM est soumis à la législation fiscale française. Cela peut'
before: '�����������������������������������������������������������������'
 after: 'avoir une incidence sur votre situation fiscale personnelle. Pour'
before: '�������������������������������������������������������������'
 after: 'plus d'informations, merci de vous renseigner auprès de votre'
before: '������������������'
 after: 'conseiller fiscal.'
before: '��������������������������������������������������������������'
 after: 'La responsabilité d'Allianz Global Investors GmbH ne peut être'
before: '���������������������������������������������������������'
 after: 'engagée que sur la base de déclarations contenues dans le'
before: '����������������������������������������������������������'
 after: 'présent document qui seraient trompeuses, inexactes où non'
before: '������������������������������������������������������������'
 after: 'cohérentes avec les parties correspondantes du prospectus de'
before: '��������'
 after: 'l'OPCVM.'
before: '��������������������������������������������������������������'
 after: 'Cet OPCVM est agréé en France et réglementé par l'Autorité des'
before: '������������������������������������������������������������'
 after: 'Marchés Financiers. Allianz Global Investors GmbH est agréée'
before: '����������������������������������������������������'
 after: 'en Allemagne et réglementée par la Bundesanstalt für'
before: '��������������������������������������'
 after: 'Finanzdienstleistungsaufsicht (BaFin).'
before: '�������������������������������������������������������������������'
 after: 'Les informations clés pour l'investisseur fournies ici sont exactes'
before: '������������������������'
 after: 'et à jour au 07.05.2020.'
-------------------------
OCR invocations: 72.
Pixmap time: 0.15975 (avg 0.00222) seconds.
OCR time: 36.4614 (avg 0.50641) seconds.
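
The idea behind that script can be sketched without Tesseract: walk the dictionary returned by page.get_text("dict") and collect only the spans that contain U+FFFD, so that OCR is limited to those rectangles. This is not the code of tesseract1.py; the function name is hypothetical and the sample dictionary is hand-made in the blocks / lines / spans layout.

```python
def spans_needing_ocr(page_dict):
    """Return (bbox, text) for every text span containing U+FFFD.

    page_dict is assumed to have the layout of page.get_text("dict"):
    blocks -> lines -> spans, each span with "text" and "bbox".
    """
    hits = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                if "\ufffd" in span["text"]:
                    hits.append((span["bbox"], span["text"]))
    return hits


# Hand-made sample in the same layout (not taken from a real PDF).
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "\ufffd" * 5, "bbox": (36, 40, 80, 52)},
            {"text": "0,00 %", "bbox": (300, 40, 340, 52)},
        ]}]},
        {"type": 1, "bbox": (0, 0, 10, 10)},  # image block
    ]
}
for bbox, text in spans_needing_ocr(sample):
    print(bbox, len(text))  # (36, 40, 80, 52) 5
```

Each returned rectangle could then be rendered to a pixmap and handed to an OCR engine, which is the step that dominates the run time in the log above.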

jeanmonet commented 3 years ago

Thank you for pointing this out, it's a great recipe, I will certainly keep it in mind!

JorjMcKie commented 3 years ago

@jeanmonet - I just saw that Artifex/MuPDF has reacted to a similar issue I had submitted a few weeks ago. They have published a fix that indeed resolves the problem file cases in this thread (I am not through testing them all yet).

It all goes back to illegal CMAP (character map) specifications in the respective fonts used. Some tools (like XPDF's pdftotext) have already adapted to this - that's why they work.

JorjMcKie commented 3 years ago

Well, font_pb.pdf does work; 11-1.pdf still does not.

jeanmonet commented 3 years ago

Just saw this as well. Really cool that they reacted so quickly. Now I'm waiting for the conda-forge package to be updated :)

Unrelated, but do you know if there are plans for PyMuPDF to post on a conda repo as well (conda-forge maybe)?

JorjMcKie commented 3 years ago

do you know if there are plans for PyMuPDF to post on a conda repo as well (conda-forge maybe)?

It's on my list. Already had a look at it. When doing this, it will be restricted to configurations that can be served by Python wheels. I.e. I will not try to do a combined MuPDF + PyMuPDF installation from sources.

jeanmonet commented 3 years ago

Great news! I personally prefer installing via conda when available because it is very easy to keep packages up to date without breaking dependencies (and use mamba for its better & faster solver).

zyc130130 commented 3 years ago

Hi, I have the same problem as in this issue, and when I use the mupdf command line I get the same result. As you said, "It all goes back to illegal CMAP (character map) specifications in the respective fonts used." Can we set the CMAP in PyMuPDF? I can't find a CMAP directory in the PyMuPDF library.

this is my input file: b37b0387e95694e38a6ab244f725ca61-1.pdf

JorjMcKie commented 3 years ago

I have the same problem as the issue

No, you do not, @zyc130130 - your case is entirely different. Apart from the fact that the PDF has an illegal structure, its fonts are Type 3, which by definition have no CMAP. All characters are defined inline, as a series of atomic draw commands: e.g. the capital letter "D" is drawn as a line "|" followed by a left-open semi-circle, etc. Text like this cannot be extracted - and this is the intention of the PDF creator. You cannot even copy-paste it - try Adobe or whatever. You have to resort to OCR-ing the PDF.

JorjMcKie commented 3 years ago

@zyc130130 - install Tesseract and try this script: tesseract1.zip Or install ocrmypdf and do ocrmypdf --force-ocr --output-type pdf input.pdf output.pdf. The PDF output.pdf looks (much) like input.pdf and has extractable text ...

JorjMcKie commented 3 years ago

@zyc130130 - I just tested this:

First install ocrmypdf: python -m pip install ocrmypdf. Also make sure you have an installed Ghostscript!!! Then:

import fitz
import ocrmypdf
import io
ocrpdf = io.BytesIO() # the OCR-ed PDF will land here
ocrmypdf.ocr("input.pdf", ocrpdf, force_ocr=True, output_type="pdf")
# several messages will now appear, informing about OCR success
doc = fitz.open("pdf", ocrpdf)

This saves fiddling with intermediate files because of the OCR step. You can even have two PDFs open at the same time: (1) the original one, (2) the OCR-ed one. Whenever you receive illegible text (i.e. characters chr(0xfffd)) from the original one, extract the text from the corresponding page of the OCR version instead ...
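
The page-by-page fallback described here might look as follows. This is a sketch under assumptions: both documents have the same number of pages in the same order, and the function works on already extracted page texts so that it does not depend on any library. The function name is hypothetical.

```python
def merge_page_texts(original_texts, ocr_texts):
    """Prefer the original text; use the OCR text where U+FFFD appears."""
    if len(original_texts) != len(ocr_texts):
        raise ValueError("documents must have the same number of pages")
    merged = []
    for orig, ocr in zip(original_texts, ocr_texts):
        merged.append(ocr if "\ufffd" in orig else orig)
    return merged


# With PyMuPDF the inputs could be built like this (not executed here):
#   original_texts = [page.get_text() for page in fitz.open("input.pdf")]
#   ocr_texts = [page.get_text() for page in doc]  # doc = the OCR-ed PDF
print(merge_page_texts(["clean page", "\ufffd\ufffd\ufffd"],
                       ["c1ean page", "second page"]))
# ['clean page', 'second page']
```

Keeping the original text wherever it is legible avoids OCR misreadings (such as "c1ean" above) on pages that never needed OCR.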

zyc130130 commented 3 years ago

Thanks for your response

zyc130130 commented 3 years ago

try Adobe or whatever.

By the way, when I use pdfminer to extract text from this PDF, I get the right result. Maybe Type 3 fonts will be supported in MuPDF.

zyc130130 commented 3 years ago

Maybe the problem can be solved as in the commit at "http://git.ghostscript.com/?p=mupdf.git;a=commitdiff;h=47fc8f547ebe9243f4b373f580662e1f2019a5ec", by updating the function "fz_unicode_from_glyph_name(const char *name)".

JorjMcKie commented 3 years ago

@zyc130130 - I was wrong! Sorry about that.

Yesterday I detected a bug in the underlying library MuPDF: Type 3 fonts are supported, but when the so-called /Differences array specifies unicodes instead of glyph names (as in your case), then the wrong things happen and text is not recognized as it should be. I made those changes in my local copy of MuPDF, and it did work.

>>> import fitz
>>> doc=fitz.open("b37b0387e95694e38a6ab244f725ca61-1.pdf")
>>> page=doc[0]
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
mupdf: invalid page object
>>> print(page.get_text())
Contents
I Nose, Paranasal Sinuses, and Face
1
Anatomy, Physiology, and Immunology
of the Nose, Paranasal Sinuses, and Face
1
Gerhard Grevers
2 Diagnostic Evaluation of the Nose
and Paranasal Sinuses
15
...

I have submitted a Pull Request to them and hope this will go into their next version.
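
For readers unfamiliar with /Differences: it is an array in a font's /Encoding dictionary in which an integer sets the current character code and each following name is assigned to consecutive codes. The sketch below parses an already tokenized array and resolves the names to characters. It only illustrates the idea and is not MuPDF's code: the tiny glyph table and the sample array are invented, and treating "unicodes" as names of the form uniXXXX is an assumption about this particular file.

```python
# A few entries only; real glyph lists are far larger (illustrative subset).
GLYPH_NAMES = {"space": " ", "A": "A", "B": "B", "hyphen": "-"}


def parse_differences(tokens):
    """Map character codes to glyph names from a tokenized /Differences array."""
    mapping, code = {}, None
    for tok in tokens:
        if isinstance(tok, int):
            code = tok              # an integer resets the running code
        elif code is None:
            raise ValueError("name before first code")
        else:
            mapping[code] = tok     # a name is assigned to the current code
            code += 1
    return mapping


def name_to_char(name):
    """Resolve a glyph name, accepting 'uniXXXX' style names as well."""
    if name in GLYPH_NAMES:
        return GLYPH_NAMES[name]
    if name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))
        except ValueError:
            pass
    return "\ufffd"                 # unknown name: text cannot be recovered


diffs = parse_differences([1, "A", "B", 32, "space", 40, "uni0043", "g17"])
print(diffs)  # {1: 'A', 2: 'B', 32: 'space', 40: 'uni0043', 41: 'g17'}
print("".join(name_to_char(diffs[c]) for c in (1, 2, 32, 40)))  # AB C
```

A name that neither table nor convention can resolve (such as "g17" here) ends up as U+FFFD, which is the kind of output reported throughout this thread.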

zyc130130 commented 3 years ago

I know that with the command "page.get_text("text", flags=2, clip=clip)" some characters can be lost, as you have said in the PyMuPDF documentation. This shows in the attached JPG. I want to know whether page.cropbox in PyMuPDF is the same as the TrimBox in the Adobe documentation. If they are not the same, I can't find a function to get the TrimBox in PyMuPDF.

JorjMcKie commented 3 years ago

You can extract all values from the page definition (and all other PDF objects as well) using low-level code:

>>> from pprint import pprint
>>> page = doc[6]
>>> pprint(doc.xref_get_keys(page.xref))
('CropBox',
 'MediaBox',
 'Rotate',
 'Annots',
 'StructParents',
 'Resources',
 'Parent',
 'Contents',
 'Type')
>>> 

Now extract "CropBox" as an example:

>>> item = doc.xref_get_key(page.xref, "CropBox")
>>> pprint(item)
('array', '[-4.97 -9.745 547.99 718.535]')
>>>

The result is always a pair of type information and value. Convert the value of item to a proper fitz.Rect:

>>> tmp = tuple(map(float, item[1][1:-1].split()))
>>> tmp
(-4.97, -9.745, 547.99, 718.535)
>>> cropbox = fitz.Rect(tmp)
>>> # but these are PDF coordinates!
>>> # convert them to MuPDF coordinates:
>>> cropbox = cropbox * page.transformation_matrix
>>> cropbox
Rect(0.0, 0.0, 552.9599609375, 728.2799682617188)
>>> # compare with page.rect:
>>> page.rect
Rect(0.0, 0.0, 552.9599609375, 728.2799682617188)
>>> 

Do something similar with "TrimBox" ...
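
The same steps can be wrapped in a small helper that also copes with a missing key. This is a sketch, assuming that xref_get_key reports an absent key as the pair ('null', 'null') and that the array holds plain numbers rather than indirect references; the helper name is hypothetical.

```python
def parse_box(item):
    """Turn an xref_get_key() result such as ('array', '[0 0 612 792]')
    into a tuple of four floats, or None if the key holds no array."""
    kind, value = item
    if kind != "array":
        return None                      # e.g. ('null', 'null'): key not present
    numbers = tuple(float(tok) for tok in value.strip("[] ").split())
    if len(numbers) != 4:
        raise ValueError("not a rectangle: %r" % (value,))
    return numbers


print(parse_box(("array", "[-4.97 -9.745 547.99 718.535]")))
# (-4.97, -9.745, 547.99, 718.535)
print(parse_box(("null", "null")))       # None

# With PyMuPDF (not executed here):
#   trim = parse_box(doc.xref_get_key(page.xref, "TrimBox"))
#   if trim is not None:
#       trimbox = fitz.Rect(trim) * page.transformation_matrix
```

If parse_box returns None for "TrimBox", the page does not define one; the PDF specification then defaults the trim box to the crop box.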

PasaOpasen commented 1 year ago

@zyc130130 - I was wrong! Sorry about that.

Yesterday I detected a bug in the underlying library MuPDF: Type 3 fonts are supported, but when the so-called /Differences array specifies unicodes instead of glyph names (as in your case), then the wrong things happen and text is not recognized as it should be. I made those changes in my local copy of MuPDF, and it did work.

Is there a way to fix this inside pdf?

JorjMcKie commented 1 year ago

Is there a way to fix this inside pdf?

No, there isn't, sorry.

asapsmc commented 12 months ago

Note: I don't know if I should be opening a new issue or if commenting here is ok.

I'm having the same issue with this file (Problematic1.pdf). The page_content appears as follows: ���������������������\nu\nv\n�������������������������������������������������������������������������\n����������������������������������������������������������������������������\n����������������������������������������\ns\nt\nu\nv\ns\nt\n�������������������������\n����������������\n�������������\n������������\n��� ���������������������������\n��������\n����������\n�����\n�\n�������������\n��������������������������� ����������\n�������������������� ���������� ����������\n������������������������������������������������\n����������� ����������\n��������� ����������\n����������������������������\n \n������������\n�\n�������\n������������������������������������������\n�����������������������������\n��������\n�������\n�����\n��������������������������������������\n��������\n�������\n�����\n�����������������������������\n��������\n�������\n�����\n������������������������������\n��������\n������\n�����\n�������������\n��������\n��������\n������������������������������������������\n������������������������\n��������\n�����\n�����\n�������������������������������������������������������������������������������������������������������������������������\n������������������������������������������������������������\n��������������\n�������\n������������\n�����������������������������������������\n������������������\n��������\n�����\n�����\n�������������\n��������\n�������\n�����\n������������������\n�������\n�\n������������������\n��������\n����\n����\n������������������\n������������\n������\n��������\n�������\n�\n���������\n�������\n��������������\n���������\n���������������\n����������������������������������������������\n�������\n���������������������\n��������������������������������������������������������������������������\nEmitido por programa certificado nº 2386/AT. Este documento não serve de fatura.\n'

JorjMcKie commented 12 months ago

I can look into this next week Tuesday. In the meantime can you please try and copy / paste the text from a PDF Viewer (e.g. Adobe) to a word app and see if the text becomes readable. If not then there is no solution except OCR.

asapsmc commented 12 months ago

Yes, I'm able to open the document without any error in Acrobat Reader and other PDF readers, and select the text and copy-paste it. This is an extract of the copy-paste:

Pág.1/3 ¤ ¥ A falta de pagamento ou o pagamento depois do prazo limite, pode implicar custos adicionais e a suspensão do serviço. Pode também originar o pagamento de uma caução ou o agravamento da mesma. ¢ £ ¤ ¥ ¢ £ Modalidades de Pagamento: · Débito Direto · Multibanco · Lojas MEO · Pontos de Venda Autorizados · Cheque · Pay Shop . CTT OS SEUS MEOS: Os seus MEOS em 14/07/2023: 1304 Detalhe da Fatura Nº A795979751 julho 2023 Este documento não é válido para efeitos fiscais Nº Cliente: 1290600729 Nº Conta: 1339848861 Data de Emissão: 14 jul 2023 MENSALIDADES PACOTES Nº Serviço 1702361504 · Vila Nova de Gaia Desconto Mensalidade Camp.Emp jul 2023 - 3,250 23,00 Desconto mensalidade fatura eletrónica jul 2023 - 0,813 23,00 Desconto Mensalidade Camp.Emp jul 2023 - 1,223 23,00 M4 (TV+NET 200+VOZ+MÓVEL 10GB) jul 2023 53,000 23,00 Total Pacotes € 47,714 TELEFONE...

Thank you

JorjMcKie commented 12 months ago

Thanks, this indeed looks like an issue then.

asapsmc commented 11 months ago

@JorjMcKie did you have some time for this?

JorjMcKie commented 11 months ago

Looking at font details

import fitz
doc=fitz.open("Problematic1.pdf")
page=doc[0]
from pprint import pprint
pprint(page.get_fonts())
[(4, 'n/a', 'Type3', 'C0RB0100_T1001140', 'C0RB0100_T1001140', ''),
 (11, 'pfa', 'Type1', 'PTBlissRegular', 'PTBlissRegular', ''),
 (14, 'pfa', 'Type1', 'MyriadPro-Regular', 'Myriad', ''),
 (17, 'n/a', 'Type1', 'C0T05590', 'C0T05590', 'WinAnsiEncoding'),
 (19, 'pfa', 'Type1', 'MyriadPro-Semibold', 'MyriadSemibold', '')]

reveals that the fonts (except xref 17) do not contain encoding information (the last item is ''). In such cases MuPDF, like other packages, cannot translate glyphs back to Unicode characters. Some other packages may be successful with guesswork, but OCR is the only alternative left here.
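The check described above can be sketched as a small helper. It assumes the tuple layout shown in the output (xref, file extension, font type, base font name, reference name, encoding); the function name `fonts_without_encoding` is made up for illustration. An empty encoding is only a hint that extraction may fail, not proof.

```python
def fonts_without_encoding(font_list):
    """Return (xref, basefont) for every font tuple whose encoding field is empty."""
    # Index 5 is the encoding field in the 6-item tuples shown above.
    return [(f[0], f[3]) for f in font_list if f[5] == ""]


# The font list reported for page 0 of Problematic1.pdf in this thread.
fonts = [
    (4, "n/a", "Type3", "C0RB0100_T1001140", "C0RB0100_T1001140", ""),
    (11, "pfa", "Type1", "PTBlissRegular", "PTBlissRegular", ""),
    (14, "pfa", "Type1", "MyriadPro-Regular", "Myriad", ""),
    (17, "n/a", "Type1", "C0T05590", "C0T05590", "WinAnsiEncoding"),
    (19, "pfa", "Type1", "MyriadPro-Semibold", "MyriadSemibold", ""),
]

suspects = fonts_without_encoding(fonts)
print(suspects)
```

On a real document the list would come from page.get_fonts() instead of being typed in by hand.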

yungsinatra0 commented 11 months ago

Hi @JorjMcKie, I'm getting a mix of normal characters and the Unicode character "�" when both extracting using PyMuPDF and manually copying the text from Adobe PDF reader. Is it possible to fix this somehow or would only OCR work in this case? If OCR is the solution, should it be done as in the reply below?

@zyc130130 - I just tested this:

First install ocrmypdf: python -m pip install ocrmypdf. Also make sure you have an installed Ghostscript!!! Then:

import fitz
import ocrmypdf
import io
ocrpdf = io.BytesIO() # the OCR-ed PDF will land here
ocrmypdf.ocr("input.pdf", ocrpdf, force_ocr=True, output_type="pdf")
# several messages will now appear, informing about OCR success
doc = fitz.open("pdf", ocrpdf)

This saves fiddling with intermediate files because of the OCR step. You can even have two PDFs open at the same time: (1) the original one, (2) the OCR-ed one. Whenever you receive illegible text (i.e. characters chr(0xfffd)) from the original one, extract the text from the corresponding page of the OCR version instead ...
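A minimal sketch of that per-page fallback, assuming both documents have the same page count and order. The helper names and the `max_ratio` threshold are illustrative choices, not part of PyMuPDF; the only PyMuPDF call used is page.get_text().

```python
REPLACEMENT_CHAR = chr(0xFFFD)  # what extraction yields for unmappable glyphs


def is_illegible(text, max_ratio=0.0):
    """True if the share of U+FFFD characters in text exceeds max_ratio."""
    if not text:
        return False
    return text.count(REPLACEMENT_CHAR) / len(text) > max_ratio


def extract_with_fallback(original_doc, ocr_doc, max_ratio=0.0):
    """Take each page's text from the original, or from the OCR-ed copy if illegible."""
    texts = []
    for orig_page, ocr_page in zip(original_doc, ocr_doc):
        text = orig_page.get_text()
        if is_illegible(text, max_ratio):
            text = ocr_page.get_text()  # same page number in the OCR version
        texts.append(text)
    return texts
```

With the two documents from the snippet above this would be called as extract_with_fallback(fitz.open("input.pdf"), doc). Raising `max_ratio` above zero tolerates a few stray replacement characters before switching to the OCR text.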

edit: forgot to mention that due to the sensitive nature of the PDF document, I am unable to share it (or any content from it)

JorjMcKie commented 11 months ago

@yungsinatra0 - as said, in those cases OCR is the only way out. Are you aware of PyMuPDF's inbuilt Tesseract-OCR interface? Especially in cases where there are 0xFFFD characters only intermittently, the script logic explained here may help you more on the spot.
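For orientation, a rough sketch of the built-in OCR route. It is not the referenced script: it simply re-reads the whole page via OCR whenever any 0xFFFD character shows up, whereas a finer-grained approach would OCR only the affected text pieces. It assumes Tesseract and its language data are installed and can be found by PyMuPDF; the function name and the default `language` and `dpi` values are illustrative.

```python
REPLACEMENT_CHAR = chr(0xFFFD)


def text_with_ocr_fallback(page, language="eng", dpi=300):
    """Return the page text, re-reading the page through OCR if it contains U+FFFD."""
    text = page.get_text()
    if REPLACEMENT_CHAR not in text:
        return text  # normal extraction worked, no OCR needed
    # full=True asks for OCR of the entire rendered page, not only its images.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)
```

OCR is slow compared with plain extraction, so checking for the replacement character first keeps it limited to the pages that need it.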

yungsinatra0 commented 11 months ago

@JorjMcKie Thanks for the quick reply! No, I was not aware of the inbuilt Tesseract-OCR interface. Is there a guide that shows an example on how to use this?

JorjMcKie commented 11 months ago

The mentioned script is one of those examples. Please look in the folder where this script lives to see more.

yungsinatra0 commented 11 months ago

Thanks, will take a look!

Soumadip-Saha commented 6 months ago

Hi @JorjMcKie. Thanks a lot for this great package you have created. I am also facing a similar issue here. I have thousands of documents, some of which contain text whose fonts have no encoding. I would like to skip that text when extracting from those pages, but I am not sure how to do that. Can you please help me with that? It would be very helpful if you could give me some insights. And thanks once again for your help.

JorjMcKie commented 6 months ago

@Soumadip-Saha In general, every single character may have been written with its own "personal" font, which may be different from that of neighboring characters. This of course is a rare thing, but important to keep in mind.

You must extract text with a variant that also delivers the font name - i.e. use page.get_text("dict", ...). This returns a fairly complex, hierarchical dictionary structure documented here.

The lowest hierarchy level is the "span": a piece of text within a line whose characters all have identical properties; in particular, they share the same font.

An additional complication is that normally only the font's "self name" is given in the span dictionary, while in most cases PDFs contain subset fonts: these are subsets of the original where all unused characters (actually glyphs) have been removed to reduce file size. To unambiguously identify the font you must make sure to get the full font name - which includes the subset identifier: a 6-character upper case ASCII string followed by a "+".
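To illustrate the naming convention just described, here is a small helper that splits a full font name into its subset identifier and the plain font name. The function name and the sample prefix "ABCDEF" are made up for illustration; only the "six upper-case letters plus '+'" pattern comes from the explanation above.

```python
import re

# Six upper-case ASCII letters, a "+", then the actual font name.
SUBSET_PREFIX = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(fontname):
    """Return (subset_id, plain_name); subset_id is None for non-subset fonts."""
    match = SUBSET_PREFIX.match(fontname)
    if match is None:
        return None, fontname
    return match.group(1), match.group(2)
```

For example, split_subset_name("ABCDEF+MyriadPro-Regular") gives ("ABCDEF", "MyriadPro-Regular"), while a name without a prefix such as "PTBlissRegular" comes back unchanged with None as its subset identifier.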

Enough prolog. Here is a snippet that should give you a start:

import fitz  # PyMuPDF

fitz.TOOLS.set_subset_fontnames(True)  # ensure to be given full fontnames (global parameter)
for page in doc:  # doc: a document opened earlier with fitz.open(...)
    # make list of font names that have no encoding
    badfonts = [f[3] for f in page.get_fonts() if f[-1] == ""]
    blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]  # text blocks only
    for block in blocks:
        for line in block["lines"]:
            for span in line["spans"]:
                if span["font"] in badfonts:  # skip this text piece!
                    continue
                text = span["text"]  # text with a healthy font