SuryaViswanath commented 3 years ago

Thanks for this amazing library.

I was trying to follow issue #365, but I couldn't follow it through to the end to get a workaround for my project. I see the same Identity-H mapping when using `getFontList()`, and the `getText("rawdict")` output is as follows:

[screenshot: issue]

PyMuPDF version: 1.18.0

I cannot share the sample pdf file to help you recreate the issue but I can be actively communicating with you to give all the details you need.

JorjMcKie commented 3 years ago

This question mark symbol is generated by MuPDF directly -- my C base library, for which I am providing the PyMuPDF bindings. The important info here is: I have no influence on this, it can't be changed.

The symbol always appears when MuPDF encounters a character that is not UTF-8 encoded. And MuPDF is usually correct in telling us this: make an "outside" cross-check by copying the text from inside some PDF viewer and then pasting it into Word.

You should see that Word, too, does not understand what this character means.

The only way out is OCR-ing the PDF and then using the result ... Some background here: inside the font's fontfile program, the visual appearance (the "glyph") and its originating unicode are two different things. From seeing a glyph in the PDF, you cannot deduce the glyph number and hence cannot find out the unicode.
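To decide which pages would need the OCR fallback mentioned above, one option is to measure how much of the extracted text consists of the replacement character U+FFFD. The following is only a sketch: the helper names and the 5% threshold are assumptions made for illustration, not part of PyMuPDF.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the symbol rendered as a question mark


def replacement_ratio(text: str) -> float:
    """Return the share of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def pages_needing_ocr(page_texts, threshold=0.05):
    """Return 0-based indices of pages whose U+FFFD share exceeds `threshold`.

    `threshold` is a hypothetical cut-off chosen for this sketch.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if replacement_ratio(text) > threshold
    ]
```

The `page_texts` list could be built from the text of each page, for example with `page.getText()` in PyMuPDF 1.18.0 (`page.get_text()` in current versions).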

SuryaViswanath commented 3 years ago

Thanks for the quick reply. However, I am able to read the same file and extract the text using pdfminer. So do you think there is a possibility in the future to have this issue resolved?

thanks

JorjMcKie commented 3 years ago

> So do you think there is a possibility in the future to have this issue resolved?

Excuse this sarcastic answer: Without granting me a reproducing PDF? No.

BTW I find this amazing - almost unbelievable.

SuryaViswanath commented 3 years ago

I am so sorry if I sound inappropriate, but I am trying to understand what is unbelievable. I will see what I can do about giving you a PDF file for a chance to work on it.

JorjMcKie commented 3 years ago

> unbelievable

It is hard to believe that a pure-Python library like pdfminer is able to do it ... compared with a - after all - commercial product like MuPDF.

SuryaViswanath commented 3 years ago

I am sure there are possibilities at every corner. That being said, I have a PDF file in which I edited most of the document, while it still has the � values in quite a lot of places. I am not sure how close the file is to the actual encoding present, but I think it should be similar. Please do let me know if I can provide any other details.

testFile.pdf

JorjMcKie commented 3 years ago

Thanks! I will look into it.

JorjMcKie commented 3 years ago

Uhm ... and pdfminer is able to extract from this guy ...?

JorjMcKie commented 3 years ago

Before I make myself knowledgeable with pdfminer, could you please send me your pdfminer script to print the text? Thanks!

SuryaViswanath commented 3 years ago

I hope this helps. Since you have already recreated the PDF file, I am removing the above comment for personal reasons.

```python
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import pandas as pd

# Open a PDF file.
fp = open(r"image-copy.pdf", 'rb')
# fp = open(r"C:\Users\nk3\Documents\pdf highlight\new\MSPIN001368_MIL-24889 Rev 10.0.pdf", 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

pdf_contents_df = pd.DataFrame(columns = ["x1", "y1", "x2", "y2", "content", "page"])

def parse_obj(lt_objs, pdf_contents_df, pg):
    st = ""

    # loop over the object list
    for obj in lt_objs:

        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            txt = obj.get_text().replace('\n', '')
            #print(txt)
            st = txt.split("(")
            st = ["("+i for i in st[1:]]
            text_val = ""
            for i in st:
                text_str = i.strip()

                if 'cid' in text_str.lower():
                    text_str = text_str.strip('(')
                    text_str = text_str.strip(')')
                    ascii_num = text_str.split(':')[-1]
                    ascii_num = int(ascii_num)
                    text_val += str(chr(ascii_num))  # 66 = 'B' in ascii

            #print(text_val)
            #print("%6d, %6d,%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.bbox[2], obj.bbox[3], obj.get_text().replace('\n', '')))
            print("%6d, %6d,%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.bbox[2], obj.bbox[3], text_val))
            pdf_contents_df = pdf_contents_df.append({"x1" : obj.bbox[0], "y1" : obj.bbox[1], "x2" : obj.bbox[2], "y2" : obj.bbox[3], "content" : text_val, "page" : pg}, ignore_index = True)

            st += obj.get_text().replace('\n', '')
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            print("elif")
            pdf_contents_df = parse_obj(obj._objs, pdf_contents_df, pg)

    #print(st)
    return pdf_contents_df

# loop over all pages in the document
i = 0
for page in PDFPage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    print(i)
    i += 1
    # extract text from this object
    pdf_contents_df = parse_obj(layout._objs, pdf_contents_df, i)
```
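The core of the script above is the step that turns pdfminer's `(cid:N)` placeholders into characters with `chr(N)`. A compact sketch of just that step is shown here. It assumes, as the script does, that the CID equals the character's code point, which is not true for PDFs in general. The name `decode_cid_tokens` is hypothetical, and unlike the script, this version keeps any text that is not a `(cid:N)` token.

```python
import re

# Matches pdfminer-style placeholders such as "(cid:66)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cid_tokens(text: str) -> str:
    """Replace each "(cid:N)" token with chr(N); leave other text unchanged.

    Assumes the CID is identical to the Unicode code point.
    """
    return CID_TOKEN.sub(lambda match: chr(int(match.group(1))), text)
```

For example, `decode_cid_tokens("(cid:72)(cid:105)")` yields `"Hi"`.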

JorjMcKie commented 3 years ago

Oha, not a simple thing! Look, I will just drop that at this point. It is an upstream bug anyway (MuPDF), and I have submitted an issue in their bug system using your PDF - which is ok I hope. This is the URL for your reference: https://bugs.ghostscript.com/show_bug.cgi?id=703213.

SuryaViswanath commented 3 years ago

> I have submitted an issue in their bug system using your PDF - which is ok I hope.

Well, thanks for the timely help. If you don't mind, I would like to send in another file for this purpose. Would you kindly take that one down and replace it with the file I will attach? Please and thank you.

JorjMcKie commented 3 years ago

ok, go ahead

JorjMcKie commented 3 years ago

BTW you can also send that file to them and change the issue I sent you. That system is open to everyone.

JorjMcKie commented 3 years ago

If you could do that, it would be great, because it demonstrates a broader need.

SuryaViswanath commented 3 years ago

Hey @JorjMcKie, what kinds of encoded PDF files can text currently be read from reliably? It seems the attached file had some Identity-H mapping involved. Could you please help me understand which types of encoded PDF files are suitable for the fitz module in the current version? Thanks.

JorjMcKie commented 3 years ago

I think you mean what font types are supported by MuPDF? Well, before your example I was convinced that there are no real restrictions here. If the ?-symbol occurred, then this was for the valid reason that the character really has no UTF-8 representation. Your example, however, showed that there are some support issues for fonts with a /CIDToGIDMap specification.

JorjMcKie commented 3 years ago

I think that special specification, Identity-H plus (!!!) ToUnicode=Identity-H (*), confuses MuPDF, so it does not look further for the CID to GID mapping ...

(*) Specifying /ToUnicode /Identity-H, as happens in the example file, looks like an illegal specification for Type 0 fonts by the book; see page 453 here: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
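To see whether a page uses fonts of the kind discussed here, the font list can be filtered by its encoding entry. This is a minimal sketch: the helper name `identity_h_fonts` is hypothetical, and it assumes the entries have the layout PyMuPDF uses for font lists, `(xref, ext, type, basefont, name, encoding)`.

```python
def identity_h_fonts(font_entries):
    """Return the base font names of all entries encoded as Identity-H.

    Each entry is assumed to be a tuple laid out as
    (xref, ext, type, basefont, name, encoding).
    """
    return [entry[3] for entry in font_entries if entry[5] == "Identity-H"]
```

The input could come from `page.getFontList()` in PyMuPDF 1.18.0 (`page.get_fonts()` in current versions). A hit only means the font is worth a closer look; many Identity-H fonts extract without problems.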

bafonso commented 3 years ago

Did this ever get fixed? I'm running into the same problem. The given sample pdf still shows problems too. It seemed like a commit fixed it but I wonder if pymupdf already uses this.

JorjMcKie commented 3 years ago

Yes there was an update in MuPDF. But this is still in their development stage and not released yet. So it is not available as an official code base for PyMuPDF. The next MuPDF version cannot be far off ...

SAIVENKATARAJU commented 3 years ago

I am also facing the same issue. @SuryaViswanath, were you able to find a workaround for this?

ritesh17k commented 2 years ago

Hi @JorjMcKie, could you please update the thread if support for this has gone into PyMuPDF? I am still facing this issue on my PDFs. Thanks.

JorjMcKie commented 2 years ago

@ritesh17k - there will never be a general solution to this, because some PDFs simply are built with fonts which do not contain a backtranslation from glyphs to unicodes. Only with this type of information (usually the array /ToUnicode) can text extraction work. What I can say is that the text of the (only) example file in this thread, testFile.pdf above, can be extracted with the current PyMuPDF version.
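A rough way to check a given PDF for this is to look up the /ToUnicode key of each font object. The sketch below assumes a PyMuPDF version that provides `Document.xref_get_key()`, which returns a `(type, value)` pair and reports `("null", "null")` for a missing key. The helper name `fonts_without_tounicode` is hypothetical, and the font entries are assumed to be laid out as `(xref, ext, type, basefont, name, encoding)`.

```python
def fonts_without_tounicode(doc, font_entries):
    """Return base font names whose font object has no /ToUnicode key.

    `doc` only needs an `xref_get_key(xref, key)` method that returns a
    (type, value) pair, as PyMuPDF's Document does.
    """
    missing = []
    for entry in font_entries:
        xref, basefont = entry[0], entry[3]
        key_type, _value = doc.xref_get_key(xref, "ToUnicode")
        if key_type == "null":  # the key is absent from the font dictionary
            missing.append(basefont)
    return missing
```

A missing /ToUnicode entry is a hint rather than proof of trouble, since simple fonts with a standard encoding can still be extracted. Conversely, the example file in this thread did have the key, but with the unusual value /Identity-H.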

pymupdf / PyMuPDF

Question / Comment: fitz returns text with � when reading the pdf file #741
