Closed SuryaViswanath closed 3 years ago
This question mark symbol is generated by MuPDF directly -- my C base library, for which I am providing the PyMuPDF bindings. The important info here is: I have no influence on this, it can't be changed.
The symbol always appears when MuPDF encounters a character for which it cannot determine a Unicode (UTF-8) value. And MuPDF is usually correct in telling us this: as an "outside" cross-check, copy the text from inside some PDF viewer and paste it into Word.
You should see that Word does not understand what this character means either.
The only way out is OCR-ing the PDF and then using the result ... Some background here: inside the font's fontfile program, the visual appearance (the "glyph") and its originating unicode are two different things. From seeing a glyph in the PDF, you cannot deduce the glyph number and hence cannot find out the unicode.
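To judge how badly a given file is affected before deciding on OCR, one could count the replacement character in the extracted text. This is only a minimal sketch: it assumes the question mark symbol is U+FFFD (the "�" shown later in this thread), and the helper name `replacement_stats` is made up for illustration. The PyMuPDF calls appear only in a comment and are not executed here.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, assumed to be the symbol discussed above


def replacement_stats(text):
    """Return (count, ratio) of U+FFFD among the non-whitespace characters."""
    visible = [ch for ch in text if not ch.isspace()]
    count = sum(1 for ch in visible if ch == REPLACEMENT_CHAR)
    ratio = count / len(visible) if visible else 0.0
    return count, ratio


# Hypothetical usage with PyMuPDF (recent releases; older ones such as
# 1.18.0 spell the method getText instead of get_text):
#   import fitz
#   doc = fitz.open("testFile.pdf")
#   for page in doc:
#       print(page.number, replacement_stats(page.get_text()))

count, ratio = replacement_stats("Invoice \ufffd\ufffd 2020")
print(count, round(ratio, 3))
```

A high ratio on a page suggests that the page's font lacks a usable glyph-to-unicode mapping, in which case OCR is probably the more reliable route.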
Thanks for the quick reply. However, I am able to read the same file and extract the text using pdfminer. So do you think there is a possibility in the future to have this issue resolved?
thanks
> So do you think there is a possibility in the future to have this issue resolved?
Excuse this sarcastic answer: Without granting me a reproducing PDF? No.
BTW I find this amazing - almost unbelievable.
I am so sorry if I sound inappropriate, but I am trying to understand: what is unbelievable? In the meantime, I will see what I can do about giving you a PDF file to work on.
> unbelievable
It is hard to believe that a pure-Python library like pdfminer is able to do it ... compared with a - after all - commercial product like MuPDF.
I am sure there are possibilities at every corner. That being said, I have a PDF file in which I edited most of the document, while it still has the � values in quite a lot of places. I am not sure how close the file is to the actual encoding present, but I think it should be similar. Please do let me know if I can provide any other details. testFile.pdf
Thanks! I will look into it.
Uhm ... and pdfminer is able to extract from this guy ...?
Before I make myself knowledgeable with pdfminer, could you please send me your pdfminer script to print the text? Thanks!
I hope this helps. Since you have already recreated the PDF file, I am removing the above comment for personal reasons.

```python
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import pandas as pd

fp = open(r"image-copy.pdf", 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
# One dict per text box; collected in a list because DataFrame.append
# was removed in pandas 2.0.
rows = []


def parse_obj(lt_objs, rows, pg):
    for obj in lt_objs:
        # if it's a textbox, decode its "(cid:NN)" tokens and record text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            txt = obj.get_text().replace('\n', '')
            #print(txt)
            st = txt.split("(")
            st = ["(" + i for i in st[1:]]
            text_val = ""
            for i in st:
                text_str = i.strip()
                if 'cid' in text_str.lower():
                    text_str = text_str.strip('(')
                    text_str = text_str.strip(')')
                    ascii_num = text_str.split(':')[-1]
                    ascii_num = int(ascii_num)
                    text_val += chr(ascii_num)  # 66 = 'B' in ascii
            #print(text_val)
            print("%6d, %6d,%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.bbox[2], obj.bbox[3], text_val))
            rows.append({"x1": obj.bbox[0], "y1": obj.bbox[1], "x2": obj.bbox[2], "y2": obj.bbox[3], "content": text_val, "page": pg})
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            print("elif")
            parse_obj(obj._objs, rows, pg)
    return rows


i = 0
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    print(i)
    i += 1
    # extract text from this object
    parse_obj(layout._objs, rows, i)

pdf_contents_df = pd.DataFrame(rows, columns=["x1", "y1", "x2", "y2", "content", "page"])
```
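The string splitting in the script above boils down to turning each `(cid:NN)` token that pdfminer emits into `chr(NN)`. A more compact way to express that step might look like the sketch below. It rests on the same assumption the script makes, namely that the CID happens to equal the character's code point, which is not true for fonts in general. The helper name `decode_cid_tokens` is hypothetical. Unlike the script, this variant keeps any text that is not a `(cid:NN)` token.

```python
import re

# Matches the "(cid:NN)" placeholders pdfminer prints for unmapped characters.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cid_tokens(text):
    """Replace every "(cid:N)" token with chr(N); leave other text untouched."""
    return CID_TOKEN.sub(lambda m: chr(int(m.group(1))), text)


print(decode_cid_tokens("(cid:72)(cid:105) there"))
```

Because the regular expression only matches complete tokens, stray parentheses in ordinary text pass through unchanged instead of raising a `ValueError` in `int()`.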
Oha, not a simple thing! Look, I will just drop that at this point. It is an upstream bug anyway (MuPDF), and I have submitted an issue in their bug system using your PDF - which is ok I hope. This is the URL for your reference: https://bugs.ghostscript.com/show_bug.cgi?id=703213.
> I have submitted an issue in their bug system using your PDF - which is ok I hope.
Well, thanks for the timely help. If you don't mind, I would like to send in another file for this purpose. Would you kindly take the current one down and replace it with the file I will attach? Please and thank you.
ok, go ahead
BTW you can also send that file to them and change the issue I sent you. That system is open to everyone.
If you could do that, it would be great, because it demonstrates a broader need.
Hey @JorjMcKie, what kinds of encoded PDF files can text currently be read from? It seems the attached file involved some Identity-H mapping. Could you please help me understand which types of encoded PDF files the current version of the fitz module can handle? Thanks
I think you mean what font types are supported by MuPDF?
Well, before your example I was convinced that there were no real restrictions here. If the ?-symbol occurred, then this was for the valid reason that the character really has no UTF-8 representation.
Your example, however, showed that there are some support issues for fonts with a /CIDToGIDMap specification.
I think that special specification, Identity-H plus (!!!) ToUnicode=Identity-H (*), confuses MuPDF, so it does not look further for the CID to GID mapping ...
(*) Specifying `/ToUnicode /Identity-H`, as happens in the example file, looks like an illegal specification for Type 0 fonts by the book, see page 453 here: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
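To check whether some other file uses the same construct, one could look at the textual definition of each font object and see what `/ToUnicode` points to. The sketch below works on the object source as a plain string. The function name `classify_tounicode` and its return labels are invented for illustration, and the regular expression is a rough approximation, not a PDF parser. In recent PyMuPDF versions the object source should be obtainable with `doc.xref_object(xref)`, but that call is not exercised here.

```python
import re

# /ToUnicode followed by either a name (e.g. /Identity-H)
# or an indirect reference (e.g. "12 0 R").
TOUNICODE = re.compile(r"/ToUnicode\s*(/[^\s/<>\[\]()]+|\d+\s+\d+\s+R)")


def classify_tounicode(obj_source):
    """Classify the /ToUnicode entry of a font dictionary given as text."""
    m = TOUNICODE.search(obj_source)
    if m is None:
        return "missing"
    value = m.group(1)
    if value.startswith("/"):
        return "name " + value  # the suspicious case from this thread
    return "stream reference"


font = "<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /ToUnicode /Identity-H >>"
print(classify_tounicode(font))
```

A result of `stream reference` is what one would normally expect to see, while a name such as `/Identity-H` matches the example file discussed here.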
Did this ever get fixed? I'm running into the same problem. The given sample pdf still shows problems too. It seemed like a commit fixed it but I wonder if pymupdf already uses this.
Yes there was an update in MuPDF. But this is still in their development stage and not released yet. So it is not available as an official code base for PyMuPDF. The next MuPDF version cannot be far off ...
I am also facing the same issue. @SuryaViswanath were you able to find a workaround for this?
Hi @JorjMcKie, could you pls update the thread if the support for this has gone in PyMuPDF as I still am facing this issue on my PDFs. Thanks.
@ritesh17k - there will never be a general solution to this, because some PDFs simply are built with fonts which do not contain a backtranslation from glyphs to unicodes.
Text extraction can only work with this type of information (usually the /ToUnicode mapping).
What I can say is that the text of the (only) example file in this thread, testFile.pdf above, can be extracted with the current PyMuPDF version.
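For readers wondering what such a backtranslation looks like: a /ToUnicode CMap typically contains `bfchar` sections that pair a character code with its Unicode value, both written in hex. The following is a simplified sketch with a made-up two-entry CMap. It ignores `bfrange` sections and everything else a real CMap may contain, and the function names are hypothetical. Codes without an entry fall back to U+FFFD, which mirrors the symptom reported in this thread.

```python
import re

BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from all bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR_PAIR.findall(section):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes, mapping):
    """Translate character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Made-up CMap fragment for illustration only.
cmap = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
"""
mapping = parse_bfchar(cmap)
print(repr(decode([0x24, 0x03, 0x99], mapping)))
```

The third code has no entry in the table, so nothing short of OCR can recover the character behind it.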
Thanks for this amazing library.
I was trying to follow issue #365, but I couldn't follow it through to the end to get a workaround for my project. I had the same Identity-H mapping when using `getFontList()`, and the `getText("rawdict")` output is as follows:
PyMuPDF version: 1.18.0
I cannot share the sample pdf file to help you recreate the issue but I can be actively communicating with you to give all the details you need.