maxpmaxp / pdfreader

Python API for PDF documents
MIT License

Error when trying to read a pdf with Identity-H encoding #77

Closed botev closed 3 years ago

botev commented 3 years ago

Trying to load a PDF that apparently uses Identity-H encoding, I get the error:

[Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/pdfreader/codecs/cmaps/Identity-H'

I can see that there is a codecs folder that looks like the right place, but is it possible to install additional codecs, or is there a workaround for this?

mstgrant commented 3 years ago

I'm currently getting the same error as well:

FileNotFoundError: [Errno 2] No such file or directory: 'Python38-32\lib\site-packages\pdfreader\codecs\cmaps\Identity-H'

In my case, I get this error when I try to walk through all of the document's pages and extract their data.
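
Both tracebacks point at a file missing from the installed package's cmaps directory. As a rough way to see what an installation actually ships, here is a minimal sketch. The helper names `list_cmaps` and `installed_package_dir` are hypothetical, and the `codecs/cmaps` layout is assumed from the paths in the error messages above.

```python
import importlib.util
from pathlib import Path


def list_cmaps(package_dir):
    """Return the sorted file names found under <package_dir>/codecs/cmaps."""
    cmap_dir = Path(package_dir) / "codecs" / "cmaps"
    if not cmap_dir.is_dir():
        return []
    return sorted(p.name for p in cmap_dir.iterdir() if p.is_file())


def installed_package_dir(name="pdfreader"):
    """Locate the installed package directory, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    pkg = installed_package_dir()
    if pkg is None:
        print("pdfreader is not installed")
    else:
        # Directory layout assumed from the tracebacks in this thread.
        print("Identity-H present:", "Identity-H" in list_cmaps(pkg))
```

If `Identity-H` is not listed, the error is coming from the package contents rather than from the PDF itself.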

maxpmaxp commented 3 years ago

@botev @mstgrant can you share your PDF and the code that triggers this error?

aleksandar-devedzic commented 3 years ago

I have the same issue. Here is my code, with an example PDF:

from io import BytesIO

import requests
from pdfreader import SimplePDFViewer

pdf_link = 'https://www.neerach.ch/public/upload/assets/1417/MTB0321.pdf'
response = requests.get(pdf_link, stream=True)
my_raw_data = response.content

# accumulates the text of all pages
full_pdf_text = ''

# extract text page by page
with BytesIO(my_raw_data) as data:

    viewer = SimplePDFViewer(data)

    # get all pages
    all_pages = [p for p in viewer.doc.pages()]
    total_page_num = len(all_pages)

    for page_number in range(1, total_page_num + 1):
        viewer.navigate(page_number)
        viewer.render()
        page_strings = "".join(viewer.canvas.strings)
        # treat runs of spaces as paragraph breaks
        page_strings = page_strings.strip().replace('     ', '\n\n').strip()
        page_strings = page_strings.replace('  ', '\n\n').strip()

        # joining text from all pages
        full_pdf_text += page_strings + '\n\nPage: ' + str(page_number) + ' / ' + str(total_page_num) + '\n\n\n'

    print(full_pdf_text)

The error that I get: FileNotFoundError: [Errno 2] No such file or directory: '/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pdfreader/codecs/cmaps/Identity-H'

I have tried other PDF extraction libs, and the result that I get is: ...(cid:57)(cid:72)(cid:85)(cid:75)(cid:68)(cid:81)...

That is some strange type of encoding, but if I translate the numbers to ASCII I do not get the correct chars.

maxpmaxp commented 3 years ago

@aleksandar-devedzic I can't reproduce the issue. Which pdfreader version are you using?

maxpmaxp commented 3 years ago

@aleksandar-devedzic I just reproduced it. Working on a patch.

maxpmaxp commented 3 years ago

@aleksandar-devedzic can you check v0.1.10 please? Should be good.
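
For anyone unsure which release they are running, a small sketch of checking the installed version against 0.1.10, the release named above. The helper names `as_tuple` and `has_fix` are hypothetical, and the comparison assumes purely numeric dotted versions. Comparing tuples of integers matters here because, as plain strings, "0.1.9" sorts after "0.1.10".

```python
from importlib import metadata  # standard library, Python 3.8+

FIXED_IN = (0, 1, 10)  # release mentioned in this thread


def as_tuple(version):
    """Convert a dotted numeric version such as '0.1.10' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def has_fix(version):
    """True if the given version is at least FIXED_IN."""
    return as_tuple(version) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("pdfreader")
    except metadata.PackageNotFoundError:
        print("pdfreader is not installed")
    else:
        print(installed, "includes the patch:", has_fix(installed))
```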

aleksandar-devedzic commented 3 years ago

I have installed the latest version, but it does not work well. I guess that you replaced the values in (cid:) with ASCII. I have tried that, but it didn't work. This is the result that I'm getting:

SUHVVXP\x031HHUDFKHU\x030LWWHLOXQJVEODWW\x035HGDNWLRQ\x03_\x03/D\\RXW\x03*HPHLQGHYHUZDOWXQJ\x031HHUDFK\x03\x03\x03\x037LWHOELOG\x03(GLWK\x036HQQ\x03_\x031HHUDFK\x03\x03\x03$XIODJH\x03XQG\x039HUVDQG\x03\x14µ\x19\x19\x13\x03([HPSODUH\x03_\x035HF\\FOLQJSDSLHU\x03_\x03HUVFKHLQW\x03PRQDWOLFK\x03\x03\x03DQ\x03DOOH\x03+DXVKDOWXQJHQ\x03GHU\x03*HPHLQGH\x031HHUDFK\x03\x03\x03'UXFN\x03JQGUXFN\x03$*\x03_\x03%DFKHQE\x81ODFK\x03\x03\x035HGDNWLRQVVFKOXVV\x03MHZHLOV\x03GHU\x03\x14\x15\x11\x037DJ\x03GHV\x030RQDWV\x03\x03\x03\n\nPage: 2 / 32\n\n\n9HUKDQGOXQJHQ\x03GHV\x03*HPHLQGHUDWHV\x03\x16_\x15\x13\x15\x14\x03\x03\x03\x14\x03&RURQDYLUXV\x03,QIRUPDWLRQHQ\x03_\x036WDQG\x03\x14\x19\x11\x03)HEUXDU\x03\x15\x13\x15\x14\x036lPWOLFKH\x03,QIRUP]()

I guess that you translated the numbers inside (cid:) to chars with ASCII. That way you get chars, but not the correct ones. I fixed the issue by adding 29 to each number inside (cid:...).
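
The "add 29" workaround can be sketched as follows. This is only an illustration of the shift described above, not pdfreader's decoding logic. The offset of 29 is what was observed for this particular PDF and will likely differ for other fonts, and the function names are hypothetical.

```python
import re

CID_OFFSET = 29  # offset reported in this thread for this specific PDF
_CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cid_tokens(text, offset=CID_OFFSET):
    """Replace each '(cid:N)' token with the character chr(N + offset)."""
    return _CID_TOKEN.sub(lambda m: chr(int(m.group(1)) + offset), text)


def shift_chars(text, offset=CID_OFFSET):
    """Shift the code point of every character in text by offset."""
    return "".join(chr(ord(ch) + offset) for ch in text)


# Sample taken from the "(cid:...)" output quoted earlier in the thread.
print(decode_cid_tokens("(cid:57)(cid:72)(cid:85)(cid:75)(cid:68)(cid:81)"))  # Verhan

# Sample taken from the v0.1.10 output; "\x03" becomes a space after the shift.
print(shift_chars("1HHUDFKHU\x030LWWHLOXQJVEODWW"))  # Neeracher Mitteilungsblatt
```

Note that `shift_chars` shifts every character, including any newlines or page markers added by your own code, so it should be applied to the extracted strings before joining them with other text.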

maxpmaxp commented 3 years ago

@aleksandar-devedzic v0.1.10 is on pypi now https://pypi.org/project/pdfreader/0.1.10/

I've created an issue for the decoding problem you described. See #81