botev closed 3 years ago
I'm currently getting the same error as well:
FileNotFoundError: [Errno 2] No such file or directory: 'Python38-32\lib\site-packages\pdfreader\codecs\cmaps\Identity-H'
However, I get this error when I try to walk through all the document's pages and extract their data.
@botev @mstgrant can you share your pdf and your code when you see this error?
I have the same issue, this is my code with pdf example:
import requests
from io import BytesIO
from pdfreader import SimplePDFViewer

pdf_link = 'https://www.neerach.ch/public/upload/assets/1417/MTB0321.pdf'
response = requests.get(pdf_link, stream=True)
my_raw_data = response.content
# accumulator for the text of all pages
full_pdf_text = ''
# extract text page by page
with BytesIO(my_raw_data) as data:
    viewer = SimplePDFViewer(data)
    # get all pages
    all_pages = [p for p in viewer.doc.pages()]
    total_page_num = len(all_pages)
    for page_number in range(1, total_page_num+1):
        viewer.navigate(int(page_number))
        viewer.render()
        page_strings = "".join(viewer.canvas.strings)
        page_strings = page_strings.strip().replace(' ', '\n\n').strip()
        page_strings = page_strings.replace(' ', '\n\n').strip()
        # joining text from all pages
        full_pdf_text += page_strings + '\n\nPage: ' + str(page_number) + ' / ' + str(total_page_num) + '\n\n\n'
print(full_pdf_text)
The error that I get:
FileNotFoundError: [Errno 2] No such file or directory: '/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pdfreader/codecs/cmaps/Identity-H'
I have tried other PDF extraction libraries, and the result I get is:
...(cid:57)(cid:72)(cid:85)(cid:75)(cid:68)(cid:81)...
That is some strange type of encoding, but if I translate the numbers to ASCII I do not get the correct characters.
@aleksandar-devedzic I can't reproduce the issue. What pdfreader version do you use?
@aleksandar-devedzic I just reproduced it. Working on a patch.
@aleksandar-devedzic can you check v0.1.10 please? Should be good.
I have installed the latest version, but it still does not work correctly.
I guess that you replaced values for (cid:SUHVVXP\x031HHUDFKHU\x030LWWHLOXQJVEODWW\x035HGDNWLRQ\x03_\x03/D\\RXW\x03*HPHLQGHYHUZDOWXQJ\x031HHUDFK\x03\x03\x03\x037LWHOELOG\x03(GLWK\x036HQQ\x03_\x031HHUDFK\x03\x03\x03$XIODJH\x03XQG\x039HUVDQG\x03\x14µ\x19\x19\x13\x03([HPSODUH\x03_\x035HF\\FOLQJSDSLHU\x03_\x03HUVFKHLQW\x03PRQDWOLFK\x03\x03\x03DQ\x03DOOH\x03+DXVKDOWXQJHQ\x03GHU\x03*HPHLQGH\x031HHUDFK\x03\x03\x03'UXFN\x03JQGUXFN\x03$*\x03_\x03%DFKHQE\x81ODFK\x03\x03\x035HGDNWLRQVVFKOXVV\x03MHZHLOV\x03GHU\x03\x14\x15\x11\x037DJ\x03GHV\x030RQDWV\x03\x03\x03\n\nPage: 2 / 32\n\n\n9HUKDQGOXQJHQ\x03GHV\x03*HPHLQGHUDWHV\x03\x16_\x15\x13\x15\x14\x03\x03\x03\x14\x03&RURQDYLUXV\x03,QIRUPDWLRQHQ\x03_\x036WDQG\x03\x14\x19\x11\x03)HEUXDU\x03\x15\x13\x15\x14\x036lPWOLFKH\x03,QIRUP]()
I guess that you translated the numbers inside the (cid:...) tokens.
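For anyone comparing the two outputs quoted above: in this particular PDF the codes appear to sit a constant 29 below the intended ASCII values (for example `1HHUDFKHU` reads as `Neeracher` once shifted). The sketch below only illustrates that observation. The offset is an assumption inferred from this one sample, the helper names are hypothetical and not part of pdfreader, and the proper fix is decoding through the font's own character map.

```python
import re

# Assumption: inferred from the sample output in this thread, not a general rule.
CID_OFFSET = 29


def shift_codes(text: str, offset: int = CID_OFFSET) -> str:
    """Shift every character's code point up by a fixed offset."""
    return "".join(chr(ord(ch) + offset) for ch in text)


def decode_cid_run(text: str, offset: int = CID_OFFSET) -> str:
    """Turn a run of '(cid:N)' tokens into text by adding the offset to each N."""
    numbers = re.findall(r"\(cid:(\d+)\)", text)
    return "".join(chr(int(n) + offset) for n in numbers)


print(shift_codes("1HHUDFKHU\x030LWWHLOXQJVEODWW"))
print(decode_cid_run("(cid:57)(cid:72)(cid:85)(cid:75)(cid:68)(cid:81)"))
```

The shift does not recover everything: the bytes that should be non-ASCII letters, such as the `\x81` in `%DFKHQE\x81ODFK`, do not follow it, so this is a diagnostic aid rather than a workaround.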
@aleksandar-devedzic v0.1.10 is on pypi now https://pypi.org/project/pdfreader/0.1.10/
I've created an issue for the decoding problem you described. See #81
Trying to load a PDF which apparently has an Identity-H encoding, I get the error:
I'm seeing that there is potentially a correct codecs folder, but is it possible to install codecs or are there any workarounds for this?
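One way to check whether an installed copy of the package actually ships the file named in the traceback is to look for it on disk. This is a minimal sketch: the `codecs/cmaps/` layout is taken from the error messages in this thread, and `find_package_file` and `find_cmap` are hypothetical helper names, not pdfreader functions.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_file(package: str, *parts: str) -> Optional[Path]:
    """Return the path of a file inside an installed package, or None."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    # No spec means the package is not installed; no search locations
    # means it is a plain module rather than a package directory.
    if spec is None or not spec.submodule_search_locations:
        return None
    for root in spec.submodule_search_locations:
        candidate = Path(root).joinpath(*parts)
        if candidate.is_file():
            return candidate
    return None


def find_cmap(name: str = "Identity-H") -> Optional[Path]:
    """Look for a CMap file under pdfreader/codecs/cmaps (layout assumed from the traceback)."""
    return find_package_file("pdfreader", "codecs", "cmaps", name)
```

Calling `find_cmap()` returns the file's path when it is present and `None` otherwise. If it is missing, upgrading to the v0.1.10 release mentioned in this thread is the fix the maintainer proposed.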