metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0
1.03k stars, 113 forks

Check for Unicode chars in PDF files #13

Closed: davemcphee closed this issue 8 years ago

davemcphee commented 8 years ago

    C:\Users\User\Desktop\python\CouncilAgendaMapper>pdfx --debug agenda.pdf
    DEBUG - init - Init with uri: agenda.pdf
    Document infos:
    - Author = blah
    - CreationDate = D:20151106153359-06'00'
    Traceback (most recent call last):
      File "C:\Python33\lib\runpy.py", line 160, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "C:\Python33\lib\runpy.py", line 73, in _run_code
        exec(code, run_globals)
      File "C:\Python33\Scripts\pdfx.exe\__main__.py", line 9, in <module>
      File "C:\Python33\lib\site-packages\pdfx\cli.py", line 90, in main
        print("- %s = %s" % (k, parse_str(v).strip("/")))
      File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 21: character maps to <undefined>

Not sure what the correct way to handle encoding errors would be, skip the char?

I usually fix this kind of issue by changing the cmd window encoding, e.g.:

    cmd> chcp 65001

metachris commented 8 years ago

Thanks for reporting! Can you send me the pdf which produces the error?
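
For readers following along, the failure in the traceback above can be reproduced without pdfx or the PDF. The sketch below encodes the same character ('\xae', the registered sign) with the cp437 codec named in the traceback, then compares three of Python's standard error handlers as possible answers to the "skip the char?" question. The sample string is made up for illustration and is not taken from the actual PDF metadata.

```python
# The character from the traceback: U+00AE (registered sign).
ch = "\xae"

# cp437, the console code page named in the traceback, has no mapping for it.
try:
    ch.encode("cp437")
    failed = False
except UnicodeEncodeError:
    failed = True

# Hypothetical metadata value, used only to compare the error handlers.
text = "Acme\xae Agenda"

# "ignore" drops the character, which is the "skip the char" option.
skipped = text.encode("cp437", errors="ignore").decode("cp437")

# "replace" substitutes a question mark, so the reader sees something was lost.
replaced = text.encode("cp437", errors="replace").decode("cp437")

# "backslashreplace" keeps the code point visible as an escape sequence.
escaped = text.encode("cp437", errors="backslashreplace").decode("cp437")

# ascii() keeps this line printable even on a console that cannot show the text.
print(failed, ascii(skipped), ascii(replaced), ascii(escaped))
```

Which handler is preferable depends on whether silently losing characters is acceptable. "replace" and "backslashreplace" at least leave a visible trace in the output.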

davemcphee commented 8 years ago

Hi Chris,

Sure thing, I think it was this: http://www.austintexas.gov/edims/document.cfm?id=242173

I know nothing about PDF encoding, but I do know about parsing HTML from that source, and it's all generated in MS Word, so expect a horror show inside that PDF. Good luck :)

metachris commented 8 years ago

Just a note: I've found out that this problem is caused by the Windows console encoding (cp437 in the traceback) not being able to represent certain Unicode characters. I'm going to look into a solution.
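
One way to check this kind of diagnosis is to ask which characters in a string a given console encoding cannot represent. The helper below is a hypothetical illustration, not part of pdfx. It also reads `sys.stdout.encoding`, which is where Python reports the encoding it will use for `print`.

```python
import sys


def unencodable_chars(text, encoding):
    """Return the characters in text that the given encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# sys.stdout.encoding may be None in some environments, so fall back to ASCII.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
print("console encoding:", console_encoding)

# cp437 can represent e-acute but not the registered sign from the traceback.
# ascii() keeps the report itself printable on any console.
print(ascii(unencodable_chars("Caf\xe9 \xae", "cp437")))
```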

metachris commented 8 years ago

This should work now in the GitHub version. The CLI now tries to encode its output to the console encoding: https://github.com/metachris/pdfx/blob/b52d3a72b18c7331f025b1d9fec67387fc5dd2d1/pdfx/cli.py#L111
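
The general idea of encoding output to the console encoding before printing can be sketched as follows. This is an assumption-laden illustration, not the code at the linked line: `safe_print` is a hypothetical name, and the choice of the "replace" error handler is mine. A cp437 console is simulated with an in-memory stream so the example behaves the same on any platform.

```python
import io
import sys


def safe_print(text, stream=None):
    """Write text to stream, replacing characters its encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the target encoding so unsupported characters become "?".
    printable = text.encode(encoding, errors="replace").decode(encoding)
    stream.write(printable + "\n")


# Simulate a cp437 console; newline="\n" disables platform newline translation.
buffer = io.BytesIO()
fake_console = io.TextIOWrapper(buffer, encoding="cp437", newline="\n")

# Writing this string directly to fake_console would raise UnicodeEncodeError.
safe_print("- Author = Acme\xae", stream=fake_console)
fake_console.flush()
shown = buffer.getvalue()
print(shown)
```

The trade-off is that the unsupported character is shown as "?" instead of stopping the program, so the remaining metadata and references still get printed.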