smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Hex (?) output #121

Open dearsina opened 8 years ago

dearsina commented 8 years ago

I use the parser to extract text from the insurance policies I receive from my insurer. They recently changed systems, and now when I parse their PDFs, the output looks like hex. See below for a sample.

536563757269747920436F64653A4141413230304339333238423442463338464646344436323838314239393045436C69656E7420436F64653A484156465433302D3230303735392F3049737375696E67204167656E743A20204D6F746F7263616465204C6F6E646F6E20466C656574204167656E74506167652031206F662031436F6E746163742054656C2E3A203032303737343131303530494D504F5254414E545448455345204E4F5445532041524520464F52 20 594F55522047554944414E4345546F20656E7375726520 66756C6C2070726F74656374696F6E20 756E64657220796F757220 706F6C69637920697420697320 657373656E7469616C 20746F20 6E6F7469667920 796F757220 4167656E7420776974682077686F6D20 796F7520617272616E676564 207468697320 696E737572616E6365206F66 20616E7920 6368616E676520 746F2074686520 64657461696C73 20

What can I do to translate this into human-readable text? Is there a setting I'm missing somewhere, or is this a fault with pdfparser?

I'm working with a single-page PDF. Using any online extractor, I am able to get the text out, so the PDF itself seems fine. I think I need to change something in my process. Any help would be much appreciated.
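As a stopgap, the sample above appears to be plain hex-encoded single-byte text (two hex digits per character, with some stray spaces), so it can be decoded after extraction. Below is a minimal sketch in Python, under that assumption. `decode_hex_text` is a hypothetical helper, not part of pdfparser, and the `latin-1` default is a guess at the encoding. It does not address why the parser emits hex in the first place.

```python
def decode_hex_text(raw: str, encoding: str = "latin-1") -> str:
    """Decode a hex dump such as '5365...' into readable text.

    Assumes every character is one byte written as two hex digits.
    Whitespace between digits is treated as extraction noise and dropped.
    """
    compact = "".join(raw.split())  # remove spaces and line breaks
    if len(compact) % 2 != 0:
        raise ValueError("hex input must contain an even number of digits")
    # bytes.fromhex raises ValueError on any non-hex character
    return bytes.fromhex(compact).decode(encoding)


if __name__ == "__main__":
    # First 14 bytes of the sample output from this issue
    print(decode_hex_text("536563757269747920436F64653A"))  # Security Code:
```

In PHP, the built-in `hex2bin()` performs the same conversion once whitespace has been stripped from the string.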

Connum commented 4 years ago

This may have been fixed by #344, but we'd need a sample PDF to test against. If you can't provide the PDF for copyright or privacy reasons, it would be great if you could check that fix against your file.