prabhu / pdf2json

Automatically exported from code.google.com/p/pdf2json

Convert other encodings to UTF8 #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Usually, JSON strings are supposed to be UTF-8. 

However, when I decode a Latin-1 PDF, I get a JSON file with Latin-1 data.

It would be useful if things were converted automatically, since many JSON 
tools and libraries expect the correct encoding.

Please see the attachment for a test file and the corresponding result (using 
pdf2json 0.68).

Original issue reported on code.google.com by marcel.r...@googlemail.com on 15 Apr 2014 at 11:03


GoogleCodeExporter commented 9 years ago
You need to pass the -enc argument to pdf2json:

    $ pdf2json -enc UTF-8 test.pdf foo-with-utf8.json

Original comment by zaki.mug...@gmail.com on 20 Apr 2014 at 4:12

GoogleCodeExporter commented 9 years ago
What the hell am I doing wrong?

If I produce the JSON for the test.pdf file given above with pdf2json, it shows 
the same Latin characters.

But if I use FlexPaper Publisher, it gives the correct encoding.

I am of course using UTF-8, but nothing works.
I am on Windows 8.

Original comment by tan...@gmail.com on 3 Dec 2014 at 6:01
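
For anyone who still ends up with Latin-1 bytes in the output, one possible workaround is to re-encode the JSON file after pdf2json has written it. The following is only a minimal sketch, not part of pdf2json: the function name `reencode_to_utf8` and the file names are hypothetical, and it assumes the output really is Latin-1 (ISO-8859-1) whenever it is not already valid UTF-8.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="latin-1"):
    """Rewrite the file at src as UTF-8 at dst and return the decoded text.

    Hypothetical helper: assumes the input is either valid UTF-8 already
    or encoded in source_encoding (Latin-1 by default).
    """
    raw = Path(src).read_bytes()
    try:
        # If the bytes already decode as UTF-8, keep the text unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the assumed legacy encoding.
        text = raw.decode(source_encoding)
    # Write bytes directly so no newline translation happens on Windows.
    Path(dst).write_bytes(text.encode("utf-8"))
    return text


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own pdf2json output.
    reencode_to_utf8("foo.json", "foo-with-utf8.json")
```

Trying UTF-8 first means the script should be safe to run on output that is already correct, since text that decodes as UTF-8 is written back unchanged.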