jzillmann / pdf-to-markdown

A PDF to Markdown converter
https://pdf2md.morethan.io
MIT License

UTF-8 Support #5

Closed sati-bodhi closed 6 years ago

sati-bodhi commented 6 years ago

Is it possible to add UTF-8 support to the app? Trying to convert PDFs with CJK characters produces garbled text.

jzillmann commented 6 years ago

Hey @sati-bodhi, the site uses https://mozilla.github.io/pdf.js/ for the underlying parsing. Can you check whether your PDF renders correctly with it? You should be able to test it through https://mozilla.github.io/pdf.js/web/viewer.html

sati-bodhi commented 6 years ago

Thanks, @jzillmann! I just cloned the code. How do I go about testing?


I am following the instructions from readme.md right now. Will update you on how it goes.


I got to the server part, but I am not sure how to initiate the conversion program from there.


jzillmann commented 6 years ago

Hmm, I'm a bit confused. Did you clone pdf.js or pdf-to-markdown? As a first step, I would recommend just uploading your PDF to https://mozilla.github.io/pdf.js/web/viewer.html. If this online viewer displays your PDF correctly, we know that pdf.js is able to parse your PDF and we can go a step further. If not, we have probably reached a dead end.

sati-bodhi commented 6 years ago

I cloned pdf.js. I did the upload and it displayed correctly.

sati-bodhi commented 6 years ago

The http://pdf2md.morethan.io/ engine works for this sample:

sample_success.pdf

But not this one:

sample_fail.pdf

Both can be rendered correctly by the online viewer.

jzillmann commented 6 years ago

Hey @sati-bodhi, thanks for the samples. I tried a few things but have had no luck so far. I asked the pdf.js community for help: https://github.com/mozilla/pdf.js/issues/9692

jzillmann commented 6 years ago

Based on the response in https://github.com/mozilla/pdf.js/issues/9692, I'm closing this as won't fix. The PDF seems to have incomplete ToUnicode data.
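
For readers unfamiliar with ToUnicode data, here is a rough sketch of why an incomplete table can let a PDF render fine yet extract as garbled text. A viewer draws glyphs by character code, while text extraction has to look each code up in the font's ToUnicode CMap. This is not how pdf.js implements it; the CMap excerpt, the function names `parse_bfchar` and `decode_codes`, and the sample codes are all made up for illustration, and only `bfchar` entries are handled.

```python
import re

# Hypothetical, simplified ToUnicode CMap excerpt: the page uses glyph
# codes 1, 2 and 3, but only codes 1 and 2 have a Unicode mapping.
TOUNICODE_CMAP = """
2 beginbfchar
<0001> <4E2D>
<0002> <6587>
endbfchar
"""

# One bfchar entry: <source code> <destination as UTF-16BE hex>.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from bfchar sections only.

    Simplification: bfrange sections and malformed entries are ignored.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code_hex, uni_hex in BFCHAR_ENTRY.findall(section):
            # Destination strings in a ToUnicode CMap are UTF-16BE.
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def decode_codes(codes, mapping, missing="\ufffd"):
    """Translate glyph codes to text, marking unmapped codes with U+FFFD."""
    return "".join(mapping.get(code, missing) for code in codes)


mapping = parse_bfchar(TOUNICODE_CMAP)
text = decode_codes([1, 2, 3], mapping)
print(text)
```

In this sketch the third code has no entry, so it comes out as a replacement character even though the glyph itself could still be drawn on screen.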

sati-bodhi commented 6 years ago

Thanks! The answer given by @timvandermeij was elaborate indeed.