ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Poppler-data and CJK #21

Closed kochiuyu closed 6 years ago

kochiuyu commented 6 years ago

Thank you for creating the package. It works very well with English PDF files.

When I try to use it on a PDF containing Chinese text (an example is attached: example.pdf),

I receive the following warning:

Warning: error: Missing language pack for 'Adobe-CNS1' mapping

I downloaded the standalone poppler pdftotext and added the poppler-data files to it. That seems to work from the command line. How can I add poppler-data to the pdftools installation so that it also works in R?

Thanks.
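A minimal sketch of how the warning above might be reproduced from R. It assumes the attached file has been saved as "example.pdf" in the working directory (the path is an assumption) and that pdftools is installed; the code does nothing otherwise.

```r
# Reproduction sketch. "example.pdf" is an assumed local copy of the
# attached file; adjust the path as needed.
pdf_file <- "example.pdf"

if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  # pdf_text() returns a character vector with one element per page.
  # Without the CJK encoding data, poppler is expected to emit the
  # "Missing language pack" warning while reading the file.
  pages <- pdftools::pdf_text(pdf_file)
  cat(substr(pages[1], 1, 200), "\n")
}
```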

jeroen commented 6 years ago

Ha, I was wondering what the poppler-data files were for. I'll have a look.

kochiuyu commented 6 years ago

It is about PDF text encodings. According to the author of the poppler-data package:

This package consists of encoding files for use with poppler. The encoding files are optional and poppler will automatically read them if they are present. When installed, the encoding files enable poppler to correctly render CJK and Cyrillic. While poppler is licensed under the GPL, these encoding files have a different license, and are thus distributed separately.

You may see the data structure here: https://www.archlinux.org/packages/extra/any/poppler-data/files/

jeroen commented 6 years ago

It is unclear how to specify the path to these files if it's not the default /usr/share/poppler/. I have asked on the mailing list; hopefully somebody responds.

kochiuyu commented 6 years ago

On Windows, you can put the data next to the folder where the poppler binaries are saved.

What I did was download the zip file from http://blog.alivate.com.au/poppler-windows/ and extract it into a folder, say, poppler. It then has four subfolders: bin, include, library, and share.

Then I downloaded the poppler-data archive from https://poppler.freedesktop.org/, extracted it, and put all of its contents in a folder called "poppler". I put this new folder under the share subfolder, so the data files end up in poppler/share/poppler. For example, the file CMakeLists.txt will be located at "poppler/share/poppler/CMakeLists.txt".

Then, if you run pdftotext from the command line (cmd) on the "example.pdf" that I provided, it works (with some minor warnings).
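The folder layout described above can be sketched in base R. The helper names `poppler_data_dir` and `has_poppler_data` are hypothetical, made up for this illustration, and the demonstration runs on a throwaway temporary directory rather than a real poppler installation.

```r
# Hypothetical helper: the folder where the steps above place poppler-data,
# i.e. <poppler_root>/share/poppler.
poppler_data_dir <- function(poppler_root) {
  file.path(poppler_root, "share", "poppler")
}

# Hypothetical helper: TRUE if that folder exists and contains files.
has_poppler_data <- function(poppler_root) {
  data_dir <- poppler_data_dir(poppler_root)
  dir.exists(data_dir) && length(list.files(data_dir)) > 0
}

# Demonstration on a fresh temporary tree (not a real poppler install).
root <- file.path(tempfile("demo"), "poppler")
dir.create(poppler_data_dir(root), recursive = TRUE)
before <- has_poppler_data(root)   # folder exists but is still empty

# A placeholder file stands in for the extracted poppler-data contents.
file.create(file.path(poppler_data_dir(root), "placeholder.txt"))
after <- has_poppler_data(root)
```

This only checks that the files sit where the command-line pdftotext was reported to find them; it says nothing about where a library loaded into R would look.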

jeroen commented 6 years ago

Right, but we don't use the poppler executables; we dynamically load poppler as a shared library into R. So I'm not sure where it looks for these files in that case.
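To illustrate the shared-library point, the sketch below prints the path of the compiled library that R has loaded for pdftools, using base R's `getLoadedDLLs()`. It assumes pdftools is installed and that its library is registered under the name "pdftools". Whether poppler itself is linked into that file or loaded separately depends on the platform and build.

```r
# Sketch: locate the shared library R loaded for pdftools (if installed).
dll_path <- NULL

if (requireNamespace("pdftools", quietly = TRUE)) {
  # Loading the namespace also loads the package's compiled code.
  dll <- getLoadedDLLs()[["pdftools"]]
  if (!is.null(dll)) {
    dll_path <- dll[["path"]]
    print(dll_path)
  }
}
```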

kochiuyu commented 6 years ago

Thanks for the explanation. It seems like a difficult problem.

jeroen commented 6 years ago

Making some progress here. So you're using Windows, right?

jeroen commented 6 years ago

@kochiuyu I've made some changes, could you please test the devel version:

devtools::install_github("ropensci/pdftools")

kochiuyu commented 6 years ago

Thank you. Yes, I use Windows.

It works nicely. The warning no longer appears.

It now imports the PDF file with Chinese characters without complaints.
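One way to check this from R, sketched under assumptions: recent pdftools versions export `poppler_config()`, and its result is assumed here to include a `has_pdf_data` field indicating whether the encoding data is available. Both may be absent in older versions, so the code checks before using them.

```r
# Sketch: inspect the poppler configuration reported by pdftools.
cfg <- NULL

if (requireNamespace("pdftools", quietly = TRUE) &&
    exists("poppler_config", envir = asNamespace("pdftools"))) {
  cfg <- pdftools::poppler_config()
  print(cfg$version)
  # has_pdf_data is assumed to be part of the result; NULL if it is not.
  print(cfg$has_pdf_data)
}
```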

Should I close the issue now? Sorry, I am new to GitHub.

jeroen commented 6 years ago

Yes, you can close it now if your problem has been addressed.

jeroen commented 6 years ago

This is now on CRAN. Thanks for the suggestion!