klippa-app / pdfium-cli

Easy to use PDF CLI tool powered by PDFium and go-pdfium
MIT License

Chinese garbled code #17

Closed hdf15804299051 closed 1 week ago

hdf15804299051 commented 1 year ago

When I use pdfium to render chinese.pdf, the Chinese text can't be parsed. What can I do to resolve this problem?

jerbob92 commented 1 year ago

Do you have an example PDF to reproduce it with and which command you're executing exactly?

hdf15804299051 commented 1 year ago

The command is ./pdfium render ./image/1.pdf ./image/%d.jpg

hdf15804299051 commented 1 year ago

Maybe the Chinese font is missing.

jerbob92 commented 1 year ago

Can't really say anything without the PDF itself. If the font is not embedded inside the PDF, and the font is also not available on your system, it won't be able to display the characters properly.

hdf15804299051 commented 1 year ago

The example PDF, 1.pdf, is available at https://file-plaso.oss-cn-hangzhou.aliyuncs.com/dev-plaso/infinite_wb/files/test/1.pdf?OSSAccessKeyId=LTAI5tPj9a84te86oRUfAjp8&Expires=1696247568&Signature=UKpTn5DMGljUQ6Mg8JlBqOmKtBQ%3D

jerbob92 commented 1 year ago

That PDF renders fine for me. Can you tell me more in what kind of environment you're running pdfium-cli? It might be that you're missing the font in your environment.

hdf15804299051 commented 1 year ago

Can you send me your render result for 1.pdf? Another pdfium CLI tool I use parses 1.pdf fine, so I have determined that the fonts in my environment are fine.

jerbob92 commented 1 year ago

Why would I share them? It's just jpeg of the PDF, you don't believe it works for me?

Can you tell me more in what kind of environment you're running pdfium-cli?

Please answer this question.

hdf15804299051 commented 1 year ago

I am sure my system does not lack the Song typeface, as this PDF file can be parsed normally using other parsing tools. When I parse it with the pdfium-cli tool you provided, the Chinese text is not parsed properly. Can you send me the image that you rendered successfully? I just want to check whether the Chinese text is missing from the pictures.

jerbob92 commented 1 year ago

They are not. It looks like you are unwilling to provide the information necessary to debug this issue for you, so I will close this issue for now.

hdf15804299051 commented 1 year ago

I believe you, but I don't know how to resolve it.

hdf15804299051 commented 1 year ago

My operating system is Windows 10

jerbob92 commented 1 year ago

Thank you. That is important information to have. The library that we use is compiled to WebAssembly, which behaves like Linux in all environments. This means that it will try to look for the fonts in the same locations as on Linux, which of course don't exist on Windows, hence the issue when fonts are not embedded inside the PDF. I will check to see if there is anything I can do about that in the WebAssembly build.

I am building native binaries in this MR: https://github.com/klippa-app/pdfium-cli/pull/13 These run without WebAssembly, so they will need pdfium to be on your system, but since it's a native Windows build it should work better for you.

hdf15804299051 commented 1 year ago

Thank you for your answer.

jerbob92 commented 1 year ago

OK, so I looked into it. The Windows font mapper in PDFium is pretty complex; it's not something that I can support in the WebAssembly build.

How it currently works is that it searches in the following directories for fonts:

On Windows these map to (if your current working directory is on the C drive):

What I can add is an option that allows you to give font search paths so that you can use any path, or do you know of Windows paths that can be added to the search list where the fonts can be found?
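
To illustrate the drive dependence mentioned above, here is a minimal sketch using Python's standard ntpath module. It only mimics how a rooted, drive-less path such as /usr/share/fonts lands on the drive of the current working directory; the helper name windows_location and the example working directories are assumptions, and the WebAssembly runtime's real path resolution may differ.

```python
import ntpath


def windows_location(linux_style_path: str, working_dir: str) -> str:
    """Resolve a rooted, drive-less path against the drive of working_dir.

    Hypothetical helper for illustration only; it is not part of pdfium-cli.
    """
    # ntpath.join keeps the drive of working_dir when the second path is
    # rooted but has no drive of its own; normpath converts "/" to "\".
    return ntpath.normpath(ntpath.join(working_dir, linux_style_path))


# Example working directories are assumptions.
print(windows_location("/usr/share/fonts", r"C:\Users\me\pdfs"))
print(windows_location("/usr/share/fonts", r"D:\work"))
```

With a working directory on the C drive the path resolves to C:\usr\share\fonts; running from the D drive would point at D:\usr\share\fonts instead.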

hdf15804299051 commented 1 year ago

OK. You can set a directory as the font directory, and I can put the fonts in it.

jerbob92 commented 1 year ago

You can try that already, if you make the folder "C:\usr\share\fonts" on your computer and put the fonts in there, you can see if it works.
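
A minimal sketch of that manual workaround, assuming the fonts you need are in the standard Windows font folder C:\Windows\Fonts and that your working directory is on the C drive. The function name copy_fonts and the list of file extensions are assumptions made for this example; pdfium-cli does not provide them.

```python
import shutil
from pathlib import Path

# Extensions treated as font files in this sketch (an assumption).
FONT_SUFFIXES = {".ttf", ".ttc", ".otf"}


def copy_fonts(source_dir, target_dir):
    """Copy font files from source_dir into target_dir, creating it if needed.

    Returns the names of the copied files in sorted order.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for font in sorted(source.iterdir()):
        if font.is_file() and font.suffix.lower() in FONT_SUFFIXES:
            shutil.copy2(font, target / font.name)
            copied.append(font.name)
    return copied


# Example call (both paths are assumptions; adjust the drive letter to the
# drive of your working directory):
# copy_fonts(r"C:\Windows\Fonts", r"C:\usr\share\fonts")
```

Copying only the fonts your PDFs actually need keeps the folder small, and fewer candidates also makes it easier to see which font ends up being used.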

hdf15804299051 commented 1 year ago

Thank you, I have resolved this problem with your help. Your method is right. Thank you again.

hdf15804299051 commented 1 year ago

I put the fonts in \usr\share\fonts\TTF, and the PDF is now parsed correctly.

jerbob92 commented 1 year ago

Thanks for letting me know! I will see if I can add the default Windows/macOS font paths to the search list in the WebAssembly build so that it works without doing these manual changes.

hdf15804299051 commented 1 year ago

Hello, I am currently encountering a font-related issue. The PDF file is the same one I sent you last time, and its font is SimSun. However, when I parse it, the font in the rendered image becomes LiSu. Strangely, when I deleted the font file /usr/share/fonts/ttf/Lisu.ttf, the rendered image returned to normal and used SimSun again. I would therefore like to ask what rules are used for matching fonts on your end.

The operating system I am currently using is Linux, and my font directory is /usr/share/fonts/ttf. There are only two files in this directory: Lisu.ttf and SimSun.ttf. If this directory does not contain any font files, the Chinese text comes out garbled when parsing PDFs. This is the situation in my current environment.

hdf15804299051 commented 1 year ago

After multiple experiments, the rule for matching Chinese fonts seems to be based on the order of the fonts in the font directory. The rule is not very clear to me: I added several fonts, such as LiSu, Microsoft YaHei, and NotoSansCJKjp Regular, to the font directory. After I deleted LiSu, the rendered font became Microsoft YaHei. After I then deleted Microsoft YaHei, the rendered font became NotoSansCJKjp Regular.

jerbob92 commented 1 year ago

@hdf15804299051 I can't really control that directly, that all happens inside pdfium. It could be that it can't find the actual font and it then replaces it with the first font that has the same char set.

Do you have any control over how these PDFs are generated? If so, I would suggest embedding the font.
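
The fallback guessed at above (no match for the actual font, so the first font with the same char set is used) can be sketched as a toy model. This is not PDFium's real font mapper; the function pick_font, the GB2312 labels, and the placeholder name UnmatchedName are all assumptions made for illustration.

```python
def pick_font(requested_name, charset, installed):
    """Toy model of the guessed fallback, not PDFium's actual algorithm.

    installed is an ordered list of (face_name, set_of_charsets) pairs.
    """
    # First, try to find the font by name.
    for name, _ in installed:
        if name.lower() == requested_name.lower():
            return name
    # Otherwise, take the first installed font that covers the char set.
    for name, charsets in installed:
        if charset in charsets:
            return name
    return None


# Hypothetical font directory contents, in the order they are scanned.
installed = [("LiSu", {"GB2312"}), ("SimSun", {"GB2312"})]

# "UnmatchedName" stands for a font name that the lookup fails to match.
print(pick_font("UnmatchedName", "GB2312", installed))      # LiSu
print(pick_font("UnmatchedName", "GB2312", installed[1:]))  # SimSun
```

Under this model, removing the first candidate simply promotes the next one, which would be consistent with the order-dependent behaviour reported in the experiments above. Embedding the font in the PDF avoids the lookup altogether.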

jerbob92 commented 1 week ago

Closing due to lack of reply.