Closed: rotdrop closed this issue 2 years ago
I stumbled over the issue that I can set bookmarks containing non-ASCII characters on the command line using pdftk.
Can you give an example of how you do that? Usually we write any data to temporary files and take care of correct encoding there, so this should have nothing to do with proc_open(). The only issues there arise when you have UTF-8 characters in the PDF filename, but this can be fixed by setting the correct locale (see https://github.com/mikehaertl/php-pdftk/issues/286#issuecomment-1139535638).
It actually has nothing to do with proc_open() itself, apart from the fact that proc_open() seems to start with an empty environment. The problem is that pdftk fails to parse UTF-8 input correctly if the `LC_...` or `LANG` variables do not point to a UTF-8 capable locale. You can see this if you try to add a bookmark with a non-ASCII title, e.g.,
BookmarkBegin
BookmarkTitle: äää
...
and feed it from the command line into pdftk with a "clean" environment like so:
$ env -i pdftk A=INPUT.pdf update_info_utf8 MY_INFO_FILE output OUTPUT.pdf
The resulting OUTPUT.pdf will show the bookmark with replacement characters.
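To see what `env -i` does to the locale settings without involving pdftk at all, here is a minimal shell sketch. The helper name `show_locale` is made up for illustration, and the sketch assumes `/bin/sh` exists on the machine. It only prints the variables a child process receives; it says nothing about how pdftk reacts to them.

```shell
#!/bin/sh
# Print the locale variables that a child process actually sees.
# "$@" is the command prefix used to launch the child, e.g. `env -i`.
show_locale() {
    "$@" /bin/sh -c 'echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"'
}

# Child started with a completely empty environment.
clean=$(show_locale env -i)

# Child started with an empty environment plus one locale variable.
utf8=$(show_locale env -i LC_ALL=C.UTF-8)

echo "clean environment: $clean"
echo "with LC_ALL set:   $utf8"
```

In the clean case both variables are reported as unset, which is the situation in which the bookmark title above ends up as replacement characters.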
This is of course more of a pdftk issue, but this is the way it is. The suggestion would be to add LC_ALL=C.UTF-8 and/or LANG=C.UTF-8 to the environment passed to proc_open().
Ah,
> Can you give an example how you do that? Usually we write any data to temporary files and take care of correct encoding there. So this should have nothing to do with proc_open(). The only issues there are when you have UTF-8 characters in the PDF filename. But this can be fixed by setting the correct locale (see #286 (comment)).
No, unfortunately not. It's not only that pdftk does not find filenames; it also does not interpret non-ASCII characters correctly when they are used as, e.g., bookmark titles. I did not try, but I suspect the same holds for the Info variables.
> The suggestion would be to add LC_ALL=C.UTF-8 and/or LANG=C.UTF-8 to the environment of proc_open().
You can set this. Please check the link I provided. We can't do this automatically, as we cannot assume that any particular locale is available on every machine.
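Since no particular locale can be assumed, one option is to check what is installed before choosing a value. A rough shell sketch, assuming a `locale` command that supports `-a` (common on glibc systems, not guaranteed elsewhere); `pick_utf8_locale` is a hypothetical helper name:

```shell
#!/bin/sh
# Read locale names (one per line) on stdin and print the first one whose
# codeset looks like UTF-8, e.g. "C.UTF-8" or "en_US.utf8".
# Prints nothing if there is no match.
pick_utf8_locale() {
    grep -i -E '\.utf-?8$' | head -n 1
}

found=""
if command -v locale >/dev/null 2>&1; then
    # `locale -a` lists the locales installed on this machine.
    found=$(locale -a 2>/dev/null | pick_utf8_locale)
fi

echo "UTF-8 locale found: ${found:-none}"
```

Whatever name this prints is a candidate for the locale setting described in the linked comment; if it prints `none`, a UTF-8 locale has to be installed first.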
I've added a note on how to configure the locale to the README.
I stumbled over the issue that I can set bookmarks containing non-ASCII characters on the command line using pdftk. However, using pdftk through this php-pdftk package garbles any non-ASCII character in bookmarks and other meta-info.
This is seemingly caused by using PHP's proc_open(). It seems that proc_open() does not set any LANG or LC_ environment variable, and this makes pdftk fail to recognize UTF-8 characters in the input to update_info_utf8. This may be considered a bug in pdftk, as the operation name 'update_info_utf8' suggests that this should work.
It could also be that PHP does something nasty here.
Just for reference. Maybe one could add an option to php-pdftk to set the LC_ environment variables, or just use en_US.UTF-8, to fix the issue.
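The idea of forcing a UTF-8 locale only for the pdftk call can be sketched as a small shell wrapper. This is only an illustration: `with_utf8_locale` and the `PDFTK_LOCALE` variable are hypothetical names, not part of pdftk or php-pdftk, and `C.UTF-8` is assumed to be installed, which is not true on every system.

```shell
#!/bin/sh
# Run a command with a UTF-8 locale, but only override the locale if the
# current one does not already look like UTF-8.
with_utf8_locale() {
    # For character handling, LC_ALL beats LC_CTYPE, which beats LANG.
    current="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
    case "$current" in
        *[Uu][Tt][Ff]-8* | *[Uu][Tt][Ff]8*)
            # Already UTF-8: run the command unchanged.
            "$@" ;;
        *)
            # Otherwise set LC_ALL for this one command only.
            # PDFTK_LOCALE is a hypothetical override; C.UTF-8 is the assumed default.
            env LC_ALL="${PDFTK_LOCALE:-C.UTF-8}" "$@" ;;
    esac
}

# Example: inspect what a child process would see.
with_utf8_locale /bin/sh -c 'echo "child sees LC_ALL=${LC_ALL:-unset}"'
```

With pdftk installed, the command line from the discussion would become `with_utf8_locale pdftk A=INPUT.pdf update_info_utf8 MY_INFO_FILE output OUTPUT.pdf`.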
This is on the master branch of php-pdftk and with