mikehaertl / phpwkhtmltopdf

A slim PHP wrapper around wkhtmltopdf with an easy to use and clean OOP interface
MIT License

html entities for UTF8 characters in replace option #331

Closed ducciodiblasi closed 5 years ago

ducciodiblasi commented 5 years ago

Hi, I've just resolved an issue with the --replace option in your wrapper. When I passed a special character (for example è) in the replace parameter, it was passed incorrectly to the footer, so when I converted it to uppercase with the CSS directive text-transform: uppercase;, the character disappeared. Since wkhtmltopdf is just a browser under the hood, you have to convert special UTF8 characters in the replace parameter to HTML entities if you want them to be passed correctly as POST when it makes the header and footer calls. I have derived a class from your Pdf class and overridden the following function:

public function setOptions($options = array()){
    if (isset($options['replace'])) {
        foreach ($options['replace'] as $key => $val) {
            $options['replace'][$key] = htmlentities($val);
        }
    }
    parent::setOptions($options);
}

Hope this helps. Regards

mikehaertl commented 5 years ago

Hmm, are you sure you've set the correct encoding? You may also have to pass some procEnv setting (see README). And does this also apply to the main HTML content of your page?

AFAIK htmlentities() cannot convert every UTF8 character, only those that have an HTML entity. So I'm hesitant to add this. I actually think it should work without htmlentities() if everything is set to UTF8.

ducciodiblasi commented 5 years ago

I printed out the $_REQUEST variable (the header is a header.php, not a header.html, so I can call file_put_contents()), and it came out without the è character. That's why I thought the issue was in the parameter itself: it is already corrupted when header.php reads it, so it is not a matter of charset (file_put_contents() runs before the page is rendered). From my tests, the character is intact until wkhtmltopdf is called. wkhtmltopdf in turn calls the header and footer, passing them the replace parameters as POST (or GET?), and maybe that is where the character gets corrupted.
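One way to narrow this down is to log the raw bytes that header.php receives instead of the decoded string, so that a missing or mangled è becomes visible. The following is only a sketch: `describeBytes()` is a hypothetical helper (not part of the library or of wkhtmltopdf), and the log path in the comment is a placeholder.

```php
<?php
// Hypothetical debugging helper: shows the raw bytes of a received value
// as hex and reports whether they form valid UTF-8.
function describeBytes(string $value): string
{
    $hex = implode(' ', str_split(bin2hex($value), 2));
    // preg_match() with the /u modifier fails on invalid UTF-8 input.
    $valid = preg_match('//u', $value) === 1 ? 'valid UTF-8' : 'NOT valid UTF-8';
    return sprintf('%s (%s)', $hex, $valid);
}

// In header.php this could be logged for every string parameter, e.g.:
// foreach ($_REQUEST as $name => $value) {
//     file_put_contents('/tmp/header-debug.log',
//         $name . ': ' . describeBytes((string) $value) . "\n", FILE_APPEND);
// }
```

A correctly transmitted è should show up as the two bytes `c3 a8`; a single `e8` would point to a Latin-1 conversion somewhere along the way.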

You are right, htmlentities is not the best solution, since it is not complete and does not deal with all cases (for example, I had to add an nl2br after htmlentities to deal with new lines), and I would be hesitant too. Maybe the best approach is to extend your class to suit one's own needs, as I did.
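For reference, the workaround described here (htmlentities followed by nl2br) could look roughly like this as a standalone function. `encodeReplaceValues()` is a hypothetical name, and the sketch assumes the input strings are UTF-8.

```php
<?php
// Hypothetical helper: entity-encodes each replace value, then turns
// newlines into <br /> tags. Assumes the values are UTF-8 strings.
function encodeReplaceValues(array $replace): array
{
    $encoded = array();
    foreach ($replace as $key => $val) {
        $entities = htmlentities((string) $val, ENT_QUOTES, 'UTF-8');
        $encoded[$key] = nl2br($entities);
    }
    return $encoded;
}
```

Characters that have no named entity in the table htmlentities() uses are left unchanged, so this only covers part of the problem, as noted above.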

I don't know all the conversion functions that PHP gives us; maybe there is something out there. I also tried urlencode, since they are HTTP requests, with no luck, but maybe I made some mistakes during the tests, as I was in a hurry...

Just be aware of the issue (which seems to be in wkhtmltopdf rather than in your class).

regards

mikehaertl commented 5 years ago

I'm a bit confused: Our library has nothing to do at all with REQUEST, GET or POST parameters. It just creates a lengthy shell command string that executes wkhtmltopdf on the command line. As --replace params are passed on the command line, it's important that the shell environment can work with UTF8.
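To illustrate the idea of assembling a command string, here is a simplified sketch. It is not the library's actual implementation; `buildCommand()` is a hypothetical function and the file names are placeholders.

```php
<?php
// Illustrative only: a simplified stand-in for how a wrapper might assemble
// the wkhtmltopdf command line. Not the library's real code.
function buildCommand(string $input, string $output, array $replace): string
{
    $parts = array('wkhtmltopdf');
    foreach ($replace as $name => $value) {
        $parts[] = '--replace';
        $parts[] = escapeshellarg((string) $name);
        $parts[] = escapeshellarg((string) $value);
    }
    $parts[] = escapeshellarg($input);
    $parts[] = escapeshellarg($output);
    return implode(' ', $parts);
}

echo buildCommand('index.html', 'out.pdf', array('x' => 'abc')), "\n";
```

One thing worth checking in a setup like this: PHP's escapeshellarg() can drop multibyte characters when the PHP process itself is not running under a UTF-8 locale, so the character may already be gone before the command is executed.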

Some (simple!) code example to reproduce the issue would be helpful.

ducciodiblasi commented 5 years ago

I can understand the confusion; my config is a bit complicated. My header is not a plain HTML file with JS. It is a PHP script that applies an XSL transformation to an XML file stored on the server filesystem. When I call your library I pass the header-html parameter as something like http://url/to/header.php, and when your library runs the shell command, wkhtmltopdf calls that URL, passing it the replace parameters via POST (or GET, I don't remember).

You're right, your library has nothing to do at all with REQUEST, GET or POST; that's why I said that the issue seems to be in wkhtmltopdf rather than in your class.

What do you mean by "it's important that the shell environment can work with UTF8"? What should be done to be sure of that?

Just for study I'll try to reproduce the issue in a shell command. If I succeed, I'll tell you.

Regards

mikehaertl commented 5 years ago

Here's the part from the README:

$pdf = new Pdf(array(
    'commandOptions' => array(
        'procEnv' => array(
            // Check the output of 'locale -a' on your system to find supported languages
            'LANG' => 'en_US.utf-8',
        ),
    ),
));

I'm also not really an expert on this, but when a program is started it checks these environment variables. As I understand it, the LANG variable tells the program what character encoding is used, e.g. for shell arguments. You can use locale -a on your system (hopefully Linux?) to get a list of all values you can set there. With env | grep LANG you can find out if you're currently using a UTF8 locale. For me (Germany) it's LANG=de_DE.UTF-8.
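The same check can be done from inside PHP, which is useful because a web server process often has a different environment than an interactive shell. A minimal sketch; `isUtf8Locale()` is a hypothetical helper that only inspects the locale name.

```php
<?php
// Hypothetical helper: returns true if a locale name such as "de_DE.UTF-8"
// declares a UTF-8 codeset.
function isUtf8Locale(string $locale): bool
{
    // Matches ".UTF-8", ".utf8", etc., optionally followed by a modifier
    // such as "@euro".
    return preg_match('/\.utf-?8(@.*)?$/i', $locale) === 1;
}

$lang = getenv('LANG');
if ($lang === false || !isUtf8Locale($lang)) {
    echo "LANG is not a UTF-8 locale; consider setting it via procEnv\n";
}
```

If the message is printed when the script runs under the web server, that would suggest the procEnv setting is needed there even if the shell already reports a UTF8 locale.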

It would be interesting what happens if you call wkhtmltopdf with some UTF8 character on the command line, for example:

wkhtmltopdf \
    --header-center 'test [x]' \
    --replace 'x' 'è' \
    index.html out.pdf

If this works then you should be able to pass the right LANG in procEnv as shown above.

Oh, and to take out some complexity, start with a simple script where you focus on the PDF part. Just pass the values directly instead of some nifty XSL transformation and whatever - otherwise you'll go insane trying to find which part of your setup is really causing the issue.
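Such a stripped-down script might look like the sketch below. It assumes the library is installed via Composer and that the header text uses the `[x]` placeholder form that wkhtmltopdf documents for --replace. `buildTestOptions()` is a hypothetical helper, and index.html and out.pdf are placeholder file names.

```php
<?php
// Builds the options for a minimal reproduction. The option names mirror
// the wkhtmltopdf command-line flags; the values are passed in directly.
function buildTestOptions(string $lang, string $value): array
{
    return array(
        'header-center' => 'test [x]',
        'replace' => array('x' => $value),
        'commandOptions' => array(
            'procEnv' => array('LANG' => $lang),
        ),
    );
}

$options = buildTestOptions('en_US.utf-8', 'è');

// With the library installed, the options could then be used like this:
// require 'vendor/autoload.php';
// $pdf = new mikehaertl\wkhtmlto\Pdf($options);
// $pdf->addPage('index.html');
// if (!$pdf->saveAs('out.pdf')) {
//     echo $pdf->getError();
// }
```

Keeping the option building separate from the PDF call makes it easy to try different LANG values and test strings without touching anything else.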

mikehaertl commented 5 years ago

I'm pretty sure that with the right locale settings everything works.

Note to self: Add pdfbox to the test environment so we can extract texts from created PDF files and verify that UTF-8 works.

mikehaertl commented 5 years ago

Ok, I couldn't make it work either. But this seems to be a bug in wkhtmltopdf itself: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2427

Here's a simple test:

wkhtmltopdf --header-left 'äö [x]' --replace 'x' 'üüü' index.html out.pdf

While the äö work fine, the üüü in the replace argument don't.