jdorn / sql-formatter

A lightweight php class for formatting sql statements. Handles automatic indentation and syntax highlighting.
http://jdorn.github.com/sql-formatter/
MIT License
3.89k stars 185 forks source link

Support for HHVM #62

Open h4cc opened 10 years ago

h4cc commented 10 years ago

In case somebody else tries to add support for HHVM, this is the current state:

There is currently a problem with some test with unicode caracters: https://travis-ci.org/h4cc/sql-formatter/jobs/20499499 Cause is the htmlentities function, which does not encode unicode characters with corresponding entities: https://github.com/facebook/hhvm/issues/1875

A solution could be replacing htmlentities with htmlspecialchars. But even there is a small bug, because using ENT_IGNORE will simple skip all non ASCII chars: https://github.com/facebook/hhvm/issues/2071

Example:

<?php
$s = 'èéæó';
var_dump(
    $s,
    htmlspecialchars($s, ENT_COMPAT, 'UTF-8'),
    htmlspecialchars($s, ENT_COMPAT | ENT_IGNORE, 'UTF-8'),
    htmlspecialchars($s, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8')
);

Output:

PHP
string(10) "èéæó"
string(10) "èéæó"
string(10) "èéæó"
string(10) "èéæó"
HHVM
string(10) "èéæó"
string(10) "èéæó"
string(0) ""
string(10) "èéæó"

But this should be solveable by switching to ENT_SUBSTITUTE.

@jdorn What do you think? Waiting for a HHVM fix or using htmlspecialchars with ENT_SUBSTITUTE?

jdorn commented 10 years ago

I'm fine adding a check for ENT_SUBSTITUTE support and preferring that over ENT_IGNORE. There's already a check at https://github.com/jdorn/sql-formatter/blob/master/lib/SqlFormatter.php#L867 so adding another if condition isn't a big deal.

h4cc commented 10 years ago

I am not sure if ENT_SUBSTITUTE will be a usefull parameter. If there is much binary data in the query, it will be encoded instead of skipped.

What about switching to htmlspecialchars? It will not output "Ã" instead of "Ã "what htmlentities does.

jdorn commented 10 years ago

ok, so to fix this, we would need to switch to htmlspecialchars and add an extra if statement to check for ENT_SUBSTITUTE support?

I'm not sure I would do this just for a HHVM bug, which will probably be fixed soon. Would this provide any benefit to normal PHP usage? I don't know too much about character encodings.

h4cc commented 10 years ago

I dont think there is a quick but lasting fix here. HHVMs htmlentites does not return HTML-Entites for UTF8 inputs, because HHVM is based on UTF8, not Latin1. So basically HHVM's htmlentities with UTF8 behaves like the default htmlspecialchars, except for the Bug with the ENT_IGNORE flag.

There is also a hint we should look at here: http://www.php.net/manual/en/function.htmlentities.php

ENT_IGNORE  Silently discard invalid code unit sequences instead of returning an empty string. Using this flag is discouraged as it » may have security implications. 

Summarized, we could leave everything as is and skip HHVM testing. Or we could follow the overall recommondations and stop creating HTML-Entities for UTF8 chars. If we do so, we have to change some some tests and can get HHVM testing.

My opinion is, to switch to 'htmlspecialchars' and release this change as a 1.3.* version.

garyemiller commented 10 years ago

HHVMs htmlentites does not return HTML-Entites for UTF8 inputs, because HHVM is based on UTF8, not Latin1.

Lost me there. Does not matter UTF8 or Latin1, you do not want to emit a naked & in HTML because that is invalid HTML. Ditto < and >.

Per W3C: http://www.w3.org/International/questions/qa-escapes#use

Sure only htmlspecialchars() is sufficient for valid html, but a lot of programmers use htmlentities() to make their HTML more robust and readable, even in a UTF8 environment.

Sadly, since htmlspecialchars() is currently broken in hhvm using it instead of htmlentitities() still leaves brokenness.

Regardless of the arguments pro or con, the goal should be to duplicate existing PHP behavior, not to rethink it.

h4cc commented 10 years ago

In the end it is not critical, because this library works on HHVM as expected. Just the tests will not run exactly the same because of this one different behaviour. Lets see what the future brings.

Lost me there. Does not matter UTF8 or Latin1, you do not want to emit a naked & in HTML because that is invalid HTML. Ditto < and >.

HHVM will create html entites for "<>&" as intended. It will just not create html entites for "Ãà...".

garyemiller commented 10 years ago
In the end it is not critical, because this library works on HHVM as expected. 

Uh, no. I expect it to work as in the PHP documentation. See HHVM bug #2061

HHVM will create html entites for "<>&" as intended. 

Uh, no, see HHVM bug #2064