Open h4cc opened 10 years ago
I'm fine adding a check for ENT_SUBSTITUTE
support and preferring that over ENT_IGNORE
. There's already a check at https://github.com/jdorn/sql-formatter/blob/master/lib/SqlFormatter.php#L867 so adding another if condition isn't a big deal.
I am not sure if ENT_SUBSTITUTE will be a usefull parameter. If there is much binary data in the query, it will be encoded instead of skipped.
What about switching to htmlspecialchars? It will not output "Ã" instead of "Ã "what htmlentities does.
ok, so to fix this, we would need to switch to htmlspecialchars and add an extra if statement to check for ENT_SUBSTITUTE support?
I'm not sure I would do this just for a HHVM bug, which will probably be fixed soon. Would this provide any benefit to normal PHP usage? I don't know too much about character encodings.
I dont think there is a quick but lasting fix here. HHVMs htmlentites does not return HTML-Entites for UTF8 inputs, because HHVM is based on UTF8, not Latin1. So basically HHVM's htmlentities with UTF8 behaves like the default htmlspecialchars, except for the Bug with the ENT_IGNORE flag.
There is also a hint we should look at here: http://www.php.net/manual/en/function.htmlentities.php
ENT_IGNORE Silently discard invalid code unit sequences instead of returning an empty string. Using this flag is discouraged as it » may have security implications.
Summarized, we could leave everything as is and skip HHVM testing. Or we could follow the overall recommondations and stop creating HTML-Entities for UTF8 chars. If we do so, we have to change some some tests and can get HHVM testing.
My opinion is, to switch to 'htmlspecialchars' and release this change as a 1.3.* version.
HHVMs htmlentites does not return HTML-Entites for UTF8 inputs, because HHVM is based on UTF8, not Latin1.
Lost me there. Does not matter UTF8 or Latin1, you do not want to emit a naked & in HTML because that is invalid HTML. Ditto < and >.
Per W3C: http://www.w3.org/International/questions/qa-escapes#use
Sure only htmlspecialchars() is sufficient for valid html, but a lot of programmers use htmlentities() to make their HTML more robust and readable, even in a UTF8 environment.
Sadly, since htmlspecialchars() is currently broken in hhvm using it instead of htmlentitities() still leaves brokenness.
Regardless of the arguments pro or con, the goal should be to duplicate existing PHP behavior, not to rethink it.
In the end it is not critical, because this library works on HHVM as expected. Just the tests will not run exactly the same because of this one different behaviour. Lets see what the future brings.
Lost me there. Does not matter UTF8 or Latin1, you do not want to emit a naked & in HTML because that is invalid HTML. Ditto < and >.
HHVM will create html entites for "<>&" as intended. It will just not create html entites for "Ãà...".
In the end it is not critical, because this library works on HHVM as expected.
Uh, no. I expect it to work as in the PHP documentation. See HHVM bug #2061
HHVM will create html entites for "<>&" as intended.
Uh, no, see HHVM bug #2064
In case somebody else tries to add support for HHVM, this is the current state:
There is currently a problem with some test with unicode caracters: https://travis-ci.org/h4cc/sql-formatter/jobs/20499499 Cause is the htmlentities function, which does not encode unicode characters with corresponding entities: https://github.com/facebook/hhvm/issues/1875
A solution could be replacing htmlentities with htmlspecialchars. But even there is a small bug, because using ENT_IGNORE will simple skip all non ASCII chars: https://github.com/facebook/hhvm/issues/2071
Example:
Output:
But this should be solveable by switching to ENT_SUBSTITUTE.
@jdorn What do you think? Waiting for a HHVM fix or using htmlspecialchars with ENT_SUBSTITUTE?