nikic / PHP-Parser

A PHP parser written in PHP
BSD 3-Clause "New" or "Revised" License

Hex chars greater than \x7f silently abort parsing #550

Open nicolasrod opened 5 years ago

nicolasrod commented 5 years ago

Hi. I ran into an issue with hex chars in a double-quoted string. If I have a piece of code like the following:

<?php 
$a = "\x6f";

I get the following result:

[{"nodeType":"Stmt_Expression","expr":{"nodeType":"Expr_Assign","var":{"nodeType":"Expr_Variable","name":"a","attributes":{"startLine":2,"endLine":2}},"expr":{"nodeType":"Scalar_String","value":"o","attributes":{"startLine":2,"endLine":2,"kind":2}},"attributes":{"startLine":2,"endLine":2}},"attributes":{"startLine":2,"endLine":2}}]

But if the variable holds a value greater than \x7f, I get an empty array as a result and no error. Any ideas? Thank you!

nikic commented 5 years ago

The problem here is probably in the JSON encoding. JSON only allows valid UTF-8 in strings, and a lone byte greater than \x7f (such as \x80) is not a valid UTF-8 sequence.
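The encoding failure can be reproduced without the parser. The snippet below is a minimal sketch using only PHP's built-in `json_encode`; the variable names are illustrative, and the `JSON_THROW_ON_ERROR` part assumes PHP 7.3 or later.

```php
<?php
// Standalone reproduction: PHP-Parser is not involved here.
$ascii  = "\x6f"; // "o": a single byte below 0x80 is valid UTF-8
$binary = "\x80"; // a lone byte above 0x7f is not valid UTF-8

$asciiJson  = json_encode($ascii);  // '"o"'
$binaryJson = json_encode($binary); // false, and no exception by default
$errorCode  = json_last_error();    // JSON_ERROR_UTF8

// Print the human-readable reason for the failure.
echo json_last_error_msg(), "\n";

// With JSON_THROW_ON_ERROR the failure is no longer silent.
try {
    json_encode($binary, JSON_THROW_ON_ERROR);
    $threw = false;
} catch (JsonException $e) {
    $threw = true;
}
```

Because `json_encode` returns `false` instead of raising an error by default, a caller that does not check the return value or `json_last_error()` would see empty output rather than a diagnostic.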

performantdata commented 5 years ago

@nikic I don't understand your answer here. A string in PHP is an array of bytes, so any valid byte values are allowed. The problem is that you're representing it as a string in JSON, instead of as an array of numbers.

tiyeuse commented 5 years ago

Any update on this issue?

nikic commented 5 years ago

Nope. Any suggestions on what to do about this?

zhaoyanliang2 commented 5 years ago

Before converting ast to json, iterate through all nodes and encode the variable containing the illegal utf-8 string using base64_encode.

performantdata commented 5 years ago

> Any suggestions on what to do about this?

> The problem is that you're representing it as a string in JSON, instead of as an array of numbers.

So represent it as that. A PHP string is not an array of Unicode characters, it's just an array of bytes.

> This nature of the string type explains why there is no separate “byte” type in PHP – strings take this role.

So stop trying to convert an arbitrary sequence of bytes into UTF-8.
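The array-of-numbers representation proposed here could look roughly like the following. This is only a sketch: `bytesToInts` and `intsToBytes` are hypothetical helper names, not part of PHP-Parser, and the thread does not say how such a representation would be wired into the JSON output.

```php
<?php
// Hypothetical helpers for illustration; not part of PHP-Parser.

// Convert a PHP byte string into a list of integers in the range 0..255.
function bytesToInts(string $bytes): array
{
    // unpack('C*') yields a 1-indexed array, so re-index it from 0.
    return $bytes === '' ? [] : array_values(unpack('C*', $bytes));
}

// Convert a list of integers back into the original byte string.
function intsToBytes(array $ints): string
{
    return implode('', array_map('chr', $ints));
}

$original = "\x6f\x80\xff";                    // contains invalid UTF-8
$json     = json_encode(bytesToInts($original)); // '[111,128,255]'
$restored = intsToBytes(json_decode($json, true));
```

The round trip is lossless for any byte sequence, at the cost of a larger and less readable JSON document for ordinary text.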

nikic commented 5 years ago

> Before converting ast to json, iterate through all nodes and encode the variable containing the illegal utf-8 string using base64_encode.

That sounds reasonable. We can add two extra visitors for encoding/decoding all strings in base64. It's unfortunate that this is necessary, but I don't really see a way around it.
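As a rough illustration of the base64 idea, the sketch below walks a plain nested array shaped like the JSON dump earlier in the thread, rather than real PHP-Parser nodes and visitors. The function names and the `__base64` marker key are assumptions made for this example, and unlike the "all strings" variant mentioned above it only encodes strings that are not valid UTF-8.

```php
<?php
// Hypothetical helpers operating on plain arrays; the actual fix would
// presumably be written as PHP-Parser node visitors instead.

// Replace every string that is not valid UTF-8 with a base64 wrapper.
function encodeBinaryStrings($value)
{
    if (is_array($value)) {
        // array_map with a single array preserves its keys.
        return array_map('encodeBinaryStrings', $value);
    }
    // preg_match with the /u modifier fails on invalid UTF-8 subjects.
    if (is_string($value) && preg_match('//u', $value) !== 1) {
        return ['__base64' => base64_encode($value)];
    }
    return $value;
}

// Reverse the transformation after json_decode(..., true).
function decodeBinaryStrings($value)
{
    if (!is_array($value)) {
        return $value;
    }
    if (count($value) === 1 && isset($value['__base64'])
        && is_string($value['__base64'])) {
        return base64_decode($value['__base64'], true);
    }
    return array_map('decodeBinaryStrings', $value);
}

$ast = [
    'nodeType'   => 'Scalar_String',
    'value'      => "\x80abc", // not valid UTF-8
    'attributes' => ['startLine' => 2, 'endLine' => 2],
];

$json     = json_encode(encodeBinaryStrings($ast));
$restored = decodeBinaryStrings(json_decode($json, true));
```

Wrapping only the invalid strings keeps the common case readable, but a consumer then has to recognize the marker; encoding every string uniformly avoids that ambiguity.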

tiyeuse commented 5 years ago

Bump on this error :smiley: Will a fix be deployed?