php / php-src

The PHP Interpreter
https://www.php.net
Other
38.19k stars 7.75k forks source link

Enhance `zend_dump_op_array` to Properly Represent Non-Printable Characters in String Literals #15680

Open WangYihang opened 2 months ago

WangYihang commented 2 months ago

Description

Description:

While working with the zend_dump_op_array function in PHP's Zend Engine, I noticed that non-printable characters in string literals are not always represented in their escaped (readable) form. Currently, the function escapes ^1 only a limited set of characters by calling:

zend_string *escaped_string = php_addcslashes(Z_STR_P(zv), "\"\\", 2);

This can result in the direct output of characters such as newlines, tabs, and null bytes, which might make the dump less clear and harder to debug.

For example, for the following php code:

<?php

// irrelevant code omited

fwrite($buffer, "--\n");

Current Output:

0040 INIT_FCALL 2 112 string("fwrite")
0041 SEND_VAR CV6($buffer) 1
0042 SEND_VAL string("--
") 2                                     <------ new line (\n) is directly printed
0043 DO_ICALL

Expected Output:

I believe that modifying the function to escape all non-printable characters would enhance the clarity of the output. This could be achieved by updating the call to:

php_addcslashes(Z_STR_P(zv), "\\\"\n\t\0", 5);

With this change, the output might look like:

0040 INIT_FCALL 2 112 string("fwrite")
0041 SEND_VAR CV6($buffer) 1
0042 SEND_VAL string("--\n") 2           <------ properly escaped with readable representations
0043 DO_ICALL

Proposal:

I propose a small enhancement to the zend_dump_op_array function to ensure that all non-printable characters are properly represented in the output. This adjustment would improve readability and make debugging easier, without significantly altering the existing functionality.

Benefits:

Enhanced Readability: Ensures that non-printable characters are clearly represented, making the output easier to understand. Improved Debugging: Developers can quickly identify and interpret all characters within string literals. Minor Change: The proposed adjustment is a minimal change to the existing codebase but offers significant benefits.

I would greatly appreciate feedback from the community on this suggestion.

References

WangYihang commented 2 months ago

Additionally, I believe it would improve debugging if all non-printable characters were consistently marked for escaping.

If there is agreement, I would be happy to submit a pull request to implement this improvement.

devnexen commented 2 months ago

Additionally, I believe it would improve debugging if all non-printable characters were consistently marked for escaping.

If there is agreement, I would be happy to submit a pull request to implement this improvement.

I'd say go ahead and submit a PR :)

WangYihang commented 2 months ago

It seems there’s already a related pull request [1-2] that attempted to address this issue. However, using php_addcslashes to quote all non-printable characters isn’t ideal, as it requires a long string to specify the characters to be quoted.

I propose two potential solutions:

@devnexen Which of these approaches do you think would be more suitable, or is there another direction you would recommend for implementing this functionality? @SerafimArts @dstogov

WangYihang commented 2 months ago

Implemented and passed tests in https://github.com/php/php-src/pull/15730 (Using Solution I), please kindly review, thanks for your time.

SerafimArts commented 2 months ago

Which of these approaches do you think would be more suitable, or is there another direction you would recommend for implementing this functionality?

@WangYihang My pull request was aimed at providing the ability to programmatically read such output without parsing mistakes/errors.

I implement the lang server plugin, which showed the bytecode that PHP generates next to the code of functions and classes, which allows you to simply brute-force optimize time critical sections of code right in the IDE (since the opcode result is immediately visible, including optimization steps). And I simply couldn't find any other simple solution other than parsing this output)))

That is, at that time it was not critical to me whether there were line breaks or not, since this does not interfere with the programmatic parsing of the opcode output. The main thing is that it starts with a quotation mark and ends with one.