gleam-lang / stdlib

🎁 Gleam's standard library
https://hexdocs.pm/gleam_stdlib/
Apache License 2.0

Prevent control codes actioning in string.inspect #602

Closed. Michael-Mark-Edu closed this 5 months ago

Michael-Mark-Edu commented 6 months ago

Fixes #600, Fixes #607

Unfortunately, Gleam doesn't recognize \b or \v as valid escape codes, which I believe makes it impossible to create a test for these changes. I'll likely open a new issue about making those valid escapes.

It should also be noted that some control codes are "invisible" when printed but still count for string equality, resulting in the amusing bug of "abc123" != "abc123" because of null characters in one of the strings. However, this is arguably intentional behavior, and I'll let a maintainer decide if anything should be done about that.
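For anyone following along, here is a minimal sketch of that confusion. The inspected output shown in the comments is an assumption about the behaviour this PR introduces, not a quote of the implementation.

```gleam
import gleam/io
import gleam/string

pub fn main() {
  // Both strings render identically in a terminal, but the second one
  // contains a NUL character (U+0000) between "abc" and "123".
  let plain = "abc123"
  let hidden = "abc\u{0000}123"

  io.debug(plain == hidden)
  // -> False, which shows up in test output as "abc123" != "abc123".

  io.println(string.inspect(hidden))
  // With the escaping from this PR, the control code should become
  // visible, e.g. something like "abc\u{0000}123" wrapped in quotes.
}
```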

mooreryan commented 6 months ago

Erlang also recognizes \e (ASCII code 27); should that also be added? (And potentially bell, \a, ASCII code 7? Maybe not, since Erlang doesn't use that one, but some languages do.)

Michael-Mark-Edu commented 6 months ago

Oh right, \e. I tested it and found that it was actioning, but I forgot to patch it in the commit. Going to add a new commit now.

Michael-Mark-Edu commented 6 months ago

@lpil alright, tests have been added (using \u) and they are passing. You can try to comment out the lines added to gleam_stdlib.erl and the tests will fail.

Michael-Mark-Edu commented 5 months ago

I added code that generalizes every code point below 32 that isn't already explicitly on the list. I also changed it to use \u{xxxx} syntax.
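A rough sketch of the intended behaviour, to make the distinction concrete (the exact padding and case of the hex digits here are assumptions, not a quote of the implementation):

```gleam
import gleam/io
import gleam/string

pub fn main() {
  // "\n" already has a dedicated escape, so inspect keeps it as \n.
  io.println(string.inspect("line\nbreak"))
  // e.g. "line\nbreak" (wrapped in quotes, with the \n shown escaped)

  // U+0001 has no dedicated escape, so it falls back to \u{xxxx}.
  io.println(string.inspect("a\u{0001}b"))
  // e.g. "a\u{0001}b" (wrapped in quotes)
}
```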

mooreryan commented 5 months ago

I'm not necessarily saying that characters from [0, 31] should not be escaped in this way, but I do think it becomes a bit confusing why other control characters (e.g. some of these) are not escaped in the same way.

The previous behavior could be explained by saying "we only escape control characters that have valid Gleam escape syntax", or something like that, and I suppose the new behavior could be explained by stating that ASCII control characters are now also escaped, but it still raises the question of why the other control characters are not. (See the comment here: https://github.com/gleam-lang/stdlib/issues/600#issuecomment-2130343645.)

Regardless, I think it bears discussing at least.

Michael-Mark-Edu commented 5 months ago

Does UTF-8 have escape characters outside of the 7-bit ASCII range? I've been operating on the assumption that everything after 127 is Unicode and that no escape characters exist inside the Unicode set.

I'm looking at UTF-8 character sets right now, and it seems 0x80-0x9F may be escape codes that I didn't realize existed. If they are, then they should probably be added to the \u list.

Michael-Mark-Edu commented 5 months ago

Alright, it uses io_lib:format now. Also, control characters 128-159 are now being handled.
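As a behavioural illustration of that second range (the exact output formatting is an assumption): U+0085 (NEL) sits inside 0x80-0x9F, so it should now be escaped rather than printed as an invisible character.

```gleam
import gleam/io
import gleam/string

pub fn main() {
  // U+0085 (NEL) is a C1 control character in the 0x80-0x9F block.
  io.println(string.inspect("a\u{0085}b"))
  // Expected with this change: something like "a\u{0085}b" (wrapped in
  // quotes) instead of an invisible character in the middle.
}
```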

mooreryan commented 5 months ago

Does UTF-8 have escape characters outside of the 7-bit ASCII range? I've been operating on the assumption that everything after 127 is Unicode and that no escape characters exist inside the Unicode set.

I'm looking at UTF-8 character sets right now, and it seems 0x80-0x9F may be escape codes that I didn't realize existed. If they are, then they should probably be added to the \u list.

I'm not sure if they are generally "escaped" by languages (I think not in the sense of \n or something), but they would probably be better represented as \u{XXXX}.

Michael-Mark-Edu commented 5 months ago

I'm not sure if they are generally "escaped" by languages (I think not in the sense of \n or something), but they would probably be better represented as \u{XXXX}.

I think they should. According to the website you linked, the codes between 0-31 and 127-159 consist of exactly all the control codes in UTF-8. I haven't tested them, but I presume the second set of codes has the same effects as the first (invisible graphemes, actioning) and so should be handled in the same way too.

mooreryan commented 5 months ago

I think they should.

I think we are saying the same things, but using different terms.

For example, I mean that the Unicode codepoint U+000D can be represented by the escape sequence \u{000D} as well as by \r. When you mentioned "Does UTF-8 have escape characters outside of the 7-bit ASCII range?", I thought you were asking whether there exist other characters, outside of the range 0-127, that are commonly represented by the "classic" C-style escape sequences like \r, \t, etc. To that, I'm not sure, but there are definitely control characters that would most naturally be represented with the \u{XXXX} style escape sequences. (Whether that belongs in this PR is potentially a separate question.)
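To make the terminology concrete (a trivial check, not part of the PR's changes): both spellings denote the same code point, so the strings compare equal.

```gleam
import gleam/io

pub fn main() {
  // \u{000D} and \r are two escape spellings of the same code point,
  // U+000D (carriage return), so the resulting strings are equal.
  io.debug("\u{000D}" == "\r")
  // -> True
}
```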

The other thing I'm interested in is the fact that there are many more "invisible" characters way outside the normal ASCII range (e.g. the en space). Should these be printed as-is, or should they also be translated to escape sequences? And what about the example of the "e" with accent? (There are many like that as well.) What I'm not sure of is how far to go, so to speak, with the escaping, either in this PR, in future PRs, or maybe beyond the scope of the stdlib in a separate library.

Michael-Mark-Edu commented 5 months ago

When you mentioned "Does UTF-8 have escape characters outside of the 7-bit ASCII range?", I thought you were asking whether there exist other characters, outside of the range 0-127, that are commonly represented by the "classic" C-style escape sequences like \r, \t, etc.

I was referring to the mere presence of control characters outside of the 7-bit range. I wasn't thinking about whether any of them are common enough to have a C-style escape sequence (according to Wikipedia, none of them do).

The other thing I'm interested in is the fact that there are many more "invisible" characters way outside the normal ASCII range (e.g. the en space).

As of now, I'd argue for keeping them as-is, because the UTF-8/Unicode standard defines those characters to be the way they are and to be distinct, even if I personally disagree with the standard. This is unlike the invisible graphemes, where it's impossible to detect them without viewing the binary data.

Potentially, there could be a configuration flag for gleeunit to "ASCII-ify" debug output and escape all characters above 127. It would have to be opt-in because it limits functionality, but it could still be useful. What this flag would actually be (a CLI option? a gleeunit.main() parameter? a gleam.toml setting?) isn't something I'm sure of, but I think it's an idea worth considering.

Michael-Mark-Edu commented 5 months ago

Could we please get a workflow approval for this PR? I want to see green. Never mind, the JS test failed.

lpil commented 5 months ago

Looks like you have a failing test when running on the JavaScript target.

Michael-Mark-Edu commented 5 months ago

Looks like the JavaScript target handles escape codes differently from Erlang. This seems to arise from the additional tests I added, as the only modifications I made to the original source were to the .erl file.

The specific failure of the JavaScript test was "\"\\b\"" should equal "\"\\u{0008}\"", so it seems the JavaScript target is automatically converting \u{0008} into \b. \b doesn't appear anywhere in the test file, so this must be JavaScript's doing. Since \b is not valid Gleam syntax, this seems like an unrelated bug in the JavaScript target.

Since this PR was originally meant to fix a bug with the Erlang target, I am unsure if patching this JavaScript bug is within the scope of this PR.
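For reference, the failing assertion has roughly this shape (the test name is illustrative, not the literal one in the suite): it passes on Erlang, but on the JavaScript target string.inspect currently yields "\b" instead.

```gleam
import gleam/string
import gleeunit/should

// Illustrative reconstruction of the failing case described above.
pub fn inspect_backspace_test() {
  string.inspect("\u{0008}")
  |> should.equal("\"\\u{0008}\"")
}
```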

Michael-Mark-Edu commented 5 months ago

I did some grepping in the stdlib and couldn't find any mention of \b related to JavaScript, so I think this might be behaviour built into JavaScript/Node.js itself. If so, then we will likely have to override whatever function is generating the \b on the JavaScript side.

Michael-Mark-Edu commented 5 months ago

It also seems that unrecognized Unicode code points get converted into \uxxxx format instead of \u{xxxx} when targeting JavaScript, so none of the other tests I added should be passing either.

Michael-Mark-Edu commented 5 months ago

Going down the chain of nested function calls, I eventually came across JSON.stringify(v), which seems to be where the \b and \u000B are coming from. Testing it in the Node CLI, I'm getting the non-Gleam escape sequences. I think we may need to replace JSON.stringify(v) with our own function here if we don't want parity issues between JavaScript and Erlang, or JavaScript producing invalid Gleam strings.

Michael-Mark-Edu commented 5 months ago

Alright, I bit the bullet and decided to do the JS implementation myself. Now, the output of string.inspect should be the same between Erlang and JS.

I'm not experienced with JS at all, so I am requesting code review on commit 1cec04a.

Michael-Mark-Edu commented 5 months ago

@lpil this PR should be ready for another workflow test (running the workflow's commands locally, everything passes) and code review.

lpil commented 5 months ago

Sorry, you'll need to rebase on main to have the tests run, due to the merge conflict.

Michael-Mark-Edu commented 5 months ago

Alright, the merge conflict should be resolved. It seems git got confused and flagged a conflict even though the changelog didn't have any serious conflicts.

Michael-Mark-Edu commented 5 months ago

It seems like some upstream changes are causing gleam format --check to fail on unrelated files, through no fault of my own. I'm on my phone at the moment, so I can't push a formatting commit right now.

lpil commented 5 months ago

Hmm, I'm not sure why rebasing the branch to remove the merge commit closed the PR. Perhaps something to do with it being called main? I think I made a mistake somewhere. I'll rebase these onto main manually, thank you.