rome / tools

Unified developer tools for JavaScript, TypeScript, and the web
https://docs.rome.tools/
MIT License

💭 Escape non-printable characters #1133

Closed Skalman closed 1 year ago

Skalman commented 4 years ago

Having non-printable characters in the code can be very confusing. I'd suggest automatically reformatting some characters by using their escape sequences instead. Usually, when these types of characters end up in the source code, it's a mistake, and depending on the editor they may well be invisible and therefore very hard and annoying to debug.

// There's a zero-width space between `Hello` and `World`

// Before
console.log("Hello​World");

// After
console.log("Hello\u200bWorld");

Whitespace

I've personally had problems with U+200B zero-width space and U+00A0 non-breaking space. But I think it's reasonable to escape all space characters except U+000A newlines (and probably U+000D carriage returns), U+0020 spaces and U+0009 tabs.

Here are other space chars: U+000B U+000C U+0085 U+1680 U+2000 U+2001 U+2002 U+2003 U+2004 U+2005 U+2006 U+2007 U+2008 U+2009 U+200A U+2028 U+2029 U+202F U+205F U+2060 U+3000
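To make the whitespace proposal concrete, here is a minimal sketch of what such an escaping pass could look like. It is not Rome's implementation: the names `UNUSUAL_WHITESPACE`, `toUnicodeEscape` and `escapeUnusualWhitespace` are made up for illustration, and the function works on an already-decoded string value rather than on an AST. All listed code points are in the Basic Multilingual Plane, so iterating over UTF-16 code units is enough here.

```typescript
// Hypothetical list: U+200B and U+00A0 plus the "other space chars" above.
// U+000A, U+000D, U+0020 and U+0009 are deliberately absent so they stay as-is.
const UNUSUAL_WHITESPACE: number[] = [
  0x000b, 0x000c, 0x0085, 0x00a0, 0x1680,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a, 0x200b,
  0x2028, 0x2029, 0x202f, 0x205f, 0x2060, 0x3000,
];

// Format a BMP code unit as a four-digit \uXXXX escape.
function toUnicodeEscape(code: number): string {
  return "\\u" + ("0000" + code.toString(16)).slice(-4);
}

// Replace every listed whitespace character with its escape sequence.
function escapeUnusualWhitespace(value: string): string {
  let out = "";
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    out += UNUSUAL_WHITESPACE.indexOf(code) !== -1
      ? toUnicodeEscape(code)
      : value.charAt(i);
  }
  return out;
}
```

Applied to the example at the top, `escapeUnusualWhitespace("Hello\u200bWorld")` yields the text `Hello\u200bWorld` with a literal backslash, while ordinary spaces, tabs and newlines pass through unchanged.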

Text direction

I've personally had problems with the U+200E left-to-right mark showing up unexpectedly. I don't think it would annoy anyone, but if you want to be conservative, it could be escaped only when the rest of the string is ASCII. If you don't want to be conservative, we should also escape the U+200F right-to-left mark.

Soft hyphens

U+00AD soft hyphens usually don't render at all either. Escape them too.
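One possible reading of the text-direction and soft-hyphen rules, as a sketch: in conservative mode U+200E is escaped only when the rest of the string is ASCII and U+200F is left alone; in non-conservative mode both marks are always escaped; U+00AD is escaped in either mode. The function name `escapeDirectionMarks` and the `conservative` flag are assumptions made for this example, not an existing API.

```typescript
// Hypothetical helper illustrating the conservative / non-conservative split.
function escapeDirectionMarks(value: string, conservative: boolean): string {
  // "The rest of the string" is taken to mean everything except U+200E itself.
  const rest = value.replace(/\u200e/g, "");
  const restIsAscii = /^[\x00-\x7f]*$/.test(rest);

  return value.replace(/[\u200e\u200f\u00ad]/g, (ch) => {
    const escaped = "\\u" + ("0000" + ch.charCodeAt(0).toString(16)).slice(-4);
    if (ch === "\u00ad") return escaped; // soft hyphen: always escape
    if (!conservative) return escaped;   // both marks escaped unconditionally
    // Conservative: only the left-to-right mark, and only in ASCII strings.
    return ch === "\u200e" && restIsAscii ? escaped : ch;
  });
}
```

With this reading, a stray left-to-right mark in `"abc\u200e"` is escaped even in conservative mode, while the same mark inside a Hebrew or Arabic string is left untouched unless `conservative` is `false`.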

Other control characters

I don't think I've personally ever needed other control characters automatically escaped, but I don't see why not. A conservative approach would be to skip these until someone has had a problem. Here's a list: https://www.fileformat.info/info/unicode/char/200f/index.htm

sebmck commented 4 years ago

I could have sworn I created an issue related to this, but I might have just added a TODO to create one...

Our current strategy for determining when to escape characters isn't very good. Right now anything that isn't ASCII gets output as a Unicode escape inside of strings. It's really an awful experience for users working in different languages.

As far as I know, Prettier gets around this by using the raw values from the parser. I don't like this strategy: it requires keeping track of the original source value, even though the code could have been transformed and the string modified since it was parsed. Babel has a hack where it tags on some additional properties so the two can be reconciled. This sort of introspection and duplication isn't good to have in an AST.

We should definitely figure out some better heuristics here.

Related, we have a method called showInvisibles at internal/cli-diagnostics/utils.ts which we use when displaying code snippets. It shows invisible characters such as carriage returns, tabs, spaces, null characters, zero-width spaces etc. This makes it much more obvious when we're displaying diffs where you can't visually see any differences. So there's precedent in Rome for us caring about this.

Skalman commented 4 years ago

I assume what you're saying is that there are situations where you don't store the original location, meaning you can't reconstruct the raw value.

I'm not sure that it's possible to have a one-solution-fits-all without keeping at least some raw string values.

Here's me thinking out loud about my needs/preferences:

  1. I always want to escape non-printable chars
  2. I've had a modern environment mangle non-ASCII code (Windows, Visual Studio, Prettier) - so we've had to escape letters that we wouldn't have wanted to escape
  3. I've wanted to use Font Awesome's code points directly in code (e.g. \uf29d) - this I want escaped
  4. I don't want to escape any letters that I can create on my keyboard (for me that's most Latin letters with most diacritic marks)
  5. I want to escape symbols that look like common symbols (e.g. all dashes except -)
  6. More generally, I want to escape chars that look like "common" chars (e.g. а, which is Cyrillic)
  7. I'm ambivalent about emoji. Why escape: they're very difficult to type. Why not escape: the escape sequences are too long, and the specific encoding generally isn't important

And here are my suggestions, assuming you really don't want to keep the original string:

  1. Escape non-printable chars
  2. You'll need an environment that can handle non-ASCII code
  3. Escape private use chars
  4. Don't escape letters (in any language - it's unreasonable to be eurocentric)
  5. Escape dashes and other symbols that are easily confused with common chars (which ones?)
  6. Shouldn't be language-specific
  7. Don't escape emoji
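
As a rough sketch of how suggestions 1, 3, 4 and 7 could be expressed without any language-specific lists, the decision can be driven by Unicode general categories. Everything here is an assumption for illustration (the names `ESCAPE_CANDIDATES`, `KEEP_AS_IS` and `escapeForSource`, and the choice of categories); it is not how Rome decides today. It needs a runtime with Unicode property escapes in regular expressions (ES2018). It does not cover suggestion 5 (confusable dashes and symbols), which would need an explicit list, and because it escapes all format characters it would also escape the zero-width joiner used inside some emoji sequences.

```typescript
// Categories treated as "escape": control (Cc), format (Cf), private use (Co),
// and space, line and paragraph separators (Zs, Zl, Zp).
// Letters, marks, symbols and plain emoji are in none of these, so they are kept.
const ESCAPE_CANDIDATES = new RegExp(
  "[\\p{Cc}\\p{Cf}\\p{Co}\\p{Zs}\\p{Zl}\\p{Zp}]",
  "gu",
);

// Common whitespace that matches the categories above but should stay readable.
const KEEP_AS_IS: string[] = [" ", "\t", "\n", "\r"];

function escapeForSource(value: string): string {
  return value.replace(ESCAPE_CANDIDATES, (ch) => {
    if (KEEP_AS_IS.indexOf(ch) !== -1) return ch;
    if (ch.length === 1) {
      // BMP character: four-digit \uXXXX escape.
      return "\\u" + ("0000" + ch.charCodeAt(0).toString(16)).slice(-4);
    }
    // Supplementary-plane character: combine the surrogate pair into a
    // code point and use the \u{...} form.
    const code =
      (ch.charCodeAt(0) - 0xd800) * 0x400 +
      (ch.charCodeAt(1) - 0xdc00) +
      0x10000;
    return "\\u{" + code.toString(16) + "}";
  });
}
```

A real printer would still have to handle quotes, backslashes and line breaks separately; this only decides which characters get turned into Unicode escapes.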