Closed Skalman closed 1 year ago
I could have sworn I created an issue related to this, but I might have just added a TODO to create one...
Our current strategy for determining when to escape characters isn't very good. Right now anything that isn't ASCII gets output as a Unicode escape inside strings. It's really an awful experience for users working in languages other than English.
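To make the complaint concrete, here's a minimal sketch of what "escape everything that isn't ASCII" does to a string. The function name escapeNonAscii is made up for illustration and this is not the actual formatter code, just an approximation of the behavior described above.

```typescript
// Hypothetical helper, NOT the real implementation: it only mimics the
// "anything non-ASCII becomes a Unicode escape" behavior.
function escapeNonAscii(value: string): string {
  let out = "";
  // Walk UTF-16 code units, so astral characters come out as two
  // surrogate escapes.
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    if (unit > 0x7f) {
      // Left-pad the hex digits to the four places that \uXXXX requires.
      out += "\\u" + ("0000" + unit.toString(16)).slice(-4);
    } else {
      out += value.charAt(i);
    }
  }
  return out;
}

console.log(escapeNonAscii("日本語")); // \u65e5\u672c\u8a9e
```

A perfectly readable Japanese string turns into three opaque escapes, which is the experience being complained about.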
As far as I know, Prettier gets around this by using the raw values from the parser. I don't like this strategy because it requires keeping track of the original source value, even though code could have been transformed and the string modified. Babel has a hack where it tags on some additional properties so it can be reconciled. This sort of introspection and duplication isn't good to have in an AST.
We should definitely figure out some better heuristics here.
Related, we have a method called showInvisibles (at internal/cli-diagnostics/utils.ts) which we use when displaying code snippets. It shows invisible characters such as carriage returns, tabs, spaces, null characters, zero-width spaces, etc. This makes it much more obvious when we're displaying diffs where you can't visually see any differences. So there's precedent in Rome for us caring about this.
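For anyone who hasn't seen that kind of helper, here's a rough sketch of the idea. This is not the code in internal/cli-diagnostics/utils.ts. The name showInvisiblesSketch, the marker table, and the choice of escape-style markers are all assumptions made up for illustration, and it only covers a subset of the characters mentioned above.

```typescript
// Hypothetical marker table: invisible character -> visible stand-in.
const VISIBLE_MARKERS: { [char: string]: string } = {
  "\r": "\\r",
  "\t": "\\t",
  "\u0000": "\\0",
  "\u200b": "\\u200b",
};

// Replace each known invisible character with its visible marker and
// leave everything else untouched.
function showInvisiblesSketch(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (Object.prototype.hasOwnProperty.call(VISIBLE_MARKERS, ch)) {
      out += VISIBLE_MARKERS[ch];
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(showInvisiblesSketch("a\tb\u200bc")); // a\tb\u200bc
```

Two diff lines that differ only by a zero-width space would otherwise look identical. With the markers, the difference is visible.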
I assume what you're saying is that there are situations where you don't store the original location, meaning you can't reconstruct the raw value.
I'm not sure that it's possible to have a one-solution-fits-all without keeping at least some raw string values.
Here's me thinking out loud about my needs/preferences:
(\uf29d) - this I want escaped
And here are my suggestions, assuming you really don't want to keep the original string:
Having non-printable characters in the code can be very confusing. I'd suggest automatically reformatting some characters by using their escape sequences instead. Usually, when these types of characters are added to the source code, it's a mistake, and depending on the editor it may well be invisible and therefore very hard/annoying to debug.
Whitespace
I've personally had problems with U+200B zero-width space and U+00A0 non-breaking space. But I think it's reasonable to escape all space characters except U+000A newlines (and probably U+000D carriage returns), U+0020 spaces and U+0009 tabs.
Here are other space chars:
U+1680 U+2000 U+2001 U+2002 U+2003 U+2004 U+2005 U+2006 U+2007 U+2008 U+2009 U+200A U+2028 U+2029 U+202F U+205F U+3000 U+000B U+000C U+0085 U+202F U+2007 U+2060
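A minimal sketch of that whitespace rule, assuming the code points listed above (plus U+200B and U+00A0) are the full set to escape. The names ESCAPED_SPACE_CODES and escapeUnusualWhitespace are made up for illustration. Newlines, carriage returns, plain spaces and tabs are simply not in the table, so they pass through untouched.

```typescript
// Code points taken from the list above, deduplicated. Assumption: this is
// the complete set to escape. Everything here is in the BMP, so comparing
// UTF-16 code units is enough.
const ESCAPED_SPACE_CODES: number[] = [
  0x000b, 0x000c, 0x0085, 0x00a0, 0x1680,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a, 0x200b,
  0x2028, 0x2029, 0x202f, 0x205f, 0x2060, 0x3000,
];

function escapeUnusualWhitespace(value: string): string {
  let out = "";
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (ESCAPED_SPACE_CODES.indexOf(code) !== -1) {
      // Emit a four-digit \uXXXX escape.
      out += "\\u" + ("0000" + code.toString(16)).slice(-4);
    } else {
      out += value.charAt(i);
    }
  }
  return out;
}

console.log(escapeUnusualWhitespace("a\u00a0b")); // a\u00a0b
console.log(escapeUnusualWhitespace("a b"));      // a b
```

Non-ASCII letters are left alone here, so a string like "日本語" stays readable and only the confusing whitespace is rewritten.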
Text direction
I've personally had problems with the U+200E left-to-right mark showing up unexpectedly. I don't think escaping it would annoy anyone, but if you want to be conservative, it could be escaped only when the rest of the string is ASCII. If you don't want to be conservative, we should also escape the U+200F right-to-left mark.
Soft hyphens
U+00AD soft hyphens also usually don't render at all. Escape them too.
Other control characters
I don't think I've personally ever needed other control characters automatically escaped, but I don't see why not. A conservative approach would be to skip these until someone has had a problem. Here's a list: https://www.fileformat.info/info/unicode/char/200f/index.htm
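The text direction and soft hyphen suggestions above could be sketched roughly like this. The function name and the conservative flag are hypothetical. One assumption worth flagging: "the rest of the string is ASCII" is read here as "every code unit other than U+200E is ASCII", so a soft hyphen elsewhere in the string counts as non-ASCII.

```typescript
// Hypothetical sketch of the suggested heuristics:
// - U+00AD soft hyphen: always escaped
// - U+200E left-to-right mark: always escaped, or (conservative) only when
//   the rest of the string is ASCII
// - U+200F right-to-left mark: escaped only in non-conservative mode
function escapeMarksAndSoftHyphens(value: string, conservative: boolean): string {
  // Assumption: "rest of the string" means every code unit except U+200E.
  let restIsAscii = true;
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0x7f && code !== 0x200e) {
      restIsAscii = false;
    }
  }

  let out = "";
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    let shouldEscape = false;
    if (code === 0x00ad) {
      shouldEscape = true;
    } else if (code === 0x200e) {
      shouldEscape = conservative ? restIsAscii : true;
    } else if (code === 0x200f) {
      shouldEscape = !conservative;
    }
    out += shouldEscape
      ? "\\u" + ("0000" + code.toString(16)).slice(-4)
      : value.charAt(i);
  }
  return out;
}

console.log(escapeMarksAndSoftHyphens("abc\u200e", true)); // abc\u200e
```

In conservative mode a Hebrew or Arabic string that legitimately contains a left-to-right mark is left as written, while a stray mark in otherwise-ASCII text gets surfaced as an escape.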