Normalize line endings in multiline strings

DavisVaughan commented 1 day ago

We have this CI failures on Windows https://github.com/posit-dev/air/actions/runs/12042018483/job/33574939950?pr=69

With this test

---- formatter::r::specs::r::value::string_value_r stdout ----
thread 'formatter::r::specs::r::value::string_value_r' panicked at C:\Users\runneradmin\.cargo\git\checkouts\biome-d49ce8898b420a50\2648fa4\crates\biome_formatter\src\builders.rs:394:5:
The content 'r"("some raw string
business")"' contains an unsupported '\r' line terminator character but text must only use line feeds '\n' as line separator. Use '\n' instead of '\r' and '\r\n' to insert a line break in strings.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Which formats a file containing this string

r"("some raw string
business")"

Note that this is a multiline string. The raw string there isn't critical to the reprex, it's just what we happened to have there.

This fails because:

On windows, git checks the file out with CRLF line endings. Note this puts a \r\n inside the string contents.
The formatter fails inside located_token_text() for a string because of this no-\r assertion

Both biome and ruff use this assertion, so I imagine it is for a good reason.

The way both of them handle this is by using variants of normalize_string() here: https://github.com/biomejs/biome/blob/cd1c8ec4249e8df8d221393586d664537c9fddb2/crates/biome_formatter/src/token/string.rs#L48

Note that the docs say one of the things it does is replace \r\n with just \n.

They call this at fmt() time for a string node, before it actually drops through to located_token_text(), so they get a chance to "fix up" the string here. We could probably create our own version of this and strip out everything except the \r\n conversion, until we decide to do more normalization.

Why do they even do this?

I imagine the idea is that typically the line ending doesn't really matter, it becomes part of the Trivia set, but is not part of an actual node. But here, the line ending gets parsed into the actual string contents. I assume they want to ensure that the actual nodes of the tree look the same across OSes regardless of line endings, and standardizing on \n inside multiline strings is the way to achieve that.

lionel- commented 1 day ago

The LSP normalises line endings here: https://github.com/posit-dev/air/blob/003c31bd2f0ec5379065a019c500ebe38fd914a3/crates/lsp/src/documents.rs#L55

DavisVaughan commented 1 day ago

I find this kind of interesting because:

For LSP workflows:

LSP normalizes to all \n endings on the way in (both normal line endings and these weird ones in multiline string ones)
We format
LSP writes out with \r\n endings

For cli workflows:

We format, replacing \r\n in multiline strings with \n, but otherwise leave the \r\n as is

So in cli workflows you can end up with what we want, which is \r\n line endings everywhere except in the multiline strings, which have \n line endings. But in LSP workflows you can't ever end up with the "right" thing on the way out, the last step of replacing all \n with \r\n means that on the way out even multiline strings get \r\n endings again.

There's a decent chance this is why biome retains the original line endings in the LSP side of things

posit-dev / air

Normalize line endings in multiline strings #74

Why do they even do this?