Open richard-viney opened 7 months ago
Thank you
This is a bit subtle, but as far as I can tell from skimming through the Erlang/OTP string
module is that the functions are treating the \r\n
as a single grapheme cluster.
First, check out string:length
which returns the number of grapheme clusters the given string.
12> string:length("\r").
1
13> string:length("\r\n").
1
14> string:length("\n").
1
15> string:length("\n\n").
2
16> string:length("\r\r").
2
For trying to use the "\r"
as the needle to replace in the haystack "\r\n"
, here is my understanding from following along with the code to see why \r
will not work like the \n
or \r\n
.
split
split_1
prefix_1
unicode_util:gc
on the \r\n
, which returns ["\r\n"]
(ie one grapheme cluster), which will then return the nomatch
from prefix_1
when searching for \r
nomatch
, so then split_1
is called again with the tail of "\r\n"
. But tail of that thing is the "\n"
because the strings in erlang are list of codepoints:1> [Head|Tail] = "\r\n".
"\r\n"
2> Head.
13
3> Tail.
"\n"
So, (as far as I can tell from going through the Erlang/OTP code right now), that's how the \r\n
can match, and the \n
can match, but not the \r
by itself.
Yep agreed it's due to the \r\n
being treated as a grapheme cluster, this is mentioned in several places including here: https://www.erlang.org/doc/man/string#trim-3:
"Notice that [$\r,$\n] is one grapheme cluster according to the Unicode Standard."
So the question is which of JS or Erlang is the desired/intended behaviour? Should it replace by codepoint or by grapheme cluster 🤔.
Sounds like the JS implementation is incorrect.
Erlang's splitting behaviour seems inconsistent though Currently when you call the split function for either the Erlang or Javascript target we end up with this:
import gleam/should
import gleam/string
pub fn split_newline_test() {
string.split("\r\n","\n")
|> should.equal(["\r",""])
}
Meaning Erlang just like Javascript is perfectly ok with removing the \n
from a \r\n
grapheme cluster.
However Erlang is not ok with
import gleam/should
import gleam/string
pub fn split_newline_test() {
string.split("\r\n","\r")
|> should.equal(["","\n"])
}
whereas Javascript is.
I would have expected Erlang to reject attempting to split a \n
on a \r\n
cluster if it was always treating \r\n
as a separate thing from either \r
or \n
So if we want to fully copy what Erlang does we need to ensure Javascript still DOES split on \n
which is a bit odd...
Edit: The most "correct" behaviour to me would be either:
1) Make \r\n
split on neither \r
nor \n
or
2) Treat \r\n
as \n
so that when you split on \n
it removes the entire cluster \r\n
import gleam/should
import gleam/string
pub fn split_newline_test() {
string.split("thing\r\nthing","\n")
|> should.equal(["thing","thing"])
}
Both sequences are semantically the same in most cases and leaving a dangling \r
would be very odd to me.
I was expecting
string.replace
to treat control characters like any other code points, but then ran into this behaviour which looks wrong.The issue is only on the Erlang target, works fine under JS.
I can dig a bit deeper if it would be helpful, possible that other control characters are affected too.