janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.43k stars 221 forks source link

Long-string parsing for `CR` `LF` vs `LF` eol sequences differ #1302

Closed sogaiu closed 10 months ago

sogaiu commented 10 months ago

The following code yields different results [1] when it contains different end-of-line sequences (CR LF vs LF) [2] (tested with c708ff9708786f06744f92904dd5e2c0d2dd58a2):

(def long-string
  ``
  first line
  second line
  ``)

(pp long-string)

When using CR LF, the result I get is:

"\r\nfirst line\r\nsecond line\r"

When using just LF, the result I get is:

"first line\nsecond line"

I presume the latter form (when only LF is used) is correct in that it removes the leading and trailing eol sequences.

Presumably when CR LF is used for a file, one would want the result to be:

"first line\r\nsecond line"

Update: I think we may not want the CR in this example. See this comment below.


Below are a couple of attempts at adjustment, but perhaps there are better ways :)

A relatively minimal patch that seems to yield consistent results is:

diff --git a/src/core/parse.c b/src/core/parse.c
index 40ccfbf2..a87d7c52 100644
--- a/src/core/parse.c
+++ b/src/core/parse.c
@@ -376,6 +376,12 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                 }
             }
         }
+        /* Adjust things a bit if there is a leading \r\n eol sequence. */
+        if (buflen > 1 && (bufstart[0] == '\r') &&
+                (bufstart[1] == '\n')) {
+            buflen--;
+            bufstart++;
+        }
         /* Now reindent if able to, otherwise just drop leading newline. */
         if (!reindent) {
             if (buflen > 0 && bufstart[0] == '\n') {
@@ -400,9 +406,12 @@ static int stringend(JanetParser *p, JanetParseState *state) {
             }
             buflen = (int32_t)(w - bufstart);
         }
-        /* Check for trailing newline character so we can remove it */
+        /* Check for end of line sequence so we can remove it */
         if (buflen > 0 && bufstart[buflen - 1] == '\n') {
             buflen--;
+            if (buflen > 0 && bufstart[buflen - 1] == '\r') {
+                buflen--;
+            }
         }
     }
     if (state->flags & PFLAG_BUFFER) {

A somewhat clearer (but I have doubts about equivalence) patch is:

diff --git a/src/core/parse.c b/src/core/parse.c
index 40ccfbf2..a24d0875 100644
--- a/src/core/parse.c
+++ b/src/core/parse.c
@@ -376,32 +376,34 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                 }
             }
         }
-        /* Now reindent if able to, otherwise just drop leading newline. */
-        if (!reindent) {
-            if (buflen > 0 && bufstart[0] == '\n') {
-                buflen--;
-                bufstart++;
-            }
-        } else {
+        /* Reindent if needed. */
+        if (reindent) {
             uint8_t *w = bufstart;
             r = bufstart;
+            uint8_t at_newline = 0;
             while (r < end) {
-                if (*r == '\n') {
-                    if (r == bufstart) {
-                        /* Skip leading newline */
-                        r++;
-                    } else {
-                        *w++ = *r++;
-                    }
+                at_newline = (*r == '\n') ? 1 : 0;
+                *w++ = *r++;
+                if (at_newline) {
                     for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++);
-                } else {
-                    *w++ = *r++;
                 }
             }
             buflen = (int32_t)(w - bufstart);
         }
-        /* Check for trailing newline character so we can remove it */
-        if (buflen > 0 && bufstart[buflen - 1] == '\n') {
+        /* Skip leading eol sequence if needed. */
+        if (buflen > 0 && bufstart[0] == '\n') {
+            buflen--;
+            bufstart++;
+        } else if (buflen > 1 && (bufstart[0] == '\r') &&
+                   (bufstart[1] == '\n')) {
+            buflen -= 2;
+            bufstart += 2;
+        }
+        /* Remove trailing eol sequence if needed. */
+        if (buflen > 1 && (bufstart[buflen - 2] == '\r') &&
+                (bufstart[buflen - 1] == '\n')) {
+            buflen -= 2;
+        } else if (buflen > 0 && bufstart[buflen - 1] == '\n') {
             buflen--;
         }
     }

Thanks to @pyrmont for aid in investigation and working on patches.


[1] It's the leading and trailing parts differing that's a problem. That end-of-line sequences that are "interior" differ is to be expected.

[2] In this issue, notationally, CR and "carriage return" mean the same, and likewise for LF and "line feed".

sogaiu commented 10 months ago

It may be that neither of the patches work.

Here's a case I think may not be handled:

(def long-string-2
  ``
  first line

  second line
  ``)

(pp long-string-2)

Expected:

"first line\r\n\r\nsecond line"

Actual:

"  first line\r\n\r\n  second line\r\n  "
sogaiu commented 10 months ago

Updated diffs below. These seem to work better.

Top version:

diff --git a/src/core/parse.c b/src/core/parse.c
index 40ccfbf2..66046e3f 100644
--- a/src/core/parse.c
+++ b/src/core/parse.c
@@ -368,7 +368,8 @@ static int stringend(JanetParser *p, JanetParseState *state) {
         int reindent = 1;
         while (reindent && (r < end)) {
             if (*r++ == '\n') {
-                for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) {
+                for (int32_t j = 0; (r < end) && (*r != '\n') &&
+                        (*r != '\r') && (j < indent_col); j++, r++) {
                     if (*r != ' ') {
                         reindent = 0;
                         break;
@@ -376,6 +377,12 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                 }
             }
         }
+        /* Adjust things a bit if there is a leading \r\n eol sequence. */
+        if (buflen > 1 && (bufstart[0] == '\r') &&
+                (bufstart[1] == '\n')) {
+            buflen--;
+            bufstart++;
+        }
         /* Now reindent if able to, otherwise just drop leading newline. */
         if (!reindent) {
             if (buflen > 0 && bufstart[0] == '\n') {
@@ -393,16 +400,20 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                     } else {
                         *w++ = *r++;
                     }
-                    for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++);
+                    for (int32_t j = 0; (r < end) && (*r != '\n') &&
+                            (*r != '\r') && (j < indent_col); j++, r++);
                 } else {
                     *w++ = *r++;
                 }
             }
             buflen = (int32_t)(w - bufstart);
         }
-        /* Check for trailing newline character so we can remove it */
+        /* Check for end of line sequence so we can remove it */
         if (buflen > 0 && bufstart[buflen - 1] == '\n') {
             buflen--;
+            if (buflen > 0 && bufstart[buflen - 1] == '\r') {
+                buflen--;
+            }
         }
     }
     if (state->flags & PFLAG_BUFFER) {

Bottom version:

diff --git a/src/core/parse.c b/src/core/parse.c
index 40ccfbf2..ee678814 100644
--- a/src/core/parse.c
+++ b/src/core/parse.c
@@ -368,7 +368,8 @@ static int stringend(JanetParser *p, JanetParseState *state) {
         int reindent = 1;
         while (reindent && (r < end)) {
             if (*r++ == '\n') {
-                for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) {
+                for (int32_t j = 0; (r < end) && (*r != '\n') &&
+                        (*r != '\r') && (j < indent_col); j++, r++) {
                     if (*r != ' ') {
                         reindent = 0;
                         break;
@@ -376,32 +377,35 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                 }
             }
         }
-        /* Now reindent if able to, otherwise just drop leading newline. */
-        if (!reindent) {
-            if (buflen > 0 && bufstart[0] == '\n') {
-                buflen--;
-                bufstart++;
-            }
-        } else {
+        /* Reindent if needed. */
+        if (reindent) {
             uint8_t *w = bufstart;
             r = bufstart;
+            uint8_t at_newline = 0;
             while (r < end) {
-                if (*r == '\n') {
-                    if (r == bufstart) {
-                        /* Skip leading newline */
-                        r++;
-                    } else {
-                        *w++ = *r++;
-                    }
-                    for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++);
-                } else {
-                    *w++ = *r++;
+                at_newline = (*r == '\n') ? 1 : 0;
+                *w++ = *r++;
+                if (at_newline) {
+                    for (int32_t j = 0; (r < end) && (*r != '\n') &&
+                            (*r != '\r') && (j < indent_col); j++, r++);
                 }
             }
             buflen = (int32_t)(w - bufstart);
         }
-        /* Check for trailing newline character so we can remove it */
-        if (buflen > 0 && bufstart[buflen - 1] == '\n') {
+        /* Skip leading eol sequence if needed. */
+        if (buflen > 0 && bufstart[0] == '\n') {
+            buflen--;
+            bufstart++;
+        } else if (buflen > 1 && (bufstart[0] == '\r') &&
+                   (bufstart[1] == '\n')) {
+            buflen -= 2;
+            bufstart += 2;
+        }
+        /* Remove trailing eol sequence if needed. */
+        if (buflen > 1 && (bufstart[buflen - 2] == '\r') &&
+                (bufstart[buflen - 1] == '\n')) {
+            buflen -= 2;
+        } else if (buflen > 0 && bufstart[buflen - 1] == '\n') {
             buflen--;
         }
     }

Thanks to @pyrmont for the fix!

sogaiu commented 10 months ago

So thinking over it some more, it seems problematic for literal long-strings to yield different values depending on the end-of-line sequence used within a source file.

As additional data, the REPL behavior is consistent between Linux and Windows:

repl:1:> ``hello
repl:2:``> there``
"hello\nthere"

Perhaps each CR that appears right before a LF character in a long-string literal should be dropped?

That doesn't seem completely problem-free [1], but may be it's a better behavior than the current situation.


[1] May be for nearly every case (and likely case) it would be fine?

sogaiu commented 10 months ago

So perhaps something like this:

diff --git a/src/core/parse.c b/src/core/parse.c
index 40ccfbf2..59908f24 100644
--- a/src/core/parse.c
+++ b/src/core/parse.c
@@ -368,7 +368,8 @@ static int stringend(JanetParser *p, JanetParseState *state) {
         int reindent = 1;
         while (reindent && (r < end)) {
             if (*r++ == '\n') {
-                for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) {
+                for (int32_t j = 0; (r < end) && (*r != '\n') &&
+                        (*r != '\r') && (j < indent_col); j++, r++) {
                     if (*r != ' ') {
                         reindent = 0;
                         break;
@@ -376,6 +377,12 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                 }
             }
         }
+        /* Adjust things a bit if there is a leading \r\n eol sequence. */
+        if (buflen > 1 && (bufstart[0] == '\r') &&
+                (bufstart[1] == '\n')) {
+            buflen--;
+            bufstart++;
+        }
         /* Now reindent if able to, otherwise just drop leading newline. */
         if (!reindent) {
             if (buflen > 0 && bufstart[0] == '\n') {
@@ -393,16 +400,24 @@ static int stringend(JanetParser *p, JanetParseState *state) {
                     } else {
                         *w++ = *r++;
                     }
-                    for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++);
+                    for (int32_t j = 0; (r < end) && (*r != '\n') &&
+                            (*r != '\r') && (j < indent_col); j++, r++);
+                } else if ((*r == '\r') &&
+                           (r + 1 < end) && (*(r + 1) == '\n')) {
+                    /* Skip carriage return that is right before a newline */
+                    r++;
                 } else {
                     *w++ = *r++;
                 }
             }
             buflen = (int32_t)(w - bufstart);
         }
-        /* Check for trailing newline character so we can remove it */
+        /* Check for end of line sequence so we can remove it */
         if (buflen > 0 && bufstart[buflen - 1] == '\n') {
             buflen--;
+            if (buflen > 0 && bufstart[buflen - 1] == '\r') {
+                buflen--;
+            }
         }
     }
     if (state->flags & PFLAG_BUFFER) {
sogaiu commented 10 months ago

Perhaps each CR that appears right before a LF character in a long-string literal should be dropped?

I think this is how Python3's multiline string literals work [1] too. For these two bits of code:

s = """a
^Mj
z"""

print(len(s))
s = """a^M
^Mj^M
z"""^M
^M
print(len(s))^M

where ^M represents a CR , the output is 6.

That is, whether the line endings are LF or CR LF, the multiline string has the same length.

I think the patch in the previous comment implements this behavior.


[1] Haven't tracked down the source, but this bit might be relevant. I think this is part of "Universal Newline Support" defined in PEP 278.

According to Wikipedia (see near end of the linked section):

Python permits "Universal Newline Support" when opening a file for reading, when importing modules, and when executing a file.

sogaiu commented 10 months ago

I found summaries regarding some programming languages line termination handling starting from here.

Languages mentioned include:

One unexpected issue for me was potential security implications of \r-handling. Not imaginitive enough (^^;

Note that most of this is from around 10 years ago.

Also of referential interest might be this Wikipedia article about Newline that was mentioned in the above issue. Of particular note might be the "In programming languages" section.


Apparently in Rust:

According to the Raw string literals section in the reference the "raw string body can contain any sequence of Unicode characters" and "[a]ll Unicode characters contained in the raw string body represent themselves". However in such a raw string carriage return + newline sequences (U+000D U+000A) are replaced by a single newline char (U+000A).

This is from 2021.

That seems similar to the Python3 behavior described above.

sogaiu commented 10 months ago

Trying to spell out some rationale for long-strings (and long-buffers).

So far I have:

ATM, the website docs have this text:

Long-strings do not contain escape sequences; all bytes will be parsed literally until the ending delimiter is found. This is useful for defining multi-line strings containing literal newline characters, unprintable characters, or strings that would otherwise require many escape sequences.

My suspicion at the moment is that the use-case of constructing a multi-line source string via a literal where the end-of-line delimiters are not converted to LF is not really addressed widely in other programming languages.

IIUC, in most C implementations(?), end-of-line sequences get converted to LF pretty early. The following is from 5.2.1 from C99 (at least a draft I found):

In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character.

I don't know if this set a precedent which influenced some newer programming languages, but wouldn't be surprised if it did.

sogaiu commented 10 months ago

ATM, it seems to me that parsing might be improved so that source produced in a typical Windows environment might work more consistently with that produced in a more Unix-y environment [1].

There are two obvious ways to me (may be I missed some) one might do this:

  1. Drop CR when constructing a long-string
  2. Replace CR LF sequences with LF

I think the former is simpler to implement, but it seems less expressive as one would not be able to get a string or buffer containing a CR in it from a long-string literal or long-buffer literal. (The patch above is an attempt to do 2., FWIW).

Below are a couple of bits about what Rust does for comparative purposes...

AFAICT, Rust does 1. (drops CR) for // comments. Perhaps for that use case, it's sensible.

For raw string literals though, apparently Rust only replaces CR if it isn't part of a CR LF sequence, so does 2. Again, may be that makes sense for the use cases one might want to cover.

I bring these bits up to try to aid in reasoning about what to do for Janet [2]. My previous comment in this issue tries to spell out some rationale for long-strings (which includes some use cases) to further inform thinking about this situation.


[1] To give a bit of background, this came up for me because tests I had written behave differently on Windows than they do on Linux.

[2] I think impementing 1. or 2. is better than not changing the current situation, but perhaps others have different opinions...

bakpakin commented 10 months ago

This is all good discussion, and I do think we should strip out in some cases, but a goal of long strings is to be able to encode any binary sequence, so just dropping all CR doesn't quite work here.

There is some special case at the beginning and end of the strings to allow for the form:


``
Text here
``

to not have leading and trailing lines, and allow for indentation. So I'm not sure about how I feel about doing anything to the internal contents if there is no indentation

sogaiu commented 10 months ago

Do you see problems with a transformation that replaces CR LF with LF? (I think this might be doable while maintaining the existing behavior for handling of indentation and the trimming of single leading / trailing end-of-line sequences.)

One thing I think isn't great about this idea is that it makes it awkward (and possibly brittle) to construct multiline strings that include CR LF [1].

It seems doable to me if one constructs the long-string literal by using CR CR LF as the bytes that come at the end of each line (so-to-speak). When the corresponding string is made, the last 2 bytes (CR LF) per line would be replace by a single LF byte resulting in CR + LF (so 2 bytes instead of 3).

I'm not sure I'd want to write such long-string literals if they have many lines because it doesn't seem ergonomic and I wonder if some editors (and perhaps version control systems) might mess with my source, corrupting my data...


[1] I haven't come across a programming language that supports ergonomic construction of this type of data -- so may be it doesn't come up so much in practice? May be that's wishful thinking...

bakpakin commented 10 months ago

Closing this as a WNF. The behavior gets too complicated and annoying, mostly to deal with editors and system that try to change how your source code is formatted.