toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Permit more control characters in comments #924

Closed eksortso closed 1 year ago

eksortso commented 1 year ago

In reconsideration of the limitation of control codes within comments discussed in #567, a case was made to simplify comment parsing. This PR intends to replace those previously imposed restrictions with a simplified approach that permits anything except newlines.

ChristianSi commented 1 year ago

I think I could get on board with this. However, I don't understand why this seems to redefine the concept of a newline. The spec already clearly says: "Newline means LF (0x0A) or CRLF (0x0D 0x0A)."

So, a comment ends at a newline, whether it's LF or CRLF. That's very clear and non-negotiable. So I don't understand what this language regarding "exclude a final carriage return" is meant to say.

Another issue is whether a lone CR (0x0D) should be allowed and ignored in comments. The PR currently does so, but I think it should rather be rejected. Some not fully compliant parsers might treat CR as a linebreak. That would allow smuggling data into a document that a compliant parser ignores as part of a comment. If such CRs are rejected, this risk is reduced, since the compliant parser would then reject the document altogether.

ChristianSi commented 1 year ago

So I would suggest changing the wording to:

All characters except line feed (0x0A) and carriage return (0x0D) are permitted in comments and will be ignored.

The ABNF would need to be adjusted as well.

arp242 commented 1 year ago

Also, we should use U+ syntax instead of 0x for code points.

The way it's phrased now is kinda confusing; U+0A is the newline; this is already allowed in comments, because every comment ends with a newline.

In general, I don't really see the point of this though; no one raised any practical concerns over this ("I tried to do X but I can't and the specification limits me") and it just seems to introduce needless churn for implementers, who will need to update their implementations, tests, etc. I would not be in favour of merging this.

eksortso commented 1 year ago

@ChristianSi What I was attempting to do was not redefine the concept of newlines. But let me come back to that.

I expect comments to ignore everything to the end of the line. For practical purposes, that means starting from a free-standing hash character, scanning ahead until the first LF character (or the EOF) is found, and disposing of everything from the hash up to but not including that LF character. And by everything, I mean everything, including CR characters. Call it a "naive" approach to comment disposal.
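For illustration, here is a rough sketch of that naive scan in Python (the function name and structure are just illustrative, not from the spec):

def strip_comments_naive(doc: str) -> str:
    """Discard everything from a free-standing '#' up to, but not including, the next LF."""
    out = []
    i = 0
    while i < len(doc):
        if doc[i] == "#":  # assume we already know we are not inside a string here
            lf = doc.find("\n", i)
            i = len(doc) if lf == -1 else lf  # jump to the LF (or EOF); stray CRs are discarded too
        else:
            out.append(doc[i])
            i += 1
    return "".join(out)

Running it on 'a = 1  # note\r\nb = 2' leaves 'a = 1  \nb = 2': the comment and its stray CR are gone, and the LF survives as the newline.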

Your concern about "smuggling data into a document" would be mostly baseless, except for the existence of comment-preserving parsers. Such a parser cannot use the naive approach described previously, because if the TOML document uses CR LF for newlines, the final CR must be excluded from the comment registered by the parser. The last sentence of my change to toml.md was intended to allow for this. But since it does not seem necessary in retrospect, I'm deleting that sentence.

Strike the entire paragraph, and replace the first paragraph in the Comment section with the following, which simplifies the thing further:

A hash symbol marks the rest of the line as a comment, except when inside a string. All code points except line feeds (U+000A) are permitted in comments.

I explicitly say "code points" here because that includes all characters, all ASCII control codes, all U+everything, except for a U+000A, which is a newline in TOML. It makes no sense to make an exception for U+000D, except to parsers that read all newlines first. And the authors of those parsers should be savvy enough to know what to keep if they're preserving a comment.

@arp242 This PR addresses a number of points raised by @abelbraaksma in #567, specifically getting rid of (almost) all restrictions on characters permitted within comments, because we do not want a document rejected just because a control character shows up in a comment.

And if you've ever copy/pasted Excel cells into a text widget before, you do come across stray CR characters far too often.

arp242 commented 1 year ago

@arp242 This PR addresses a number of points raised by @abelbraaksma in #567, specifically getting rid of (almost) all restrictions on characters permitted within comments, because we do not want a document rejected just because a control character shows up in a comment.

And if you've ever copy/pasted Excel cells into a text widget before, you do come across stray CR characters far too often.

Yes, I read that; but I don't see any practical problems. And both LF and CRLF end of lines are already allowed; if the ABNF excludes that then that's a bug in the ABNF.

eksortso commented 1 year ago

And both LF and CRLF end of lines are already allowed; if the ABNF excludes that then that's a bug in the ABNF.

A CR LF line ending is still permitted, because if you follow the ABNF to the letter, the CR in the newline is considered part of the content of the comment; it's all comment and gets picked up by *non-eol. What remains is the LF, which is still a newline. So the ABNF is not ambiguous. If you are preserving comments, you can exclude the final CR in the comment if the newline immediately follows.
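To make that concrete, a quick sketch that approximates the relaxed comment rule with a regex (assuming CR is admitted by non-eol, as this PR proposes):

import re

line = '# a comment\r\n'
comment = re.match(r'#[^\n]*', line)  # roughly: comment-start-symbol *non-eol, with CR allowed
print(repr(comment.group()))          # '# a comment\r' -- the CR is swallowed by the comment
print(repr(line[comment.end():]))     # '\n' -- what remains is still a valid newline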

ChristianSi commented 1 year ago

OK, I think I can live with lonely CR's being ignored.

And I agree with @arp242 that even the wording

All code points except line feeds (U+000A) are permitted in comments.

is still a bit confusing – after all, if you add an LF to a comment, you don't produce an error, but simply close the comment. Maybe it would be better to express this as something like:

All valid code points are permitted in comments (but keep in mind that a linebreak – LF or CRLF – terminates the comment).

Another issue that needs to be addressed is Unicode validity. Wikipedia writes:

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.

Explaining the calculation in a footnote:

17 planes times 2^16 code points per plane, minus 2^11 technically-invalid surrogates.

What to do with these surrogate code points? They are "technically invalid" and have no place in a well-formed UTF-8 string. Hence they are not allowed in TOML strings, for good reasons. But should their inclusion in a comment be tolerated, or should it cause the document to be rejected as invalid? As I argued earlier, it might be reasonable to allow either behavior, stating something like:

Surrogate code points (U+D800 to U+DFFF) should never be used in a TOML document, not even in comments. Parsers encountering such a code point may reject the document, but don't have to do so.

There may be a few other code points to which the same reasoning applies, but I'll leave that to the Unicode experts.

eksortso commented 1 year ago

@ChristianSi You said:

Maybe it would be better to express this as something like:

All valid code points are permitted in comments (but keep in mind that a linebreak – LF or CRLF – terminates the comment).

That makes sense. In fact, let's remove the parentheses. The definition of a newline immediately precedes this section, so it ought to be clear why we mention LF and CRLF. Here's my reworked version.

All valid code points are permitted in a comment, but keep in mind that a LF or CRLF terminates the comment.

I weighed the value of keeping the word "valid" in there. I was presuming that the document was already verified as being valid UTF-8, meaning no illegal bytes and no byte streams representing surrogate code points (among others) would be present. If a TOML document is read into a string for processing, then validation is performed before the hash symbol is even processed as the start of a comment. But you are coming from a perspective in which the validity of the document's byte stream is yet to be verified. For languages that live close to the bare metal, that could be the case. But surrogates are always invalid in UTF-8, and if they are present in any text document, then that's a bigger problem than a TOML parser ought to handle. Certainly bigger than isolated CRs from Excel.

Invalid byte streams, including surrogate code points, should be rejected outright. Parsers should not tolerate them in comments, as they should not tolerate invalid UTF-8 documents in general. But we can still say "valid" code points, so there's no ambiguity whatsoever.

I will make one modification to the ABNF: rename non-eol to non-linefeed. It isn't used anywhere else, and the new name doesn't presume the line ending. I tested it in Instaparse Live, which was smart enough not to suck in the CR in the CRLF into the comment.

hukkin commented 1 year ago

As an implementation maintainer I'm slightly opposed to this change simply because it seems this isn't something that users are asking for but rather something that toml-lang contributors think would be an improvement. Seems like potential busywork for implementation maintainers.

I'm also concerned about allowing lone CR in comments. Assuming that text editors render lone CR as a line ending, I think it creates an opportunity to craft valid TOML documents that look different to human eye than a compliant parser. For example:

# All line endings in this document are lone CRs
a = "hello"
b = "world"

Assuming that all line endings in the document are CRs, the document is still valid. It is a single line TOML document consisting of one comment. I.e. the content, when read by a compliant parser is an empty mapping {}. To human eye it looks like {"a": "hello", "b": "world"} though.
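A quick way to see the discrepancy, sketched in Python (the parsing behavior described in the comments is the one this PR proposes, not that of any particular implementation):

doc = '# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r'

# With lone CRs permitted in comments, the scan for a terminating LF never succeeds,
# so the whole document is one comment line and would parse to an empty table.
print(doc.count("\n"))   # 0 -> a compliant parser sees a single line
print(doc.splitlines())  # an editor that treats CR as a line break shows three lines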

eksortso commented 1 year ago

Assuming that all line endings in the document are CRs, the document is still valid

The only places where I've seen single CRs as line endings are in text documents on pre-OSX MacOS systems and in text cells in Excel. Since we are not creating the TOML standard for twenty-year-old Macs, uncoupled CRs created by text editors ought to be a rare occurrence.

When comments were first defined in TOML, there were no restrictions on control codes in comments. That was changed a few years ago when all control codes except horizontal tabs were excluded.

Was this a burden for parser writers to handle when it was introduced? A little, though I'd have to check if they were doing it right. Was it a limitation that users were asking for? It was not, based on what I've read here.

Will existing parsers have to relax these limitations if this PR is merged? Well yes, but they would also be making changes to implement other new features in the future TOML v1.1.0.

Do I have a stake in this? Just in how users copy whole Excel cells into text widgets and the pain I've experienced in stripping loose CRs out of database columns when they are not properly removed or converted. Naive users won't be flummoxed by CRs in comments any longer. This PR doesn't address single CRs showing up anywhere else though.

hukkin commented 1 year ago

The only places where I've seen single CRs as line endings are in text documents on pre-OSX MacOS systems and in text cells in Excel. Since we are not creating the TOML standard for twenty-year-old Macs, uncoupled CRs created by text editors ought to be a rare occurrence.

I agree lone CRs are a rare occurrence. My concern is someone with malicious intent creating a TOML document that looks different to human eye compared to a compliant parser. Status quo, where the compliant parser errors, may be safer than what this PR proposes.

ChristianSi commented 1 year ago

I think @hukkin's concern is valid if editors tend to interpret lone CRs as newlines. What you see should be what you get, so a TOML file that has different linebreaks in a text editor than what a parser sees is not a good idea.

So what do text editors do? I saved @hukkin's CR-only example from above as a text file:

with open('test.txt', 'w') as f:
    f.write('# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r')

and opened it in a few editors, with the following results:

You are all invited to make the same experiment with the editors you commonly use or can think of, and report back.

For me, I must say that two out of four is too close for comfort. Gedit's behavior is especially problematic, and since it's the default GUI text editor in Ubuntu (the one that opens when you double-click a text file), I assume it's pretty widely used.

Therefore I return to my earlier position that lone CRs should remain forbidden in comments, even if all other control characters are allowed.

eksortso commented 1 year ago

@ChristianSi I tried basically the same thing in several Windows 10 text editors. I altered your code slightly, writing the same file in binary mode:

with open('testCR.txt', 'wb') as f:
    f.write(b'# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r')

When I opened them, these were the results:

I also ran type testCR.txt in both cmd and PowerShell 7. In cmd, typing out the document caused each subsequent line to overwrite the beginning of the previous line, adhering closely to CR's original meaning as a carriage return without a line feed. In PowerShell 7, the CR's were treated as linebreaks.

The mixed behaviors are discomforting. So despite my misgivings expressed previously, I'm changing my mind. Lone CRs ought to be forbidden in comments.

But @hukkin's position is to retain the status quo. Keep all control codes except tab out of comments.

Let me run one more acid test. What happens when the DEL control code, U+007F, is present?

with open('testDEL.txt', 'w') as f:
    f.write("""\
# All lines in this document end with a DEL character.\x7f
a = "hello"\x7f
b = "world"\x7f
""")

Notepad and Notepad2 show the DEL character as an unknown character, with a rectangular placeholder. VS Code and Notepad++ show the character as a "DEL" symbol, the former even displaying the symbol in red. micro shows them as whitespace characters. type testDEL.txt showed DEL as a pentagon-shaped placeholder when run in both cmd and PowerShell 7.

@ChristianSi @abelbraaksma You may want to test how these appear in *nix environment text editors, and how they appear when you cat or bat them. (I ought to install WSL2 someday.)

The similarities in DEL's appearances, though less glaring, still suggest that allowing other control characters in comments may still be sensible. But I am starting to think that, per @hukkin, keeping the status quo is the best route for us to take. I'm not about to close this PR though; in fact, I will disallow CR characters and replace the "0x"s with "U+00"s in toml.md where appropriate, per @arp242.

@abelbraaksma I'm beginning to consider closing this PR, so if you want to continue making your case, now's the time to do it.

As for anyone else with a stake in this, including @pradyunsg @BurntSushi @mojombo, what are your thoughts?

marzer commented 1 year ago

I haven't really been following this discourse, but FWIW my position is this:

abelbraaksma commented 1 year ago

@abelbraaksma I'm beginning to consider closing this PR, so if you want to continue making your case, now's the time to do it.

@eksortso It's been a while since I wrote my original opinion. But from your comments above, I think we have basically two things to consider:

Thoughts on line endings

I think the first point above belongs to editors. However, if you compare it to some W3C standards, they are liberal in what they accept, but require normalization. That is, they allow CRLF, lone CR, lone LF. Not all programming platforms do this automatically, though (I mean, in general, not specific to XML or anything W3 related).

However, the algorithm is ridiculously simple:

Henceforth, line endings are then normalized, and what we call a newline or carriage return in Unicode is interpreted as such and always starts a new line.
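A minimal sketch of such a normalization pass, assuming the W3C-style rule that CRLF and lone CR both become LF (not something TOML currently specifies):

def normalize_newlines(text: str) -> str:
    # Replace CRLF first so it collapses to a single LF rather than two line breaks,
    # then map any remaining lone CR to LF as well.
    return text.replace("\r\n", "\n").replace("\r", "\n")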

In practice, this would mean removing the following from the ABNF:

newline =  %x0A     ; LF
newline =/ %x0D.0A  ; CRLF

and making newlines implicit (i.e., part of the prose, not the ABNF). Or, conversely, since alternatives in parsing are ordered, this could work, I think (which, btw, already suggests that the current code is actually incorrect?):

newline =  %x0D.0A  ; CRLF
newline =/ %x0D     ; CR
newline =/ %x0A     ; LF

Thoughts on control characters

Currently, it is clear to me that comments are overly restrictive (i.e., control characters are absent), but also too lenient (illegal Unicode characters are allowed). The range, as it currently stands, is:

comment-start-symbol = %x23 ; #
non-ascii = %x80-D7FF / %xE000-10FFFF
non-eol = %x09 / %x20-7E / non-ascii

comment = comment-start-symbol *non-eol

If we expand this, we get:

allowed-comment-char = %x09 / %x20-7E / %x80-D7FF / %xE000-10FFFF

Let's break this down:

  1. Control characters (the current discussion) are disallowed, but my point in the original issue was: "encounter #, then read to EOL" is easier to understand, and at least parsers won't need to raise to a user something like: "hey, you wrote a comment, and we know you just want to ignore whatever is in there, but we cannot do that, there are invisible characters that we don't like, please remove them".

  2. %7F is the DEL character. This is just a valid Unicode character, albeit without a glyph, but that's true for so many. I'd just allow it.

  3. %D800 - %xDFFF, the so-called "surrogate block". These have special meaning, but only in UTF-16, when encountered as literal bytes in the stream. ~However, they themselves can be encoded just fine, both in UTF-16 and UTF-8, so there is no specific need to disallow them.~ see below for a better explanation.

  4. %FFFE and %FFFF. From all the noncharacters in Unicode, these are special, as they are used as BOM (byte order mark). It is certainly discouraged to use these in any text document. Even in UTF-8, it often appears in the start to signal that the document should be interpreted as UTF-8 and not ASCII or Latin-1 or something. But, if appearing at the start of a document, they may be special, there is nothing that disallows UTF-8 to encode this, even though the characters themselves have no meaning (unless software assigns special meaning). This used to be different, but has changed long ago, see section 9.3 in these minutes: https://www.unicode.org/wg2/docs/n2403.pdf.

  5. Other noncharacters: basically anything that ends in FFFE or FFFF is considered a noncharacter in Unicode. I.e. %2FFFE and %10FFFE are noncharacters. Again, their use is discouraged, because both UTF-16 and UTF-32 use these for endianness detection (i.e., FFFE means big endian and FEFF means little endian). But again, in UTF-8, there's nothing wrong in encoding these, it's just bad practice to use them.

  6. %FFFD the Replacement Character. This one is currently allowed, but problematic in many ways. A lot of editors, when encountering invalid encodings (not to be confused with invalid code points), replace these with this character. It is used for cases where the parser basically says "I don't know what you mean". Since it's already allowed, it should stay that way, I guess.

  7. %00. The NUL character is allowed in Unicode, but many standards disallow it explicitly. The reason is simple: a lot of programming languages represent a string in memory using a NUL delimiter, like C/C++. While this may not be the case for many other languages, there have been recorded cases where DoS and buffer overflow attacks were successful using NUL in Unicode strings. TLDR: I'd vote for certainly still disallowing this character anywhere in the document.

  8. %85, %2028 and %2029 (see https://en.wikipedia.org/wiki/Newline#Unicode). Different standards give these different meanings. %85 is NEL and that code point is special as it maps to some old EBCDIC newline character. The others were "invented" for Unicode and I've rarely seen them in practice. Since we currently don't consider these newlines, I don't think we can change that (or should even consider).

  9. %0B and %0C are Vertical Tab VT and Form Feed FF. While modern editors mostly ignore these (and just display some replacement char), there will be editors out there that "correctly" interpret these the ASCII way. That is: apply a vertical tab (never seen this in practice) or a form feed (new line/page/para). It is probably best to continue to disallow these anywhere in the TOML document.

Conclusion?

Sorry for the long post (it didn't set out that way!). Basically, we have this situation:

My vote: let's take the union of these two and simplify parsing. A comment can just contain any Unicode character, with just two exceptions: line endings (including FF and VT), and NUL. I.e., it would become this:

allowed-comment-char = %x01-09 / %x0E-10FFFF

Or alternatively (that is, allowing FF and VT, as these are really just a remnant of the past):

allowed-comment-char = %x01-09 / %x0B / %x0C / %x0E-10FFFF

Meanwhile, I do think we should correct the newline handling.

I realize this may be a little too lenient for some, but I see no harm in doing so. Any existing correct TOML file out there still parses just fine, and once this is adopted, parsers won't have to treat comments any differently than "ignore from # to EOL".

ChristianSi commented 1 year ago

Hmm, I don't think we should allow lone CR's as linebreaks – quite simply, since we had discussed that in the past and decided against it. I'm not sure where that was, but anyone willing to search should be able to find it. And just reversing one's position every few years without really new arguments having come up is not a good thing.

More specifically, there are currently no TOML files that use CR's as linebreaks. And since this linebreak convention isn't used anywhere anymore, except maybe on some computers that have long ago ended up in museums, there is just no reason why they should ever come into existence.

As for @abelbraaksma's suggestions on allowed-comment-char, the first variant seems fine to me. But CR shouldn't be interpreted as linebreak, it should simply remain forbidden (and hence trigger an error, just like FF, VT, and NUL).

abelbraaksma commented 1 year ago

@ChristianSi thanks for clarifying. I'm not aware of the old decisions, but I wholeheartedly agree with staying consistent.

I have seen such files in the wild, usually as a result of bad switching between platforms, or editors mistreating LFs (try opening such files in Notepad, then manually adding newlines on top of those).

But bad editing is no reason for allowing such things, I agree. And the times when DNX had LFCR (as opposed to CRLF) are indeed far in the past.

eksortso commented 1 year ago

I'll use @abelbraaksma's first allowed-comment-char variant. So we will be disallowing NUL, CR, FF, VT, and LF in comments (though the last one won't generate an error because it would simply end the line).
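For concreteness, a rough sketch of the resulting check under that first variant (allowed-comment-char = %x01-09 / %x0E-10FFFF); the names here are illustrative only:

# NUL, VT, FF and CR are rejected inside a comment body; LF never appears in the body,
# because it is what terminates the comment in the first place.
DISALLOWED_IN_COMMENT = {"\x00", "\x0b", "\x0c", "\x0d"}

def comment_body_is_valid(body: str) -> bool:
    return not any(ch in DISALLOWED_IN_COMMENT for ch in body)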

The matter of normalizing line endings (just LF and CRLF in TOML's case) is appealing to me, especially since we're going to someday form an RFC based on the TOML spec, per #870. But seeing as text editors can unintentionally mix line endings, I won't push it.

eksortso commented 1 year ago

A few more notes. I do not like surrogate code points being encoded in UTF-8, because it is in fact illegal to have such byte streams in UTF-8. And in fact, we banned surrogates from TOML documents in #619 three years ago with the introduction of non-ascii in the ABNF.

But to simplify comment scanning while disallowing the historic line-breaking ASCII codes (please help: what's a better name for the collected set of codes CR, FF, VT, and LF?), I'll use the new allowed-comment-char = %x01-09 / %x0E-10FFFF, and relocate non-ascii elsewhere, closer to where it is first used.

Expect a little rearrangement.

eksortso commented 1 year ago

And now that I think about it, I ought to strip out the word "valid" in "All valid code points" since we're allowing invalid code points in comments now. But somebody give me a better description of what %x0A through %x0D should be called collectively.

abelbraaksma commented 1 year ago

I wrote this about surrogate codepoints:

However, they themselves can be encoded just fine, both in UTF-16 and UTF-8, so there is no specific need to disallow them.

~This isn't (entirely) correct, apologies~ it's just wrong, except for the conclusion. I remembered things wrongly and didn't do enough research, apologies. The short summary of a long reading of specs:

What does this mean for us?

Not much, really. TOML currently requires encoding as UTF-8. What this means is that it is impossible to encounter surrogate characters.

If you do encounter them, you'll encounter them prior to reading them as UTF-8. By the time your stream turns from a byte stream into a proper UTF-8 stream, those (illegal) surrogate pairs are no longer there; they must have been replaced by U+FFFD (the replacement character) or, alternatively, have raised an error: "Illegal file encoding detected", or something like that.

So, what does this really mean for us?

We should not treat surrogate characters as special. They don't exist. The only place where we should disallow them is in escape sequences. I.e., you should not be allowed to write \uD800 in a string (and we do, we actually point to the definition of Unicode scalar values).

If we still have remnants of ABNF dealing with surrogates, we should really (probably?) remove them. Or we should remove the requirement for TOML being valid UTF-8, in which case it makes sense to embed the Unicode rules in the spec.


For reference, this FAQ summarizes the above as well: UTF-8 FAQ

abelbraaksma commented 1 year ago

The above really is an in-depth answer to:

I do not like surrogate code points being encoded in UTF-8, because it is in fact illegal to have such byte streams in UTF-8. And in fact, we banned surrogates from TOML documents in https://github.com/toml-lang/toml/pull/619 three years ago with the introduction of non-ascii in the ABNF.

The issue you refer to makes the wrong assumption. It conflates "UTF-16 surrogate code-points" and "UTF-8 encoding". Furthermore, UTF-16 cannot represent these code-points either, but as code units in UTF-16 they have meaning. These code-points are deemed non-representable. They don't even have a table in Unicode. Quote:

Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range

And:

For a complete understanding of high-surrogate code units, low-surrogate code units, and surrogate pairs used for the UTF-16 encoding form, see the appropriate sections of the Unicode Standard, online a

Note that this second quote talks about code units, not code-points. The latter "have no interpretation", the former do.

We should be rather careful about our wording when it comes to this. And perhaps we should even add a Note to the ABNF (it's quite common to have this type of note in specs), explaining that any TOML byte stream is to be interpreted as a valid UTF-8 byte stream and that, as such, only valid code units should be considered; invalid code units should, as per the Unicode spec, be considered as U+FFFD, or raise an exception.

This includes the many other, potentially invalid encodings one may encounter in a UTF-8 document. In fact, we should just defer to the Unicode spec itself here.

abelbraaksma commented 1 year ago

I could have been briefer in my response. In #619 we did not ban surrogates, they were never allowed to begin with. We merely added a redundancy in the spec, a rule about a condition that didn't exist ;).

please help: what's a better name for the collected set of codes CR, FF, VT, and LF?

vertical-whitespace-characters? Hmm, yeah, awkward however you try to summarize this set 😆.

ChristianSi commented 1 year ago

@abelbraaksma: Thanks for the detailed explanations!

And perhaps we should even add a Note to the ABNF (it's quite common to have this type of note in specs), explaining that any TOML byte stream is to be interpreted as a valid UTF-8 byte stream and that, as such, only valid code units should be considered; invalid code units should, as per the Unicode spec, be considered as U+FFFD, or raise an exception.

Though the spec already says "A TOML file must be a valid UTF-8 encoded Unicode document", I'm pretty sure that a little further explanation cannot hurt. After all, not everybody will be able to hear @abelbraaksma's explanations or read through all the Unicode specs!

So it might be a good idea to add a second sentence along the lines of:

Specifically this means that, should a file contain byte sequences that cannot be decoded as valid Unicode code points (such as code units in the Surrogates Area), the file must be rejected or these byte sequences must be replaced with the replacement character (U+FFFD), as per the Unicode spec.

Though I don't think "valid Unicode code point" is a well-defined term? So maybe it's better to talk about "Unicode Scalar Values", which is defined in the Unicode spec?

abelbraaksma commented 1 year ago

Though I don't think "valid Unicode code point" is a well-defined term? So maybe it's better to talk about "Unicode Scalar Values", which is defined in the Unicode spec?

Scalar values have no direct meaning from the point of view of a UTF-8 encoded document. They are merely a scalar representation of a code-point, specifically introduced to aid in defining escape sequences.

Yes, terminology is tricky. The surrogate code-points are certainly valid code-points, after all, they are defined. However, they have no representation, so they cannot and must not exist in a UTF-8 byte stream (as explained at lengths above).

Another way of reading "valid" in this context is that, because they cannot be represented, they are invalid, but I feel that that's requiring a little too much interpretation from the reader.

Note that different libraries use the word "code-point" differently and it depends on contexts. For instance in .NET, when normalizing, it can throw an error "invalid code-point". However, this is contextual: in the realm of normalizing a string, unassigned code-points may be treated as invalid. The UTF-8 encoding would only throw an exception on encountering invalid byte sequences (which includes the surrogate pairs).

Variation sequences could, for instance, also be invalid. But we should not concern ourselves with that. Even with invalid variations, the string would still be a valid UTF-8 byte sequence.

We could link to the term Well-formed code-unit sequence from the spec, if we want to be precise.

For fun, Unicode itself is quite precise. So much so, that they use the following definition for Surrogate character:

Surrogate character: A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.


Anyway, TLDR etc etc, what about this?

Specifically this means that, should a file as a whole not form a well-formed code-unit sequence, as described in detail here (such as the byte sequence containing code units attempting to represent surrogate code points, or byte sequences that can otherwise not be validly decoded), the file must be rejected or these byte sequences must be replaced with the replacement character (U+FFFD), as per the Unicode spec.

abelbraaksma commented 1 year ago

Also for fun, in case anyone thought CR or LS was entirely dead, this is what Visual Studio offers when a file has inconsistent line-endings (and yes, this was a TOML file that got garbled from going back and forth between Windows and Mac machines).

And no, I'm not suggesting we should suddenly start supporting LS or PS, which would be rather exotic.

eksortso commented 1 year ago

It took me a long time to compose a response to recent messages. And by the time I was able to post my response, another handful of responses came through. So I'm breaking my response up into smaller chunks.

@ChristianSi wrote:

@abelbraaksma: Thanks for the detailed explanations!

Thanks as well @abelbraaksma! That was a very deep dive into Unicode terminology and usage. Somewhat distracting, but enlightening. We'll have to take what we learned from this and express it as laconically as we can within the spec to our less technically-oriented audience. (I don't do laconic well, but I try!)

I'm using the Unicode Glossary for my interpretations of the terminology that we're discussing. I've got Joel Spolsky to thank for my early Unicode clarity, and I've forgotten most of it, it seems!

@abelbraaksma wrote:

And perhaps we should even add a Note to the ABNF (it's quite common to have this type of notes in specs), explaining that any TOML byte stream is to be interpreted as a valid UTF-8 byte stream and that, as such, only valid code units should be considered, and invalid code units should be, as per the Unicode spec, be considered as U+FFFD, or raise an exception.

The ABNF already contains the following notice. Its use of "Unicode Code Point" is still acceptable; all the code points encountered after successful parsing would be scalar values.

;; Although a TOML document must be valid UTF-8, this grammar refers to the
;; Unicode Code Points you get after you decode the UTF-8 input.

But it doesn't say what to do with invalid code points or incorrect byte streams. So let's add a version of @ChristianSi's response to toml.md. I have something written up. But let me review the last few responses first.

eksortso commented 1 year ago

From our perspective, injecting U+FFFD to replace bad byte sequences or code points is not a good idea. For many years we have held implicitly, with notable exceptions, that the proper response to bad TOML is not warnings or substitutions but errors, and as this has taken root, we have come to believe that the more helpful and descriptive the error messages are, the better. That ought to be our standard operating procedure, and we should say so in no uncertain terms.

The spec routinely refers to errors being produced in exceptional situations, but we see the word throw written in a few places. It may be worth making our own terminology consistent, going forward.

eksortso commented 1 year ago

We are drifting far away from the original purpose of this PR. Let's get this done, and get back on track.

I took @ChristianSi's recommendation into account and revised the Spec section of toml.md. I moved the case-sensitive condition down and made the expected format of the document its own paragraph. It also emphasizes that errors must be produced if the document isn't valid UTF-8.

Here is the proposal. What do you think?


Spec

A TOML file must be a valid UTF-8 encoded Unicode document. If a TOML parser encounters an invalid byte sequence, or any code point not valid in UTF-8 (such as a surrogate), then it must reject the document and produce an error.
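As a purely illustrative sketch of what that requirement would mean for a parser that starts from raw bytes (not part of the proposed wording):

def load_toml_text(raw: bytes) -> str:
    try:
        # Strict UTF-8 decoding already rejects malformed byte sequences, including
        # encoded surrogates, so no U+FFFD substitution ever takes place here.
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid TOML document: not valid UTF-8 ({exc})") from None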

abelbraaksma commented 1 year ago

From our perspective, injecting U+FFFD to replace bad byte sequences or code points is not a good idea. For many years, we have held implicitly with https://github.com/toml-lang/toml/issues/538#issuecomment-397857741, that the proper response to bad TOML is not warnings or substitutions but errors, [...]

Yes, but we are not talking about TOML here. We are talking about UTF-8 encoding. Even if you tried, you could not prevent this behavior if your language of choice just reads a file as a string or UTF-8 stream.

It's the standard before the standard. Just like we don't define how TOML is saved on disk or in memory, we don't define how bad UTF-8 is treated, as that is already part of UTF-8 itself.

Not even XML, HTML or any other standard depending on UTF-8 says how the encoding/decoding process should take place. We just say what encoding TOML should be in.

We should just say: it should be valid UTF-8. Period. I tried to capture that in my last paragraph (the suggested wording there). The explicit mention there on surrogate pairs is only in reference to the Unicode specification on UTF-8. It doesn't change how, for about two and a half decades, UTF-8 encoding/decoding has been done (well, for conforming implementations, that is).

Try to load invalid UTF-8 in (almost) any editor. You'll get the U+FFFD character. We can try to force different behavior, but I think we should just go with the flow here and stick with accepted practices.

Also: talking about clarity of errors and the like: I don't think an encoding error is going to be clear for anybody. I.e., he or she would open the file in an editor and there's no error anymore. If it comes to clarity, I think just sticking to what Unicode itself says, and referring to that (and possibly repeating it) is really the best we can do.

abelbraaksma commented 1 year ago

PS: in the end I'm fine with whatever is going to be decided here. Don't take my point of view as gospel (I'm sure you won't, but just wanted to emphasize that). I tend to be a bit picky when it comes to standards, it's what you get when you spend too long in standardization committees. That's not a recommendation! 😆

eksortso commented 1 year ago

I tend to be a bit picky when it comes to standards, it's what you get when you spend too long in standardization committees. That's not a recommendation! 😆

@abelbraaksma Man, your insight is going to become so valuable when we start work on the RFC, per #870. We'll need to dot every I, cross every T, and hope we don't include Oxford commas. Or don't exclude them. Which are we doing? 😝

Point is, you're a big help, and we truly appreciate that here in our standard-setting microcosm! We have a lot of places in the specification where that attention to detail will make a vital difference.

eksortso commented 1 year ago

If we state that TOML should be (not must be) valid UTF-8, any problems that parsers may face from trying to mess with invalid file formats will be up to their maintainers to correct. But if we said "must," then the ABNF (and its assumption of successfully decoded text) would make a lot more sense. So I will keep the "must" in the valid UTF-8 requirement. I'll remove the rest. But I'll keep that line first.

Try to load invalid UTF-8 in (almost) any editor. You'll get the U+FFFD character. We can try to force different behavior, but I think we should just go with the flow here and stick with accepted practices.

Editors have their own behaviors when handling bad byte sequences. But a parser that loads a byte stream may not immediately decode it before starting to parse, and may apply its own behavior when it encounters bad input. But on the other hand, I may be underestimating the processing power of modern embedded systems.

Also: talking about clarity of errors and the like: I don't think an encoding error is going to be clear for anybody. I.e., he or she would open the file in an editor and there's no error anymore. If it comes to clarity, I think just sticking to what Unicode itself says, and referring to that (and possibly repeating it) is really the best we can do.

True.

abelbraaksma commented 1 year ago

If we state that TOML should be (not must be) valid UTF-8

While not really OT, another typical way of phrasing this is to say that the spec is based on valid Unicode and that, at a minimum, UTF-8 must be supported to be a compliant parser. Whether parsers allow a wider range of encodings is really up to the implementers and “outside the scope of this specification”. This already happens in practice anyway, and is a good thing.

You can still say that a valid, interchangeable file must be encoded in valid UTF-8.

But that’s probably something for another thread and discussion :).

eksortso commented 1 year ago

Here are my notes on the most recent commit.

In toml.abnf, I moved non-ascii back into the definition of allowed-comment-chars. This way, we avoid mentioning the surrogate code point range. We make that doubly clear.

In toml.md, I shifted the UTF-8 document requirement up to its own line outside the bullet list in the Spec section. And I reworked the Comment description. Instead of naming the excluded code points, I describe why they're excluded and just list them.

Does the following language make sense?

A hash symbol marks the rest of the line as a comment, except when inside a string. Comments may contain any valid code points except a limited subset of ASCII control codes that could cause problems during editing or processing (specifically, U+0000, and U+000A to U+000D).

ChristianSi commented 1 year ago

Another little clean-up proposal while we're at it: the spec talks several times about "Unicode character(s)", but uses the alternative "UTF-8 character(s)" just once. I would propose to change the latter to "Unicode character" as well, as the specific mention of UTF-8 seems rather pointless here.

eksortso commented 1 year ago

Another little clean-up proposal while we're at it: the spec talks several times about "Unicode character(s)", but uses the alternative "UTF-8 character(s)" just once. I would propose to change the latter to "Unicode character" as well, as the specific mention of UTF-8 seems rather pointless here.

I replaced "UTF-8 character" with "Unicode character". And in the literal string section, I replaced the original phrase with "Thus, for binary data, it is recommended that you use Base64 or another suitable text encoding" (my emphasis). The word "text" replaced "ASCII or UTF-8". The context is binary-to-text encoding, so this is appropriate.

eksortso commented 1 year ago

Dang, that bot is good. I had [scalar values] split across separate lines, and the bot meshed those lines together. I didn't think there was a difference.

Anyway, what do you think of the current state of the PR?

ChristianSi commented 1 year ago

@eksortso:

Dang, that bot is good. I had [scalar values] split across separate lines, and the bot meshed those lines together. I didn't think there was a difference.

There shouldn't be a difference, and I have no idea why the bot did that. Markdown allows links to include linebreaks. But if the bot prefers to keep them on a single line, sure why not...

ChristianSi commented 1 year ago

@abelbraaksma:

We might (emph on ‘might’) specify the range of scalar values in an informational sentence, similarly how you did that above.

I don't think that's necessary, considering that the definition is just one mouse click away. But I wouldn't be opposed either.

eksortso commented 1 year ago

Thanks @ChristianSi @abelbraaksma for the approvals, and for all of your help!

@pradyunsg @mojombo Can you have a look at this now? Is it good to merge?

pradyunsg commented 1 year ago

There's quite a few more changes here than the PR title suggested; I'll try and take a look in the coming weeks.

eksortso commented 1 year ago

@pradyunsg Could we split the changes into two separate PRs? The Unicode language changed a lot. Suppose we could isolate those changes and address them separately?

UPDATE: Already isolated those changes. See #929. If that's merged, then this PR will be greatly simplified.

eksortso commented 1 year ago

I stripped this PR back down to its essentials, that is, the re-allowance of most control characters inside comments. Anything pertaining to Unicode language changes was already moved over to #929, which can be reviewed separately.

@ChristianSi @abelbraaksma Want to take another look?

@pradyunsg Can you review this PR again?

eksortso commented 1 year ago

@pradyunsg This is a lot simpler since the last time you looked at it. We're allowing all but five (UTF-8-valid) code points in comments. Is this good to merge?

eksortso commented 1 year ago

@pradyunsg Could you please review this and make a decision? It's been two months since the last reminder.

pradyunsg commented 1 year ago

Thanks @eksortso! ^.^

abelbraaksma commented 1 year ago

Well done, great work!