Clarify that key uniqueness depends only on binary representation, recommend normalization

SnoopJ commented 1 year ago

I've just learned about #891 and I'm excited to see that the TOML specification is improving Unicode support.

Do I understand right that this changeset makes no recommendations for implementers when it comes to equivalence of keys? I see a note on normalization on the related issue, but if I understand the PR correctly, keys that are "equivalent" under one of the normalization forms of UAX#15 will be distinct under the specification unless their binary representations are identical.

That note suggests that a warning/suggestion for implementers might be added to the spec, but it looks like that never happened. This issue is a request that such a note be added to at least make implementers (and users of parsers that don't bother with normalization) aware of the potential confusion of keys, as in the example of ñaña (NFC form, 6 bytes in UTF-8) vs. ñaña (NFKD form, 8 bytes in UTF-8).
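For readers who want to see the difference concretely, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are only illustrative, and NFD is used for the decomposed spelling (for this particular word NFD and NFKD produce the same code points).

```python
import unicodedata

# Precomposed spelling: U+00F1 (LATIN SMALL LETTER N WITH TILDE) twice.
nfc = "\u00f1a\u00f1a"
# Decomposed spelling: "n" followed by U+0303 (COMBINING TILDE) twice.
nfd = "n\u0303an\u0303a"

# The two strings render the same but are different code point sequences.
same_code_points = nfc == nfd

# Their UTF-8 encodings differ in length as well.
nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))

# Normalizing either one to a common form makes them compare equal.
equal_after_nfc = unicodedata.normalize("NFC", nfd) == nfc

print(same_code_points, nfc_bytes, nfd_bytes, equal_after_nfc)
```

A parser that compares keys by code points or bytes would treat these two spellings as different keys.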

pradyunsg commented 1 year ago

See also #954

eksortso commented 1 year ago

We did discuss key comparisons and issues regarding normalization in great detail in the discussion on #941. It's a very deep issue! In fact, some languages and tools tend to perform normalization implicitly.

@SnoopJ I personally like the idea that keys be normalized during comparisons, but we shouldn't make normalized or binary comparisons mandatory in all cases. In old-fashioned parlance, that's not a MUST nor a MAY, but a SHOULD.

During that past discussion, I suggested something like the following be added to toml.md, though this is different from the original in that I heavily favored NFC in my original draft:

Because some keys with different code points look the same, parsers should compare keys using a specific normalized form of the keys, rather than just using binary comparisons.

What are your thoughts on this text? Could we make the point more clear?

And also, what about including this line for TOML emitters?

Likewise, encoders should write keys using a specific normalized form.

Thoughts?

ChristianSi commented 1 year ago

@eksortso: I like the sentence about parsers, but instead of "a specific normalized form" I'd write, "Normalization Form C (NFC) or Normalization Form D (NFD)". The NFK... forms normalize too aggressively (equating … with ..., say), hence they must not be used!
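The ellipsis case can be checked with a short sketch, again assuming Python's standard `unicodedata` module. The ligature example is an extra illustration that is not from the thread.

```python
import unicodedata

horizontal_ellipsis = "\u2026"  # the single character "…"
fi_ligature = "\ufb01"          # LATIN SMALL LIGATURE FI (extra illustration)

# The canonical forms leave compatibility characters untouched.
nfc_ellipsis = unicodedata.normalize("NFC", horizontal_ellipsis)
nfd_ellipsis = unicodedata.normalize("NFD", horizontal_ellipsis)

# The compatibility forms fold them into look-alike ASCII sequences.
nfkc_ellipsis = unicodedata.normalize("NFKC", horizontal_ellipsis)
nfkc_ligature = unicodedata.normalize("NFKC", fi_ligature)

print(nfc_ellipsis == horizontal_ellipsis)  # canonical form keeps "…"
print(nfkc_ellipsis == "...")               # compatibility form rewrites it
```

So under NFKC or NFKD a key containing "…" would collide with one containing three full stops, which is the loss of information being objected to.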

I'm less sure the sentence about encoders is needed, but I guess it makes sense.

marzer commented 1 year ago

I remain opposed to recommending any specific normalisation in the spec for all of the reasons I laid out in #941 (none of which were satisfactorily rebutted IMO). The biggest one is still the problem of normalisation requiring a third-party dependency for low-level languages like C and C++; that's a non-starter automatically for most users when choosing a library. Requiring users to "just link with ICU" is not a solution to this problem because it still complicates their toolchain and is not necessarily always available.

And again I'd like to note some caution around the RFC-like "SHOULD" parlance - we either use the RFC terminology, in which case we must state so clearly in the spec document, or we use plain English. The word "should" carries quite different connotations between the two contexts. Whichever we choose, it must be consistent throughout the whole document.

My end-game fear is that if some form of normalisation is recommended in the spec, and that normalisation is tested in the TOML-test suite, then implicitly it becomes a requirement. The simplest and most portable path forward is to do as OP is suggesting to simply clarify that key comparison is ordinal, buyer beware.

eksortso commented 1 year ago

The NFK... forms normalize too aggressively… hence they must not be used!

@ChristianSi I take it you've got some war stories to tell about that? How are NFKD and NFKC, as defined by Unicode, generally received by people who look at them for normalizing identifiers? What consequences did they face when they chose to use these forms?

Unless convinced otherwise, I say that the specific normalization form chosen is not our choice to make, except that whatever choice is made ought to be standardized. Maybe there's another strict standard I've overlooked that would work just as well. Or, UTF-8 bytes could be the standard, which would still be allowed, even if it's not recommended.

@marzer I'll review your arguments later, and do note that my "old-fashioned parlance" comment was written with your past comments in mind! ;) But to my English-trained ears, the word "should" sounds like the preface to a recommendation, not to a constraint. It invites the question "Well why should we??" from punks who don't want to be constrained without reason.

(Edited. The newer question sounds more natural.)

marzer commented 1 year ago

But to my English-trained ears, the word "should" sounds like the preface to a recommendation, not to a constraint. It invites the question "Why not?" from punks who don't want to be constrained without reason.

Interesting; to me "should" sounds much too strong for something that is a mere recommendation. Wonder if it's a dialectical thing. In any case, I am fearful of any normalisation ending up in the test suite because I won't be implementing it in TOML++ any time soon and am sceptical that it actually solves a real problem.

ChristianSi commented 1 year ago

@eksortso Whoever gave you the impression that all normalization forms should work equally well? That's not how it works. The Unicode people themselves write in their Unicode Normalization FAQ:

Programs should always compare canonical-equivalent Unicode strings as equal.... One of the easiest ways to do this is to use a normalized form for the strings: if strings are transformed into their normalized forms, then canonical-equivalent ones will also have precisely the same binary representation. The Unicode Standard provides two well-defined normalization forms that can be used for this: NFC and NFD.

See how only the two shorter-named forms are listed here?

They go on:

For loose matching, programs may want to use the normalization forms NFKC and NFKD, which remove compatibility distinctions. These two latter normalization forms, however, do lose information and are thus most appropriate for a restricted domain such as identifiers.

Clearly, TOML keys are not a "restricted domain" since quoted keys can contain arbitrary characters. So these lossy forms (which treat merely similar strings as if they were identical) are not appropriate for our use case.

eksortso commented 1 year ago

@eksortso Whoever gave you the impression that all normalization forms should work equally well? That's not how it works.

@ChristianSi I never made that claim. Don't put words in my mouth. What I said was, it's not our position to say what normalization ought to be used. My restriction was that it must be standardized, so that its behavior is predictable. Maybe NFKC and NFKD are bad because they're overly aggressive, and they obviously would be, for strings that aren't identifier types like keys and table names!

eksortso commented 1 year ago

I am fearful of any normalisation ending up in the test suite because I won't be implementing it in TOML++ any time soon and am sceptical that it actually solves a real problem.

@marzer, well, there are several places in toml.md right now that use the words "should" and "recommended." Scan through those instances, and check BurntSushi/toml-test to see if these situations are actually tested. And if they are, take it up with the folks who run the test suite.

You may need to do the same thing if normalization gets imposed. And I too don't want it forced, but I do want it recommended, in the most civil manner possible.

marzer commented 1 year ago

there are several places in toml.md right now that use the words "should" and "recommended."

I don't see why it can't be a "may", then? Why "should"? I'm OK with the word "recommended" because that is explicitly a recommendation (i.e. not a stipulation), and sounds similar to using "may". The word "should" can easily be interpreted as being stronger than that, as in "this is how it should be".

Aside from that, I haven't yet seen a satisfying argument for how this is actually any better than doing a simple ordinal comparison, frankly. Seems like an awful lot of intellectual chest-puffing without actually solving a real, extant problem. Again: Why not just say "key comparisons are ordinal, exercise caution"? It's a config language meant to be consumed in technical domains, after all, not a written language. There's already a contract between those writing the configs and those consuming them; surely unicode normalization concerns will be domain-specific. (And easily solved at the text-editor level where necessary?)

marzer commented 1 year ago

And if they are, take it up with the folks who run the test suite.

This is the wrong attitude. Language specs usually don't exist in a vacuum; TOML is no exception. Implementation concerns should be taken into account at the spec level, not at the test-suite level. Seeing as I'm the only implementer who regularly contributes to discussions here, I have to bang that drum.

well, I don't have to, but an implementer needs to contribute to discussions here. Otherwise it becomes design by academia/committee.

arp242 commented 1 year ago

there are several places in toml.md right now that use the words "should" and "recommended." Scan through those instances, and check BurntSushi/toml-test to see if these situations are actually tested. And if they are, take it up with the folks who run the test suite.

I'm not sure off the top of my head whether "should" and "recommended" behaviours are tested; but if they are then they're relatively minor issues which are usually fairly easy to address one way or the other. The whole Unicode issue is much more major. If "recommend Unicode normalisation" ends up in the specification I'll add something to make these tests optional (probably opt-in, rather than opt-out), similar to how the TOML 1.1 tests are opt-in now.

Other than that, I mostly agree with marzer's comments, adding that I feel we can side-step the worst of the entire issue by doing as I outlined in #954. I also wouldn't be in favour of "should" or "recommended" language for something like this: it sounds like a compromise that's "fair" in the sense that it leaves everyone equally unhappy. It's too large of an issue to leave unspecified.

This issue has been debated a few times now in different threads over the last few months, and I don't really feel like doing it again. At this point it's fair to say the discussion is at a stalemate, and I don't really know how to advance it to reach a consensus.

eksortso commented 1 year ago

Maybe we actually can strike a balance. We don't want to say so much that it leaves us straitjacketed in the future. But we can point out some things that aren't immediately obvious, because they do need to be said!

We can just use a single paragraph, at the bottom of the Keys section, just above the Strings section:

Because some keys with different code points look the same, use caution when writing such keys in your TOML documents. Applications and parsers may use NFC or NFD to normalize keys before making comparisons so that canonically equivalent key names are considered the same. Nearly all programming languages have tools to normalize keys, in case implementers wish to do so.
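As a rough illustration of what "normalize keys before making comparisons" could look like, here is a minimal sketch in Python. The helper name `keys_equivalent` is hypothetical and not part of any TOML library; it only assumes the standard `unicodedata` module.

```python
import unicodedata


def keys_equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two key names after normalizing both to the same form.

    `form` is expected to be "NFC" or "NFD"; either choice gives the same
    answer for canonically equivalent strings.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00f1a\u00f1a"  # ñaña with U+00F1
decomposed = "n\u0303an\u0303a"  # ñaña with "n" + U+0303

print(precomposed == decomposed)                 # ordinal comparison
print(keys_equivalent(precomposed, decomposed))  # normalized comparison
```

The original strings are left untouched; only the comparison sees the normalized forms.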

This one paragraph has no SHOULDs in it @marzer, makes note of the issue without enforcing a means of fixing it @arp242, advises using NFC or NFD to make comparisons @ChristianSi, and provides links to resources for those who'd want to delve deeper into the issues. That's all we would need to say. For now, at least.

SnoopJ commented 1 year ago

I should probably have worded my initial request a little more clearly, since it's actually asking two related questions:

1) Should the TOML spec point out the pit-fall? 2) Should the TOML spec make any specific recommendations for implementers?

To me, (1) is an easy add. I personally am not heavily invested in the language used, so long as it warns users and implementers of the potential ambiguity.

It seems that (2) is a more contentious matter, and I don't have much to add to the preceding discussion, except to point out that TOML keys don't appear to meet the standard of "identifier" laid out by TR-31 (specifically because UAX31-R1 is not being satisfied by explicit choice of a "profile").

My main concern with (2) is end-users becoming sensitive to the details of whatever implementation(s) their data may pass through, and even a suggested normalization does not resolve this tension. Perhaps the best thing to do is to go in the other direction and require that implementations SHALL NOT normalize, i.e. double down on the binary representation and the users living with the edge cases just need to deal with that in their own applications?

I think y'all have a better sense of (2) than I do, but looking back, I think I should have filed this as two issues :sweat_smile:

eksortso commented 1 year ago

@SnoopJ Would it suffice, for (2), to only permit normalization to be applied for key and table name comparisons? Some implementers may prefer to keep the normalized key names in memory, to ease key lookups post-processing. But we would never require string values to be normalized, so the data stored in values will be preserved. I made no mention of normalizing string values (as opposed to key and table names) anywhere in the proposed paragraph. Some languages may normalize strings automatically, but those implicit actions fall outside of this specification, and may actually hamper efforts to enforce binary representation.

SnoopJ commented 1 year ago

@SnoopJ Would it suffice, for (2), to only permit normalization to be applied for key and table name comparisons?

That does sound like it restricts the potential confusion to comparisons, rather than confusion about the data itself, which I think establishes the kind of "here be dragons" guide-rails that I had in mind when filing this issue. I'm afraid I don't know enough about implementations to have a particularly authoritative opinion, but the scope of the recommendation is definitely smaller in that case.

marzer commented 1 year ago

@eksortso The paragraph you propose is much better, but you still haven't addressed why we're leaving the door open for any normalisation at all. Why do you want this?

The more I think about it, the more I realise there's really only two paths forward: It should either be a requirement with a specific algorithm, or we should explicitly rule it out with a note that we do ordinal (binary) comparison. Anything else will fragment the ecosystem (as @SnoopJ points out above).

And given my (still unaddressed) objections, ordinal is the only option. It is impractical for me to implement anything else.

Leaving room for normalisation is more and more seeming like a pet idea with no good reasoning behind it, and should be abandoned altogether, IMO.

Would it suffice, for (2), to only permit normalization to be applied for key and table name comparisons?

There is no good reason to do this in a technical domain, especially given the existing contract between config author and consumer.

marzer commented 1 year ago

Perhaps the best thing to do is to go in the other direction and require that implementations SHALL NOT normalize, i.e. double down on the binary representation and the users living with the edge cases just need to deal with that in their own applications?

This is the correct approach.

marzer commented 1 year ago

Something else nobody seems to have considered: if we leave the door open for normalised comparison of keys, we're closing a door for applications that may actually depend on them being compared ordinally. Developers can add normalization if they need it, but they can't take it away if we do it and they don't want it. Actively taking away the agency of developers by being 'too clever' tends to piss them off and send them elsewhere (see: YAML). This will become even worse if it's optional, because now they have to audition various implementations to find one that doesn't do the annoying thing they are trying to avoid, at the potential cost of picking one that is flawed/non-compliant in other ways. Fragmentation.

I can easily envision a scenario where someone would use TOML to configure some text parsing application (e.g. spam/swear filter, linter, regex) and they will likely wish to have normalisation be entirely handled by their application for more fine-grained control. We should not inadvertently block this workflow.
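To make the "developers can add normalization if they need it" point concrete, here is a minimal sketch of an application-level pass over an already parsed document. It assumes the parser returned plain nested `dict` objects with keys compared ordinally; the function name `normalize_table_keys` is hypothetical, and arrays of tables are left out for brevity.

```python
import unicodedata


def normalize_table_keys(table: dict, form: str = "NFC") -> dict:
    """Return a copy of a parsed table with every key normalized.

    Raises ValueError if two distinct keys become identical after
    normalization, so the application decides how to handle the clash.
    """
    result = {}
    for key, value in table.items():
        normalized = unicodedata.normalize(form, key)
        if normalized in result:
            raise ValueError(
                f"keys collide after {form} normalization: {normalized!r}"
            )
        if isinstance(value, dict):
            value = normalize_table_keys(value, form)
        result[normalized] = value
    return result


# A table as an ordinal parser might return it: decomposed key, nested table.
parsed = {"pre\u0301nom": {"valeur": "Fran\u00e7oise"}}
cleaned = normalize_table_keys(parsed)
print(list(cleaned))
```

Because this runs after parsing, an application that wants raw code point keys simply does not call it.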

@ChristianSi, thumbs-down me all you like, but the fact remains that ruling-out normalisation in key comparison is the only portable and flexible path forward. TOML is not the right level at which to solve this problem.

eksortso commented 1 year ago

@marzer, you say "SHALL NOT" normalize is the best approach, which would be a MUST, which would be a REQUIREMENT, which would run counter to the obviousness principle, since a few popular languages normalize their strings by default. You got two thumbs-down for a sound reason.

TOML is not a binary format. We shouldn't force it to be.

And if programmers want to go against default behaviors by forcing ordinal string comparisons on key names and giving their users a bad time, then the onus is on them to communicate their wayward intentions. That's not our problem, and that shouldn't be our testers' problem either.

And we're fighting over something that is the edgiest of edge cases! This is as painful as the internet gets. ñaña is ñaña, and that is a rare but explainable problem to have. Let's save our sanity and keep this in mind.

marzer commented 1 year ago

Sorry, but I disagree for all the reasons I've already laid out. Nothing you've said here satisfactorily counters my concerns. By doing this we're shutting out valid implementation concerns for purely ideological reasons that make no sense in a highly technical domain. I put it to you that you're the one trying to introduce an unexpected 'bad time', because your pet idea is inappropriate in this context. The rest of your argument is just facile wordplay.

And again, you still haven't explained why you're advocating so strongly for normalisation. As far as I can tell, you have no reason for it beyond "it is warm and fuzzy". Please. Why do you want this? It's a bad idea.

You got two thumbs-down for a sound reason.

By two people who aren't maintainers of TOML implementations. In this context I value those opinions very little. I also got a sound endorsement from someone who is, and that's far more meaningful to me.

And if programmers want to go against default behaviors

We decide the default. We pick a developer-hostile one at our peril.

This is as painful as the internet gets. ñaña is ñaña, and that is a rare but explainable problem to have.

Nothing that can't be solved by saying "exercise caution when using non-ascii keys", which is far simpler than an arbitrary and 'optional' solution to someone's pet issue.

marzer commented 1 year ago

Ultimately, the onus is on you to counter this:

Developers can add normalization if they need it, but they can't take it away if we do it and they don't want it. [...] I can easily envision a scenario where someone would use TOML to configure some text parsing application (e.g. spam/swear filter, linter, regex) and they will likely wish to have normalisation be entirely handled by their application for more fine-grained control. We should not inadvertently block this workflow.

I submit that you can't, and thus the entire foundation of your argument is flawed.

arp242 commented 1 year ago

The biggest problem I have is that in some environments the cost is too high. You can implement TOML 1.0 fairly easily without too much code. With required normalisation I'd have to import >30M dependencies for a ~4,000 line TOML library which currently has no dependencies at all. It's not a show-stopper, but I'm not too happy with that balance either as a TOML implementer or user (as in, user of my TOML library; i.e. application developer).

It was decided to have the current Unicode range in bare keys so a Unicode library/database isn't required for a TOML implementation, and now people are arguing normalisation (and thus a Unicode library/database) must be required. Seems rather inconsistent. Either go all-in on Unicode all the way or don't, but in-between has the disadvantages of both.

We can't implement normalisation for all keys as it's not backwards-compatible: we can only do it for bare keys. Unicode normalisation applied to quoted keys would make TOML 1.1 behave different from TOML 1.0. "Unicode normalisation must be applied for bare keys only, but not quoted keys" seems confusing and generally horrible. I don't recall if this was previously brought up (IIRC it wasn't), but this certainly needs careful consideration.

My favourite solution remains rewriting the specification so that normalisation simply isn't needed; the combining tilde in ñaña isn't something that needs to be allowed in bare keys. This makes the entire issue go away.

marzer commented 1 year ago

Unicode normalisation applied to quoted keys would make TOML 1.1 behave different from TOML 1.0. "Unicode normalisation must be applied for bare keys only, but not quoted keys" seems confusing and generally horrible.

This was a concern I raised in #941. It wasn't addressed there, either.

eksortso commented 1 year ago

This is certainly no pet project of mine. If you want to enforce binary equivalence, you go open a PR and fight for it yourself.

I'll stick with this.

Because some keys with different code points look the same, use caution when writing such keys in your TOML documents.

I'm through with this particular topic. Short of libelous claims made against me, I'm keeping quiet. Write your own PR.

marzer commented 1 year ago

I'm fine with that sentence. It doesn't mention normalisation at all, and to any reasonable interpretation implies that keys are compared 'simply' (i.e. ordinally).

libelous claims

Lol. Come on. I asked you to justify why you wanted normalised comparison a number of times and you never answered; don't be surprised if I form opinions in the absence of an answer. It's not an unreasonable question, especially in the face of overwhelming counter-arguments.

eksortso commented 1 year ago

Lol. Come on. I asked you to justify why you wanted normalised comparison a number of times and you never answered

Normalization is going to happen in some cases, even if we don't ask for it. Not in C++ or Python. But whether it does, that requires a knowledge of the implementation language's default behavior during string comparisons.

@marzer You may need to restate some of your "overwhelming counter-arguments."

arp242 commented 1 year ago

All these discussions remind me of the R6RS Scheme standard, which was quite controversial partly due to its Unicode requirement, and this was never really resolved after countless discussions. Scheme solved this with R7RS by splitting things into a "small" and "large" version, but that's not really an option for TOML.

I'm not really expecting a resolution here either; maybe we should consider just backing out the whole Unicode bare keys thing, releasing TOML 1.1, and leaving it for a later version when we have a solution everyone can live with. Right now this is preventing us from releasing other useful improvements. Other than this TOML 1.1 is essentially ready.

ChristianSi commented 1 year ago

I'll comment longer tomorrow or so, but for now, everybody please remember that this has nothing whatsoever to do with bare keys and what's allowed in them. This whole issue applies in exactly the same way to quoted as to bare keys.

And quoted keys allowing arbitrary Unicode have been part of TOML since its very beginnings. I don't think anyone would seriously suggest to remove them?

arp242 commented 1 year ago

please remember that this has nothing whatsoever to do with bare keys and what's allowed in them. This whole issue applies in exactly the same way to quoted as to bare keys.

That is correct, but few people use quoted keys so non-ASCII in keys is fairly rare right now. With bare keys these issues become more pronounced as people will actually start using non-ASCII in keys. I do think these issues are strongly related and need to be considered together. It's not a coincidence that we've only started discussing this after the Unicode in bare keys change was made.

eksortso commented 1 year ago

Looks like I'm stuck in this! 😅

Many people now are looking forward to the new unquoted keys. They may not forgive us for backing out. And although it's not technically an issue about quoted vs. unquoted keys, it's unquoted keys that are paving this figurative road to hell for us.

I naively claimed that we could just tell folks that troublesome keys need to be avoided. But how familiar are normal users with this issue? Scripts other than Latin-inspired ones deal with these issues, maybe more than any of us ever had, even among the implementers among us. If we avoid bringing up normalization completely, then the promise of greater Unicode acceptance in key names will make the TOML standard less human-friendly in those parts of the world.

We need a lot more people using scripts that we're unfamiliar with to chime in, to see what trying to sidestep the normalization issue would cause. And of course, they'll need to communicate what they know to us, and we are unfortunately bound to discussions in English at the very least. We need more "us" among us.

I'm starting to understand why @ChristianSi wants to make normalization a requirement. But because resource-limited systems need slimmer dependencies, it makes sense to permit ordinal comparisons. I don't want to ditch the embedded software development community, so short of a miracle, we can't make normalization a requirement.

Implementers would be free to just not implement normalization, try to avoid the issue entirely. That may not help! Our "configuration standard for human beings" may need normalization so that others can utilize TOML to the fullest.

@arp242 Do you know anyone who dealt with the issues in Scheme first-hand who you could introduce us to, so they can share their war stories? Who else has had to deal with the consequences of enforcing binary comparisons?

ChristianSi commented 1 year ago

Though I had argued for normalization as requirement, I could now live with the compromise proposal to merely mention it as an option.

In fact I must admit that I have some trouble understanding why some people here suddenly seem to be even against that compromise? Making it required would impose a heavy burden on some implementations, understood. But surely, leaving it as merely an option won't hurt them, or anyone else?

arp242 commented 1 year ago

@arp242 Do you know anyone who dealt with the issues in Scheme first-hand who you could introduce us to, so they can share their war stories? Who else has had to deal with the consequences of enforcing binary comparisons?

Sorry, no. I haven't been involved in a long time, and my involvement was only as a spectator from the sidelines.

Many people now are looking forward to the new unquoted keys. They may not forgive us for backing out.

"May not forgive us" is probably too strong. I'm not proposing we never implement this, but rather delay it until all issues are sorted. And people are looking forward to other changes too (newlines in inline tables probably being the most prominent). If this change would be (temporarily) backed out, then TOML 1.1 is ready to be released AFAIK. Nothing prevents us from releasing TOML 1.2, 1.3, etc. Actually reaching a consensus would be better, but I'm not very confident we will in the short term, and I'd rather see TOML move forward rather than being hung up on this.

ChristianSi commented 1 year ago

Hmm, we have heard from TOML implementers that they're against Unicode normalization of keys. Maybe it's time to ask the other way around: Assuming the spec would allow (but not require) normalization, are there any TOML implementations that would perform such normalization? Or are there any that do it already?

If yes, I'd consider that a clear argument in favor of allowing normalization, as suggested by @eksortso .

If not, on the other hand, then the whole issue seems a bit moot (what's the point of a permission that's not used?), so in that case we might as well go on and require byte-for-byte comparison instead.

eksortso commented 1 year ago

Revisiting this, and trying to clear away any sticking points stuck in by bruised egos, including my own. To this end, I won't refer to anyone by name or handle. I was doing so before because I thought it was only fair. But it may be dividing us. Let me gather my recent thoughts. If I'm repeating your argument without mentioning you, or countering an argument you previously stated, please don't consider it a slight.

That said, I now think allowing normalization in the parser is a mistake. The only advantage we get is preventing duplicate keys with different code point sequences. How often does this happen? I asked if anyone with experience outside of English deals with this as a regularly occurring problem. Six months with no response suggests it's not a pressing issue and not a regular occurrence. Unless we haven't connected with a non-English-speaking audience, and we know that isn't true.

Now coming at it from the other side: Should we enforce ordinal comparisons? Which programming languages don't do this by default? Database collations were mentioned. But the fact is, for parser writers, no string comparisons should ever rely on defaults. So I am currently in favor of enforcing ordinal key comparisons.

Who would object to this becoming a new rule? Where would this break backwards compatibility with TOML 1.0.0? It wouldn't. It would make valid the sorts of tricky obfuscated keys that Françoise and Françoise would object to. But the users who would do this should bear its burden, not the parser writers.

We ought to still warn against using similar key names. But we could state this after asserting that keys in tables must use ordinal comparisons.

Here's a new proposal:

Keys are identical if their code point sequences are the same. Since some strings may look the same but use different code points, be careful when using such strings as keys.

# CAUTION:

# prénom = "Françoise", using NFC
"pr\u00e9nom" = "Françoise"

# prénom = "Françoise", using NFD
"pr\u0065\u0301nom" = "Françoise"

erbsland-dev commented 1 year ago

I appreciate the revised proposal. Just like programming language specifications, which afford compiler developers varying degrees of freedom in implementing certain features, I'd suggest phrasing it as follows:

Keys are considered equal if their code point sequences are identical. If a TOML parser cannot perform comparisons based on code point sequences or if it implicitly normalizes Unicode text for keys, this behavior must be explicitly documented.

For specialized use-cases where Unicode characters in keys must be strictly identical, even across different composition formats, the responsibility for normalizing the Unicode strings falls upon the user of the TOML parser, both in the read configuration and when interacting with its API.

In such special cases where normalization is critical, the TOML configuration should be normalized prior to parsing, ensuring that all keys adhere to a consistent format when compared by code points.

In the rare event that a parser is unable to compare strings by code points due to limitations in the programming language it is implemented in, the parser's documentation must include a clear warning outlining this behavior.
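A rough sketch of "normalize the configuration prior to parsing", assuming Python's standard `unicodedata` module and NFC as the chosen form. Two caveats are worth stating up front: normalizing the whole text also rewrites string values, and characters written as `\u` escape sequences are not affected because they are still plain ASCII at this stage.

```python
import unicodedata

# Line 1: key written with a literal "e" + U+0301 (decomposed).
# Line 2: key written with a TOML escape sequence for U+0301.
raw_text = '"pre\u0301nom" = 1\n"pre\\u0301nom" = 2\n'

normalized_text = unicodedata.normalize("NFC", raw_text)

first_line, second_line = normalized_text.splitlines()
print(first_line)   # the literal key is now precomposed
print(second_line)  # the escaped key is unchanged
```

After parsing, the second key would still decode to the decomposed spelling, so a pre-parse pass like this only covers literally typed characters.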

This approach strikes a balance, providing a universally applicable standard while clearly delineating responsibilities for both parser developers and end-users.

(By the way, I'm an international user, working with German and French in customer applications where composition can be an issue when comparing characters such as äàéö, yet I would never use Unicode characters in the keys of configuration files.)

SnoopJ commented 1 year ago

Should we enforce ordinal comparisons? Which programming languages don't do this by default?

If you're asking about comparison of identifiers, I think the answer to this is at least most of the languages that implement UAX#31, which requires the choice of a normalization form. Practical examples are Python and C++ (after C++23). For example, in Python, your sample keys are the same identifier:

>>> prénom = "Françoise"  # NFC form
>>> print(prénom)         # NFD form
Françoise
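The same behaviour can be checked without relying on how a terminal or editor encodes the input. The following sketch (variable names are illustrative) builds the source text with explicit escapes, executes it, and inspects which name actually ends up in the namespace. It relies on Python normalizing identifiers to NFKC while parsing.

```python
# Source text whose identifier is spelled in NFD: "e" followed by U+0301.
source = "pr\u0065\u0301nom = 'Fran\u00e7oise'"

namespace = {}
exec(source, namespace)

# Drop the "__builtins__" entry that exec() adds to the namespace.
stored_names = [name for name in namespace if not name.startswith("__")]

nfc_name = "pr\u00e9nom"  # the same identifier spelled in NFC
print(stored_names == [nfc_name])
```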

If you're talking about comparison of strings, I am not aware of any that implicitly normalize; the ones I'm aware of do ordinal comparison of the code point sequences when it comes to string data.

I think part of the ambivalence I felt about the preceding discussion is caused by not being sure whether TOML keys are more like identifiers or more like strings. When I asked about a recommendation for normalization when filing this issue, I was thinking of them as more like identifiers, but looking over it again, I am now leaning towards a string-y interpretation.

But the users who would do this should bear its burden, not the parser writers. We ought to still warn against using similar key names. But we could state this after asserting that keys in tables must use ordinal comparisons.

This seems like a reasonable compromise that balances 'typical' usage and maintaining compatibility against including a warning that key confusion (which is probably uncommon) can happen. +1 from me.

ChristianSi commented 1 year ago

@eksortso: I like and support your proposal.

I think normalization is indeed not a pressing issue at least for the major European languages (I don't really know about others). In German, non-ASCII characters commonly occur in words such as Äpfel, Füße, Öl etc., but I think that all usual programs write them using the one-codepoint representation; two-codepoint alternatives are allowed, but I would suspect them to be very rare. So implicitly word processors, editors etc. already do NFC normalization, meaning TOML parsers don't have to. And I suppose it's mostly the same in other languages using the Latin script.

In general, I think that data formats will pursue the same route and not expect normalization. One I checked is the JSON standard (ECMA-404), which says: "A name is a string." And: "A string is a sequence of Unicode code points". So, if the sequence of code points differs, it's a different string/key, even if it looks the same. Sounds reasonable.

Python also doesn't normalize strings implicitly, for example "Françoise" == "Françoise" evaluates to False (first uses NFC, second NFD). Neither will most other programming languages, I suspect.
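A small sketch of that comparison, with the two spellings written as explicit escapes so the normalization form of each literal is unambiguous (the variable names are illustrative):

```python
import unicodedata

nfc = "Fran\u00e7oise"   # ç as one precomposed code point
nfd = "Franc\u0327oise"  # c followed by U+0327 COMBINING CEDILLA

# Plain string comparison is ordinal: the code point sequences differ.
equal_as_written = nfc == nfd
lengths = (len(nfc), len(nfd))

# Only explicit normalization makes the two spellings compare equal.
equal_after_nfc = unicodedata.normalize("NFC", nfd) == nfc

print(equal_as_written, lengths, equal_after_nfc)
```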

So, let's not do it either.

arp242 commented 1 year ago

As mentioned, the big question is "are TOML keys identifiers or strings?" – I'm firmly in the "identifier" camp on this; that some (but hardly all) languages parse tables to a hashmap with string keys is an implementation detail.

Almost all languages either 1) apply normalisation to identifiers, or 2) don't have these issues because combining characters aren't allowed in identifiers. Just like APFS, they don't do that for the fun of it or because people were bored.

People mentioned German, but I consider German to be "too easy" to serve as an example here; it's basically just Latin + umlaut on a very limited set of letters + a few other diacritics on some loanwords. It's not that different from English really. This is the case for most European Latin-based languages, and I consider those all "too easy"; what about Vietnamese? Greek? Korean? Bengali? Arabic? Other scripts?

Judging from one Stack Overflow answer, at least Vietnamese outputs non-NFC on Windows, although that's from 2011 so who knows if that's still accurate.

In short:

I'm not sure if "no one replied so let's just hope for the best" is a good path. Actually, I'm pretty sure it's not – not many people read this.
To really have an informed opinion on all of this, real-world experience and knowledge with different systems and scripts seems required, rather than just reading some specification document on unicode.org. None of us seems to have this.

marzer commented 1 year ago

I believe @eksortso's proposed wording change is the right move.

As mentioned, the big question is "are TOML keys identifiers or strings?" – I'm firmly in the "identifier" camp on this

Treating them as identifiers and expecting them to be normalized accordingly implies a number of significant problems:

If we enforce any sort of normalization, we are dooming TOML in lower-level contexts where implementers can't rely on their language, OS or local install environment having the necessary machinery. Existing third-party solutions are enormous, and correctly implementing it manually is no mean feat.
Imposing normalization precludes situations where people may explicitly not want that. Developers can apply normalization to keys after-the-fact if they need it, but they can't take it away if it's done for them and they don't want it, since it's a lossy operation.
If we normalize bare keys, do we extend this to quoted keys? If so, then we're effectively changing the contents of a quoted string, which is counter-intuitive. If not, we're still stuck with the same problem of visually identical strings sometimes comparing unequal, only now one of them has quotes around it, which isn't any better.
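The "lossy operation" point in the list above can be illustrated with a short Python sketch. The example characters are my own choice, not taken from the discussion: U+212B ANGSTROM SIGN and U+00C5 are distinct code points that NFC maps to the same result, after which the original spelling cannot be recovered.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# Ordinally these are two different one-character keys.
distinct_before = angstrom_sign != a_with_ring

# NFC maps the Angstrom sign onto U+00C5, so the distinction disappears.
same_after_nfc = unicodedata.normalize("NFC", angstrom_sign) == a_with_ring

# Decomposing again yields "A" + U+030A, not the original U+212B.
round_trip = unicodedata.normalize(
    "NFD", unicodedata.normalize("NFC", angstrom_sign)
)

print(distinct_before, same_after_nfc, round_trip == angstrom_sign)
```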

Whereas treating them as strings and always comparing them ordinally means:

Sometimes keys that look the same compare differently? Big whoop. We have that problem now with quoted keys. At least if we enforce ordinal comparison, the behaviour will be uniform and predictable everywhere, rather than 'implementer's choice'.

All the C and C++ implementations I'm aware of do ordinal comparison of keys. Higher-level language implementations seem to be a bit of a mixed bag, depending on what's available in their language (since they typically have the luxury to choose), but even they are going to be at the mercy of whatever normalization algorithm their language and/or standard library make available, and how current the underlying implementation actually is.

There also has been a temptation to make key normalization optional, but that just means we have a fragmented ecosystem. Nothing in the spec should be optional, IMO. When a change is controversial enough that people are even tempted to say the "O" word, that change should be a non-starter.

Thus, amending the spec to explicitly clarify that "yes, keys are treated as strings, and yes, they are compared ordinally" is the only portable and robust path forward.

ChristianSi commented 12 months ago

@arp242: Why are you in the "identifier" camp? I'd rather say keys are like keys in a hashmap, since that's precisely TOML's conceptual model ("Tables (also known as hash tables or dictionaries) are collections of key/value pairs", says the spec). However, few (none?) languages seem to normalize keys in such hashmaps, as we've noticed.

As for normalization requirements of different languages: yes, maybe normalization is indeed necessary to make keys in (say) Vietnamese or Arabic work well, who knows? None of us, it seems. Still, I think there are ways around that. One would be a preprocessing step: "If your keys are in a language where a lack of normalization is a problem, please make sure to only pass properly normalized input to your TOML parser."

Another would be to weaken @eksortso's proposal to make binary comparison the default behavior, but allow parsers to optionally normalize if requested by the user to do so. I would approve that, as I think if we have sane defaults then configurable deviations from those defaults are entirely acceptable.

arp242 commented 12 months ago

Keys don't need to be decoded to hashmaps; at least in Rust and Go it's common to unserialize to a struct, and this probably applies to other languages as well. Maybe I'm biased because I spend most of my time with Go, but the typical pattern is:

type Doc struct {
    Key string
}
var d Doc
toml.Decode(`key = "value"`, &d)

You can scan to a map, but it's uncommon. This is the pattern I'd generally prefer in most languages to be honest (even e.g. Python or Ruby, by e.g. scanning class attributes, although I don't know what the current possibilities for this are, it's the type of thing I'd consider writing a patch or my own library for – I suppose this is part of the great "static vs. dynamic" debate).

I'm not a huge fan of the explicit mention of "hashmaps" in the spec (even though it's been there right from the start, this is one of the areas where TOML shows its roots in Ruby and dynamic typing, which isn't necessarily a good thing).

As I mentioned in some earlier comments, I'm not a fan of enforcing normalisation. I'm just pointing out that "🤷" is not the best way either. I think I have a solution I will submit as a proposal sometime in the next few days.

marzer commented 12 months ago

This is the pattern I'd generally prefer in most languages to be honest [...] e.g. scanning class attributes

This is completely impossible in some languages because it requires some level of reflection that simply isn't there. C and C++, for example, have no built-in facility to do this (and are unlikely to get one in the next decade or so). You can implement it in C++ using user-supplied specialization machinery, but it's typically quite complex and error-prone.

Maybe I'm biased because I spend most of my time with Go

Indeed. Would you mind commenting on my concerns RE lower-level languages and/or environments? Because as it stands you seem to be ignoring them. We do so at our peril.

I'm just pointing out that "🤷" is not the best way either.

I should point out that I have not been "meh" about normalization at any point. I've argued quite stridently against it, and have provided pretty thorough reasoning for doing so. I'm not going to speak for others, but I think you're doing this discussion a disservice with that assessment.

Also, to your earlier comment:

To really have an informed opinion on all of this, real-world experience and knowledge with different systems and scripts seems required, rather than just reading some specification document on unicode.org. None of us seems to have this.

TOML++ is written purely in C++, and is used on Windows, Mac OS, iOS, many flavours of Unix, Android, Emscripten, and a bunch more esoteric things I hadn't even heard of before releasing it. Additionally, in the last two years or so it has picked up quite large user-bases in China, Japan, and the Koreas. I've had bug reports where people are sending me TOML documents with Chinese text in the strings and comments. I've had bug reports about localization issues in German, Italian, Greek. I think at this point I've got a pretty good knowledge base on the matter. I'll tell you what I've never seen? A bug report where normalization would have fixed it. I realize that's anecdotal, but your assertion that none of us has real-world experience here is not correct.

arp242 commented 12 months ago

I am not in favour of adding normalisation @marzer, partly because I want TOML to be implementable "from scratch" without dependencies by any competent programmer. I have said this multiple times over the last years, including in my last comment. Your response is as if I strongly disagree with you, but I don't: we agree that normalisation shouldn't be added to TOML, and have agreed on this for a long time.

This is the pattern I'd generally prefer in most languages to be honest [...] e.g. scanning class attributes

This is completely impossible in some languages

"Most", not "all". "If the language supports it" is an obvious qualifier, but many (not all) languages do.

My point was just that "a table is a hashmap" is not always true as such, and other methods/mappings exist, and are commonly used. The question was "why are you in the "identifier" camp?" and this is my response.

I'm just pointing out that "🤷" is not the best way either.

I should point out that I have not been "meh" about normalization at any point. I've argued quite stridently against it, and have provided pretty thorough reasoning for doing so. I'm not going to speak for others, but I think you're doing this discussion a disservice with that assessment.

Well, "Six months with no response suggests it's not a pressing issue and not a regular occurrence. Unless we haven't connected with a non-English-speaking audience, and we know that isn't true" certainly seemed like "no one replied so 🤷" to me. What will be the effects for Korean, Vietnamese, etc? I don't know. I suspect none of us do. There are some indications that it may be problematic. "No one replied" is not identical to "we have investigated the matter and/or consulted with experts, and are confident that [...]".

marzer commented 12 months ago

What will be the effects for Korean, Vietnamese, etc? I don't know. I suspect none of us do.

Hah. Interestingly I edited my earlier message to somewhat cover this. I have some experience here - see above.

Your response is as if I strongly disagree with you

I'm responding as if you still want to keep normalization on the table, when IMO it's totally untenable, that's all.

I could tolerate it being an 'optional' part of the spec, but I don't want people making bug reports to me because "this higher-level language implementation does it, why doesn't yours?", or "oh yeah TOML++ passes the toml-test suite as long as you turn off this optional part". Because, to be clear: I'm absolutely not implementing it, regardless of what ends up in the spec. It's a total non-starter in a cross-platform C++ implementation.

arp242 commented 12 months ago

I'm responding as if you still want to keep normalization on the table, when IMO it's totally untenable, that's all.

To clarify: it's not "on the table" as far as I'm concerned. But I'm also not happy with how things are. This is the intrinsic difficulty with all of this that I talked about before, because none of the options are "clearly good".

or "oh yeah TOML++ passes the toml-test suite as long as you turn off this optional part".

One (out of several) of the reasons I'm not really happy for "implementation defined behaviour" (any implementation defined behaviour) is exactly because it will complicate everything in toml-test. Not just that I will need to implement something for this, but also because it complicates things for all implementers who want to use toml-test. We've already seen this with support for the TOML 1.1 draft, which is "necessary complexity", but I don't think anyone is helped by more of this.

Hah. Interestingly I edited my earlier message to somewhat cover this. I have some experience here - see above.

Right so – I didn't see that 😅

I've seen comments in many languages, but I can't recall seeing non-ASCII key names (in quoted keys, because that's your only option now); all I'm going on is things like that Stack Overflow answer I mentioned, but also things like Apple removing normalisation when they moved from HFS+ to APFS, only to reluctantly bring it back because it caused issues for people (on the other hand, I believe NTFS/Windows doesn't do this and never has, so does this cause issues there? Maybe Windows does some stuff at other layers that macOS doesn't which makes this less of an issue?)

marzer commented 12 months ago

This is the intrinsic difficulty with all of this that I talked about before, because none of the options are "clearly good".

Yeah, I get that. This is the main reason I'm in favour of explicitly codifying ordinal comparison - it's (mostly) standardizing existing practice, and is the only truly portable way of handling the issue. I realize it still has caveats, but then there are literally zero solutions to this problem that are caveat-free, so IMO we should pick the most portable of the various evils we have available to us.

I believe NTFS/Windows doesn't do this and never has, so does this cause issues there? Maybe Windows does some stuff at other layers that macOS doesn't which makes this less of an issue?)

You're right, Windows doesn't do this anywhere in NTFS (or any other filesystem, for that matter):

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. ref

...though C# and higher-level language APIs might. In practice I've never heard of it causing significant issues (though I'm sure it has for someone, somewhere). My point is mainly that so far I have not had any bug reports where normalization would have solved the issue - people combining low-level programming languages with string handling pretty quickly become aware that strings are just bags of bytes, so I doubt it's a real issue for people other than just something they need to be aware of more generally anyway.

erbsland-dev commented 12 months ago

I'm somewhat perplexed by the ongoing discussion. The original post and title suggest that the main focus is on specifying that parsers should use binary comparison for keys, while also providing warnings about potential side effects.

From what I've gathered in the comment thread, there's a general consensus that implementing normalization requirements is impractical. This is largely due to the undue burden it would place on parsers by introducing substantial dependencies, for a benefit that is marginal at best. Concurrently, there's agreement that the method of key comparison should be explicitly defined, which paves the way for a specific mandate on code-point comparisons for keys.

I concur with both of these observations. Adding a bulky dependency like ICU to a parser introduces unnecessary complexity, potential security risks, and bloat. Furthermore, a well-defined standard that eliminates ambiguity is essential; it ensures that any TOML document will be parsed uniformly across different parsers that adhere to the standard.

I also appreciate the proposed text by @eksortso as a useful clarification. However, I'd like to suggest a slight alteration in the wording of the warning:

Keys are considered identical if their code-point sequences match.

Exercise caution when using keys containing composite characters; they might appear
identical but differ in code-point values. If your use case involves such keys,
it's advisable to normalize both the TOML document and the corresponding keys
in your code before parsing.

# prénom = "Françoise", using NFC
"pr\u00e9nom" = "Françoise"

# prénom = "Françoise", using NFD
"pr\u0065\u0301nom" = "Françoise"

In my earlier comment, I also added a text about the behaviour of parsers into the warning. But, I realised this was a mistake, so the text above is a warning directed at the user of a TOML parser.

What baffles me is the ongoing debate over the pros and cons of normalization, especially when there appears to be agreement on the core issues.

wjordan commented 11 months ago

One approach I haven't seen discussed here is that instead of recommending/requiring transparent normalization of keys or asking the user to simply 'exercise caution', the spec could recommend parsers implement configurable warnings ('diagnostics' in the UTS55 lexicon) to flag potential security issues, which may be a more broadly defined and open-ended measure of irregularity/confusability/potential spoofing than simply requiring a strict normalization form.

UTS55 mentions the Rust compiler as an example, which emits warnings about the use of identifiers outside the General Security Profile, as well as confusable identifiers more generally.

As another industry example implementing this approach, the GCC compiler includes a -Wnormalized warning that flags non-NFC identifiers by default:

In ISO C and ISO C++, two identifiers are different if they are different sequences of characters. However, sometimes when characters outside the basic ASCII character set are used, you can have two different character sequences that look the same. To avoid confusion, the ISO 10646 standard sets out some normalization rules which when applied ensure that two sequences that look the same are turned into the same sequence. GCC can warn you if you are using identifiers that have not been normalized; this option controls that warning.

There are four levels of warning supported by GCC. The default is -Wnormalized=nfc, which warns about any identifier that is not in the ISO 10646 “C” normalized form, NFC. NFC is the recommended form for most uses. It is equivalent to -Wnormalized. [...] You can switch the warning off for all characters by writing -Wnormalized=none or -Wno-normalized. You should only do this if you are using some other normalization scheme (like “D”), because otherwise you can easily create bugs that are literally impossible to see.
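For comparison, a TOML-side diagnostic in the spirit of `-Wnormalized=nfc` might look like the sketch below. This is only an assumption about how such a check could be written against an already-parsed document represented as nested dicts; the function name `warn_non_nfc_keys` is hypothetical and does not exist in any TOML library. It uses `unicodedata.is_normalized`, available since Python 3.8.

```python
import unicodedata
import warnings


def warn_non_nfc_keys(table: dict, path: str = "") -> list[str]:
    """Warn about keys that are not in NFC and return their dotted paths."""
    flagged = []
    for key, value in table.items():
        full = f"{path}.{key}" if path else key
        if not unicodedata.is_normalized("NFC", key):
            warnings.warn(f"key {full!r} is not NFC-normalized")
            flagged.append(full)
        if isinstance(value, dict):
            # Recurse into sub-tables so nested keys are checked as well.
            flagged.extend(warn_non_nfc_keys(value, full))
    return flagged


# Stand-in for parser output: one NFC key, one nested NFD key.
parsed = {"pr\u00e9nom": "a", "owner": {"pre\u0301nom": "b"}}
flagged = warn_non_nfc_keys(parsed)
print(flagged)
```

A check like this leaves key comparison purely ordinal and only reports suspicious input, so it would not change which documents a parser accepts.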

arp242 commented 11 months ago

Continuing from #990; @marzer said:

I disagree with you that "only 17kb of memory" is fine; for some embedded environments that's absolutely a deal-breaker

How much memory does toml++ use now? I ran example/example.toml via example/simple_parser.cpp, and it uses ~210K of memory:

% clang++ -std=c++17 -O2 simple_parser.cpp
% valgrind --tool=massif --time-unit=B ./a.out

If I reduce example.toml to just one line (the title at the top) it uses ~190K, so it's not the size of the file.

But maybe I'm using it wrong? I didn't look beyond just running valgrind like this.

tomlc99 uses a lot less memory though: ~14K on the full file, or ~5K for a single-line file (via toml_cat.c).

I'm not really concerned about security @wjordan, beyond LTR control codes and the like. I suppose someone could make it appear a key was set, or make it appear something was commented out, but applications should really reject unknown keys – typos are also a security risk.

toml-lang / toml

Clarify that key uniqueness depends only on binary representation, recommend normalization #966