w3c / json-ld-syntax

JSON-LD 1.1 Specification
https://w3c.github.io/json-ld-syntax/

Avoid floating-point number for `@version` #296

Closed: lo48576 closed this issue 4 years ago

lo48576 commented 4 years ago

The API provides an option for setting the processing mode to json-ld-1.0, which will prevent JSON-LD 1.1 features from being activated, or error if @version entry in a context is explicitly set to 1.1.

https://w3c.github.io/json-ld-syntax/#dfn-processing-mode

JSON-LD processors are required to check the @version value if it is found in the appropriate place, but floating-point comparison is unnecessarily complex for this purpose.

An IEEE 754 floating-point number cannot represent 1.1 exactly, and different programs may use various (not identical) representations very close to 1.1 (but never exactly 1.1). This can cause a straightforward implementation (for example, if (version == 1.1) { ... }) to fail to check the version correctly. To deal with this problem, processors are forced to compare values in a complex way (such as if (version >= 1.1 - EPSILON && version <= 1.1 + EPSILON) { ... }).

The JSON-LD version is neither a real number nor a floating-point number. I think it should be the string "1.1" or some other exactly representable value, instead of a floating-point number.
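The two comparison styles described above can be sketched in Rust. The helper names and the 1e-9 bound are hypothetical, not from any JSON-LD API; note that a correctly rounded parser does make the naive check pass, which is why the failure is only potential:

```rust
// Hypothetical helpers illustrating the two version-check styles.
fn is_version_1_1_naive(version: f64) -> bool {
    // Bit-exact comparison; correct only if both sides hold the same f64.
    version == 1.1
}

fn is_version_1_1_epsilon(version: f64) -> bool {
    // The "complex way": tolerate nearby representations.
    // The bound (1e-9 here) is arbitrary, which is part of the complaint.
    (version - 1.1).abs() < 1e-9
}

fn main() {
    // Rust's standard float parser is correctly rounded, so the naive
    // check happens to pass here; nothing in the JSON grammar guarantees
    // this for every parser, which is the point of this issue.
    let parsed: f64 = "1.1".parse().unwrap();
    assert!(is_version_1_1_naive(parsed));
    assert!(is_version_1_1_epsilon(parsed));
    println!("both checks passed for this parser");
}
```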

lo48576 commented 4 years ago

Another argument for the version not being a floating-point number: what happens after version 1.9? Version 1.10 (== 1.1 as a number)?

gkellogg commented 4 years ago

It's unlikely that we'll ever need to change the version number, although subsequent versions may choose to include it. The primary use for it is so that 1.0 processors will fail when they see it, as the context processing wasn't adequately restrictive in 1.0. It is in 1.1, so that if something changes in the future, a 1.1 processor would detect it and fail.

Note that JSON numbers may be represented as floating point internally after conversion to a native form, but the JSON spec doesn't require this. If an implementation had a problem with a floating-point conversion, it could handle this as it sees fit, and really could simply check that the entry exists without parsing the number. IIRC, there are no tests for numeric values other than 1.1.

davidlehn commented 4 years ago

There should be a FAQ somewhere for this. The question will come up often. Where do we put something like that?

I've never liked the non-string version either, but the most succinct alternatives were not much better. I think ["1.1"] was the best? A bit late in the process to discuss this again unfortunately.

It's much more likely we'll have 2.0 before hitting a 1.10 problem.

@lo48576 Did you have a real situation where a simple 1.1 comparison didn't work? I'm curious what system would make this more than a theoretical problem.

lo48576 commented 4 years ago

@gkellogg

Note that JSON numbers may be represented as floating point internally after conversion to a native form, but the JSON spec doesn't require this.

Yes, but I think typical JSON-LD processors may use a third-party library to parse a JSON document into a native structure, and the raw string representation might not be available after parsing.

@davidlehn

I think ["1.1"] was the best?

A tuple ([1, 1] in JSON?) would also be OK (fixed length is better).

A bit late in the process to discuss this again unfortunately.

JSON-LD 1.1 is currently a Working Draft and is updated frequently, and I think it would be best to update the spec for @version to be consistent with future versions before it reaches Candidate Recommendation (and before real-world applications start to use 1.1).

Did you have a real situation where a simple 1.1 comparison didn't work?

Currently, no. In my environment, naive comparison works for 1.1, but no one guarantees that it will work as expected in the future. However, version manipulation (in the future) would also be unnecessarily complex, because 1.1 + 0.1 and 1.2 are NOT equal. (Of course implementations can internally convert 1.1 to "1.1" and 1.2 to "1.2", but then why not use "1.1" from the beginning?)
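The inequality is easy to demonstrate with IEEE 754 doubles (a minimal sketch, not tied to any particular JSON-LD implementation):

```rust
fn main() {
    // The f64 nearest to 1.1 plus the f64 nearest to 0.1 rounds to a
    // value slightly above the f64 nearest to 1.2, so as doubles the
    // two sides are not equal.
    assert!(1.1_f64 + 0.1 != 1.2);
    println!("1.1 + 0.1 = {:?}", 1.1_f64 + 0.1);
    // Strings, by contrast, compare exactly.
    assert_eq!("1.2", "1.2");
}
```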

From the point of view of code quality, naive floating-point comparisons are almost always discouraged.

lo48576 commented 4 years ago

I'm trying to develop a JSON-LD processing library in the Rust programming language, and I'm worried about the situations below. These force the processor implementation to use unnecessarily complex code.

  1. If the binary internal representation of JSON-LD documents is stored in a DB, future versions of processors might fail to check @version in a naive way.
  2. Multiple ways to parse or create floating-point numbers can coexist in a program.
  3. Generally, it is discouraged to check floating-point numbers for equality in a naive way.
  4. The "correct" way of comparing brings inevitable (excessive) flexibility to the implementation.

Binary internal representation and compatibility among future processors

String representations have some overhead (parsing, delimiters, whitespace, etc.), so applications using JSON-LD may store a binary representation of a JSON-LD document (in CBOR or BSON, for example) in a DB. However, the way processors parse the string 1.1 into a binary 1.1 might differ among processor implementations, and among different versions of the same library or application. This makes it hard to guarantee that such applications can be updated safely (without introducing new problems).

Multiple ways to parse or create floating-point numbers in a program

Floating-point numbers can be created by different algorithms within a single program, typically by the compiler and by the JSON parser. (See https://github.com/serde-rs/json/issues/536 for an example. It may not cause a difference for 1.1, but it is a real example of different representations being produced for the "same" number.) This makes it almost impossible to guarantee that two floating-point numbers have the same representation if they came from different places.

It is discouraged to check floating-point numbers for equality in a naive way

The comparison 1.1 == parse_as_json("1.1") worked as expected in my environment, but it is not guaranteed to work in every environment at all times. Such naive comparisons are usually discouraged, and some compilers and linters warn about them. (I have never heard of a situation where a naive equality check is recommended...)

Complex comparisons are hard to justify

If I use if (version > 1.09 && version < 1.11) to check that the version is 1.1, readers of the code (including my future self) may wonder, "Why did the developer choose ±0.01? The range seems too wide / too narrow!" I can say "this works, so it's OK", but I cannot explain why this particular value should be used. Such unnecessary code and constants make the code hard to read and maintain.
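A sketch of why the ad-hoc bounds are hard to justify (the helper name is hypothetical): any range wide enough to feel safe also accepts values that were never valid version numbers at all.

```rust
// Hypothetical range check, mirroring the one criticized above.
fn looks_like_1_1(version: f64) -> bool {
    version > 1.09 && version < 1.11
}

fn main() {
    assert!(looks_like_1_1(1.1));   // accepts the intended value...
    assert!(looks_like_1_1(1.095)); // ...but also values that are clearly not 1.1
    assert!(!looks_like_1_1(1.2));  // rejects other versions, as intended
}
```

Narrowing the bounds only trades one unexplained constant for another; no choice of epsilon is self-evidently correct to a reader.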

TallTed commented 4 years ago

I should think that JSON-LD @version should be using semantic versioning, and that each dot-segment must be treated as an integer -- i.e., 1.1 is the sequence one-dot-one, not the decimal one-point-one. (Even if not using semantic versioning, this segmentation is the way version numbers are generally meant by developers, though they are often misinterpreted and mishandled by marketing and other non-programming personnel.)

For most purposes, version values should be treated as strings (not as numbers), and only converted to numerics when necessary to order by version -- and as noted above, this must be done segment by segment, or the ordering will be incorrect (putting 1.11 [one-dot-eleven] between 1.1 [one-dot-one] and 1.2 [one-dot-two]).
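The segment-by-segment ordering can be sketched in Rust (the helper is hypothetical, not part of any JSON-LD API):

```rust
// Hypothetical helper: split a version string on dots and treat each
// segment as an integer, as suggested above.
fn segments(version: &str) -> Vec<u32> {
    version
        .split('.')
        .map(|s| s.parse().expect("segment is not an integer"))
        .collect()
}

fn main() {
    // Segment-wise (Vec compares lexicographically), 1.2 comes before
    // 1.10: one-dot-two < one-dot-ten...
    assert!(segments("1.2") < segments("1.10"));
    // ...while plain string ordering gets this backwards.
    assert!("1.10" < "1.2");
}
```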

Any other handling of version values is fraught with peril, as the above discussion highlights.

(It's probably not relevant to JSON-LD, per se, but the issues become clearer and potentially more problematic when you get to 3 and 4 segment versions, e.g., 1.23.42.1235.)

gkellogg commented 4 years ago

The choice of a numeric datatype for @version is deliberate. A JSON-LD 1.0 processor would accept "@version": "1.1" and create a term. The whole purpose of using a numeric value is so that a JSON-LD processor which had not been updated to at least do better range checking of values in a context would throw an error. As @davidlehn noted, we could have used an array representation as well, although this was not considered as aesthetically pleasing.

As I also mentioned, a particular numeric value should not be important, and I don't see that future JSON-LD specifications would need to lean on this mechanism, as the 1.1 spec now is very explicit on what keys can be in a term definition, and what values @container, as well as other term properties can take. (We do ignore unknown keyword-like terms and issue a warning, but perhaps we should error on these as well).

While "@version": "1.1" would have been a logical choice, it would not have accomplished the goal of creating a value that a 1.0 processor would have rejected.

We could consider that any numeric value could be used, which would accomplish the same objective.

If 1.0 had been more prescriptive about what can appear in a context or a term definition, we wouldn't have needed to introduce @version at all.

pchampin commented 4 years ago

We could consider that any numeric value could be used, which would accomplish the same objective.

Well, that would put a future JSON-LD WG in the same uncomfortable situation as the one we started with: if they want to mark some strictly 1.2 or 2.0 data, and be sure that any 1.0 or 1.1 processor rejects it, they would need to introduce "@version": <neither a string nor a number> to achieve that!...

gkellogg commented 4 years ago

By locking down what can be in a context or term definition, and elsewhere in the expansion algorithm, we don’t need @version announcement to detect these features. A 1.1 processor would reject things introduced in a future version. If we had done this in 1.0, we could have avoided @version altogether.

pchampin commented 4 years ago

@gkellogg Locking down contexts and term definitions does not entirely solve the problem. There might be situations where the only difference is in the data, not in the context.

Imagine that JSON-LD 2.0 introduces a way to annotate triples with properties (à la Property Graphs). This could be done this way:

{
  "@context": { "@version": 2.0, "@vocab": "http://example.org/ns/" },
  "@id": "#alice",
  "name": "Alice",
  "spouse": {
    "@object": {
      "@id": "#bob",
      "name": "Bob"
    },
    "since": "2001-02-03"
  }
}

Here, the new keyword @object would change the semantics of the map, so that since is a property of the triple #alice :spouse #bob.

Except for the 2.0 version, this data would be accepted by a 1.1 processor, but the @object key and its value (the description of Bob) would be ignored, and the property since would be applied incorrectly to Alice's spouse. So it makes sense for the processor to reject this data if @version is not 1.1.

Of course, some new keywords might be safely ignored; that is why (I think) we decided to ignore unknown keywords rather than rejecting them systematically. But future versions should have a way to make their data unacceptable to prior processors.

gkellogg commented 4 years ago

@pchampin Of course, you're correct. To be complete, we'd also need to make those keys illegal outside of the context, which would violate what we were trying to achieve in the first place.

Living with "@version": 1.1 seems like the best bet. We could consider carving out a namespace, such as terms beginning with @ld, but that seems overly limiting and not that satisfying.

Regarding semantic versioning (@TallTed), in principle, I agree, but this is the mechanism we're left with due to issues in 1.0. I don't see too much interest in changing to use an array form, such as ["1", "1"].

iherman commented 4 years ago

This issue was discussed in a meeting.

azaroth42 commented 4 years ago

Closed by #301