encoding/json: clarify Decoder.InputOffset semantics

dsnet commented 3 years ago

The documentation on Decoder.InputOffset currently says:

The offset gives the location of the end of the most recently returned token and the beginning of the next token.

This explanation is somewhat incorrect since the offset cannot be a single value if there are intervening whitespace in-between two JSON tokens. Furthermore, the json package implicitly handles colons and commas, which are technically JSON tokens according to RFC 7159, section 2. May the offset ever be past these implicit colons and commas, or will it always be before them? The "token" in this context is a somewhat ambiguous.

The documentation should be clarified to say one of the following:

The offset may be between the end of the most recently returned token (exclusive), and before the beginning of the next token that may be returned (inclusive).
The offset points to the next character immediately after the most recently returned token.
The offset points to the first character of the next token that may be returned.

I chose the phrase "may be returned" to indicate that colons and commas aren't included since they aren't ever returned by the json API. I haven't checked the implementation yet to see what it actually returns.

The first definition gives more flexibility to how the Decoder may actually be implemented. The latter two definitions gives the user more assurance about what InputOffset actually means.

Updates #29688

mvdan commented 3 years ago

I agree that the current definition is a bit ambiguous, happy to review a CL. I assume we should just document the current behavior.

Windsooon commented 2 years ago

I love the third idea, how about:

"The offset points to the first character of the next token that may be returned. (except colons or commas)"

The current comments and implement at src/encoding/json/stream.go didn't treat colons or commas as Token:

// A Token holds a value of one of these types: // // Delim, for the four JSON delimiters [ ] { } // bool, for JSON booleans // float64, for JSON numbers // Number, for JSON numbers // string, for JSON string literals // nil, for JSON null

It looks like we have to change a lot of comments if we want to fix it.

dsnet commented 2 years ago

"The offset points to the first character of the next token that may be returned. (except colons or commas)"

This is the most user friendly semantic, but breaks in other ways since it requires reading from the input stream until it locates the next token. This would cause the Decoder to block on a read (e.g., from a network socket) in cases when it otherwise would not have.

It looks like we have to change a lot of comments if we want to fix it.

A JSON token is a well defined in RFC section 2. I don't think we need to enumerate every possible case.

golang / go

encoding/json: clarify Decoder.InputOffset semantics #42571