google / gson

A Java serialization/deserialization library to convert Java Objects into JSON and back

Apache License 2.0

23.34k stars 4.28k forks source link

Proposal to expose Position information used for error messages in JSonReader in public API #2373

Open jurgenvinju opened 1 year ago

jurgenvinju commented 1 year ago

Thanks for maintaining this library!

Problem solved by the feature

Many projects use JSON notation for their DSL's syntax
Reusing Gson implies high quality and compatible parsing for those DSLs
IDE support for DSLs requires position information (offset/length, start line/column, end line/column
For example to create Language Servers for the LSP for the given (embedded) DSLs

Feature description

Proposal to expose the following fields of JsonReader:

private int pos = 0;
private int lineNumber = 0;
private int lineStart = 0;

With these public methods:

public int getPosition() {
  return pos;
}

public int getLineNumber() {
  return lineNumber + 1; // posix has the first line at 1
}

public int getColumnNumber() {
  return pos - lineStart; // posix has columns start at 0
}

Using these methods just before beginObject() and just after endObject() you get an accurate measurement of the span of every object in the character stream.

Alternatives / workarounds

We currently use the reflection API to access these fields in the usethesource/rascal meta-programming project.
That seems to work fine, but it's not as nice as it could be :-)

Marcono1234 commented 1 year ago

We currently use the reflection API to access these fields

How reliable are these values though? It has been a while since I had a look at that but there might be cases where JsonReader updates these values before the token has actually been consumed (probably related: #1764).

jurgenvinju commented 1 year ago

So far so good. How about I test them for a few weeks and see what comes up?

The pos value can indeed be off due to "lookahead" which is later corrected by subtracting again. It matters a lot when you ask for the position. So far asking before or after beginObject and endObject seems to be completely safe. The code hints at issues when items are missing completely, such as empty arrays.
The lineNumber counter is extremely simple and does not seem to be in danger of being incorrect
lineStart is simple in similar ways.

Marcono1234 commented 1 year ago

How about I test them for a few weeks and see what comes up?

You don't have to test this only for this; I was just curious if you are already aware of any issues.

My main concern is that by exposing this information with a public API JsonReader would make a strong commitment that the values are correct (or at least users might expect this). Whereas currently the location information in the exception messages is probably on a best effort basis (?). A big problem with exposing this information is also that it exposes the implementation details of when and how JsonReader skips whitespace or JSON tokens. For example a call to getPosition() would have to guarantee that all whitespace in front of the next token has been skipped. Similarly a call to getPosition() after reading an array item or a object member name must not have skipped , or : (or at least I assume that is what users expect). Maybe JsonReader already adheres to that by accident, but it imposes some restrictions on how JsonReader can parse the data.

But maybe these concerns are unjustified and supporting this in a reliable way would not be that difficult.

Note that I am not a direct member of this project; they might think differently about this.

jurgenvinju commented 1 year ago

That makes perfect sense. Your concern is with correctness and what the definition of it is.

Let's have a go at defining what a "correct" implementation of getPosition() would do, and relate that specifically to the post-conditions of every other method in JsonReader. Later we can see if this definition is indeed implemented by the class and then we also know how to test for these properties. pos^ means pos before executing a method and pos is after executing the method.

beginArray: input[pos] == '['
endArray: input[pos-1] == ']'
beginObject: input[pos] == '{'
endObject: input[pos-1] == '}'
hasNext: pos^ == pos (pos did not change)
peek(): pos^ == pos
String nextName(): pos = pos^ + length(returnValue)
String nextString(): input[pos^] == '\"' ==> input[pos-1] == '\"' && pos >= pos^ + length(returnValue) // unescaping makes the string possibly shorter, but not longer
String nextBoolean(): pos = pos^ + length(returnValue)
String nextNull(): pos = pos^ + 4; // 'null' has four characters
String nextDouble(): pos >= pos^ + length(returnValue) // different unicode escapes can still lead to shorter string doubles
String nextLong(): pos >= pos^ + length(returnValue) // different unicode escapes can still lead to shorter string longs
String nextInt(): pos >= pos^ + length(returnValue) // different unicode escapes can still lead to shorter string longs
void close(): pos == pos^
skipValue(): pos > pos^
toString(): pos == pos^
getPath(): pos == pos^

WDYT? did I get this right?

jurgenvinju commented 1 year ago

For example a call to getPosition() would have to guarantee that all whitespace in front of the next token has been skipped.

This one follows from post-conditions such as input[pos]=='[' ; I've tried not to generalize to "all" whitespace and "any" token, in order for the post-conditions to remain as simple as possible. Nevertheless the general idea is exactly as you suggested.

Marcono1234 commented 1 year ago

did I get this right?

Note sure for hasNext() and peek(). The current implementation does not eagerly peek at the next value, but only once you call these methods (or any value consuming method). Similarly all assumptions about for pos^ might only hold if you first either called hasNext() or peek() and then afterwards pos^ = getPosition() before calling methods for consuming the value.

For array and objects I think it should be input[pos^] = [ | { | ] | } and input[pos-1] = [ | { | ] | }, shouldn't it? That is, pos^ should point at the opening / closing bracket and pos should point right behind it.

For nextString() I think it should be pos >= ... + 1 to account for the closing " (which is at pos - 1, as shown in your comment).

Another point might be lenient-mode of JsonReader; I am not sure if it makes sense to make any guarantees there for the position methods or if the behavior there should just be declared undefined, because detection of unquoted strings and for example trailing , being an implicit null might lead to incorrect positions.

Marcono1234 commented 1 year ago

Maybe it would be easier if you looked for a JSON parser which is explicitly designed to report accurate position information, possibly one which is used by IDEs such as in IntelliJ IDEA; haven't checked though which library they are using.

Or alternatively a parser generator library such as ANTLR could be used, they already have an existing grammar for JSON: https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4 (but the licensing might be a bit unclear, see https://github.com/antlr/grammars-v4/issues/1607) That would have the advantage that you get accurate locations for start and end of tokens.

eamonnmcmanus commented 1 year ago

I agree with the points made by @Marcono1234. If we exposed this information, there might be some impact on performance of certain cases (because the position we record now doesn't quite make sense). We would also be tying our hands for future parser changes, since we'd need to be sure to preserve the reported positions.

I think the current situation, where you can get this information but only through reflection, is probably better. That makes it clear that you're going through a back door and that the reported positions might change in future versions of Gson. We probably wouldn't change them gratuitously: if you want, you can contribute a test that would fail if they did change. We still wouldn't guarantee anything, but at least we would know if one of our changes was going to alter the reported positions.

Fundamentally I think there's a conflict between a fast parser, which Gson aims to be, and a detailed one. It might indeed be better to look at another JSON library.