skeeto / pdjson

C JSON parser library that doesn't suck
The Unlicense
281 stars 36 forks source link

Few more fixes and improvements #19

Closed boris-kolpackov closed 4 years ago

skeeto commented 4 years ago

I agree with your note about the strangeness of "01" being two separate events. This was not a deliberate behavior, just a side effect. Also as you point out, it's not necessarily wrong either. We probably should require whitespace, but I'm going to ask around for other people's thoughts, as well as see what other parsers do. Not many parsers support JSON streams, or when they do the requirements are very strict (one value per line).

I'm seeing that "jq" allows leading zeros on numbers (but it's still not octal), so it doesn't have to make a decision on this issue, though it does mean it accepts invalid JSON.

I have a "strict-ansi" branch where I've done a lot of this cleanup, but I also made some breaking changes. I've considered making that branch a 2.0 version of the library.

boris-kolpackov commented 4 years ago

Thanks for the quick response.

Regarding the streaming mode, yes, this would be a backwards-incompatible change. Perhaps (thinking out loud here), we can configure this mode with two strings: the first containing the required separator characters and the second -- allowed (with NULLs resulting in the current behavior). The semantics would be that there should be at least one required and any number of optional separators between any two value. We would need to have a special case for allowing {..}{..} and/or [...][...]. For example, we could treat closing }/] as the first separator character to consider in the required string.

Should I create an issue to capture this discussion?

Regarding strict-ansi, I looked at that branch when deciding what to base the build2 package on, but concluded it's not necessarily the next/better version but rather an alternative. Generally, I personally don't see value in using antiquated standards if a more modern (C99) is generally available.

skeeto commented 4 years ago

I discussed the "01" issue with some friends, acquaintances, co-workers, and even some strangers, then I thought about it some more. Ultimately I think the current behavior is correct and logical despite being a bit counter-intuitive. It's also consistent with both Firefox and Chrome's JSON.parse function. I'm going to write a whole article on this, but here's the summary:

Consider a non-concatenated case of "01" like "[01]". You might think the error is that there's a leading zero, but if you look at it from a JSON parser's point of view, that's not what's wrong. Any reasonable JSON parser sees array-begin, number:0, and then finally runs into a parsing error looking for either a comma or closing square brace. We can observe this in the error messages from Firefox:

SyntaxError: JSON.parse: expected ',' or ']' after array element

And Chrome/Chromium:

SyntaxError: Unexpected number in JSON at position 2 at JSON.parse

Neither complain about leading zeros, they're both confused about seeing "1" where they expect to find a comma. Perhaps a better diagnostic (for humans) would be to examine the broader situation, guess that this was intended to be a leading zero, and mention that in the error message. Since JSON doesn't even have the concept of a leading zero, and so parsers aren't looking for one, there's no natural diagnostic for it.

There's a similar surprise sticking other "primitive" values against each other: "truefalse". When pdjson parses this as a concatenation, it produces two boolean values. Again, Firefox and Chrome do the same thing. They successfully parse "true" then error at seeing "false" instead of a comma.

Since the parser is working as it should, it's on producers of concatenated JSON to be reasonable in what they produce.

boris-kolpackov commented 4 years ago

I agree with everything you've said and in the strict (i.e., non-streaming) mode the parser behaves correctly (even if the diagnostics is not very intuitive).

My problem is with the streaming mode: currently there is no way to require a certain separation and diagnose any violations. Specifically, the two cases, 01 and 0 1, are currently indistinguishable. Similarly, {...}{...}, {...} {...}, and

{...}
{...}

Are all treated the same.

boris-kolpackov commented 4 years ago

After thinking some more on this, perhaps an even simpler solution is to not read the trailing whitespaces in the streaming mode and instead let the user read and validate them instead (and if the user doesn't read them, they will still be ignored, just as leading whitespaces, not trailing). The only other thing that we will need to do for this approach is to provide public API for the json->source.{get,peek}() calls (so that the user can validate the separation, if necessary).

What do you think?