ibireme / yyjson

The fastest JSON library in C
https://ibireme.github.io/yyjson/doc/doxygen/html/
MIT License
3.04k stars 262 forks source link

Wrong read error #103

Closed lnkuiper closed 1 year ago

lnkuiper commented 1 year ago

Describe the bug A wrong read error is given when a JSON string is terminated at a specific place. For example:

{"duck": 42.

Will give a YYJSON_READ_ERROR_INVALID_NUMBER. However, it should be YYJSON_READ_ERROR_UNEXPECTED_END.

Your environment

Additional context I've implemented a streaming read in DuckDB using yyjson, and we read a file block-by-block. The block boundaries are arbitrary, so this happened while running a test.

I've changed occurrences of return_err(cur - 1, INVALID_NUMBER, msg); in yyjson.cpp to return_err(cur, INVALID_NUMBER, msg);, which seems to solve the problem.

ibireme commented 1 year ago

Thanks for your feedback~ I've fixed the incorrect error codes and added some test cases for them: https://github.com/ibireme/yyjson/commit/719b310ddf0a972a3f2ace140faaf207fccd6303

lnkuiper commented 1 year ago

Hi @ibireme ,

I have run into a similar issue again, this time with literals. If we read a JSON that has been cut off like this:

{"duck":tru

We get a YYJSON_READ_ERROR_LITERAL, but it should be YYJSON_READ_ERROR_UNEXPECTED_END. This is because the original pointer is not updated by the functions that read literals (e.g., true, false, inf, etc.). For example:

static_inline bool read_true(u8 **ptr, yyjson_val *val) {
    u8 *cur = *ptr;
    u8 **end = ptr;
    if (likely(byte_match_4(cur, "true"))) {
        val->tag = YYJSON_TYPE_BOOL | YYJSON_SUBTYPE_TRUE;
        *end = cur + 4;
        return true;
    }
    return false;
}

Should increment *ptr to where the error happens before returning false.

ibireme commented 1 year ago

@lnkuiper I think I understand your need now: a valid JSON truncated at any position should return exactly YYJSON_READ_ERROR_UNEXPECTED_END.

I just returned UNEXPECTED_END as a fallback before and didn't take care of this usage. So there is still some work to be done, especially to handle truncated unicode strings accurately.

lnkuiper commented 1 year ago

@ibireme exactly. I have a workaround that seems to work fine, so fixing this is not that important, but I thought I would report it anyway.

ibireme commented 1 year ago

I've fixed it by adding a new function that can detect whether a JSON that fails to parse is invalid, or valid but truncated: https://github.com/ibireme/yyjson/commit/8ffa6f54df76242068e39a613893fbe8c3e6f8ab

I tested it with the JSONTestSuite dataset, and it seems to be working as expected.

lnkuiper commented 1 year ago

@ibireme This will help a lot with identifying problems when doing streaming JSON reads. Thank you so much!