ibireme / yyjson

The fastest JSON library in C
https://ibireme.github.io/yyjson/doc/doxygen/html/
MIT License

Add option to allow reading invalid unicode #78

Closed · pavelxdd closed 2 years ago

pavelxdd commented 2 years ago

I've been using this patch in my project for months, since I don't need the JSON reader to fail when there are invalid sequences in JSON strings.

Can we add a new option YYJSON_READ_ALLOW_INVALID_UNICODE (which could be compiled out via YYJSON_DISABLE_NON_STANDARD) to allow parsing invalid unicode data?

codecov[bot] commented 2 years ago

Codecov Report

Merging #78 (545b8ab) into master (c2df748) will decrease coverage by 0.08%. The diff coverage is 83.33%.

    @@            Coverage Diff             @@
    ##           master      #78      +/-   ##
    ==========================================
    - Coverage   97.68%   97.59%   -0.09%
    ==========================================
      Files           2        2
      Lines        2158     2163       +5
    ==========================================
    + Hits         2108     2111       +3
    - Misses         50       52       +2

    Impacted Files    Coverage Δ
    src/yyjson.h      93.28% <ø> (ø)
    src/yyjson.c      97.88% <83.33%> (-0.10%) ↓

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c2df748...545b8ab.

ibireme commented 2 years ago

Incorrect UTF-8 strings are dangerous and can make your code vulnerable.

For example:

    size_t size = 3;
    char *buf = malloc(size);
    buf[0] = '"';
    buf[1] = (char)0xF0; /* lone lead byte of a 4-byte UTF-8 sequence */
    buf[2] = '"';
    yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);

    char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_UNICODE, NULL);

When the writer sees 0xF0, it assumes it is the lead byte of a 4-byte encoded Unicode code point, so it continues reading the following bytes past the end of the buffer, causing a heap-buffer-overflow.
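
To make the failure mode concrete, here is a minimal sketch (illustrative only, not yyjson's actual write_string code; copy_utf8_char is a hypothetical helper) of a writer step that trusts the lead byte:

    #include <string.h>

    /* For str = "\xF0" with str_len = 1, n becomes 4, so the memcpy
       reads 3 bytes past the end of the string buffer. */
    static size_t copy_utf8_char(char *dst, const char *str,
                                 size_t i, size_t str_len) {
        unsigned char c = (unsigned char)str[i];
        size_t n = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        (void)str_len; /* the bug: the length is never consulted */
        memcpy(dst, str + i, n); /* out-of-bounds read when i + n > str_len */
        return n;
    }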

pavelxdd commented 2 years ago

You're right, but the issue in your example is not with the reader, it's with the writer. The reader correctly parses the input string and stores a string of length 1 in the root.

The write_string function has a str_len argument, which is 1 here, so it shouldn't try to access the buffer outside the specified range.

IMO the writer should report unicode errors like the reader does now, and optionally (with a new option, maybe) allow writing these values. What do you think?

I don't want to validate all my input data as UTF-8 myself (with other libraries, or how?); instead, yyjson should respect str_len inside write_string.
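
For illustration, the kind of bounds check I'm asking for might look like this sketch (hypothetical, not yyjson's write_string):

    #include <stddef.h>

    /* Compute the sequence length from the lead byte, but clamp it so
       the writer never reads past str_len; a truncated sequence at the
       end of the string is then treated as a single invalid byte (to be
       escaped or rejected) instead of being read out of bounds. */
    static size_t clamped_seq_len(const unsigned char *str,
                                  size_t i, size_t str_len) {
        unsigned char c = str[i];
        size_t n = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        return (i + n <= str_len) ? n : 1;
    }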

TkTech commented 2 years ago

@ibireme it is possible to do this safely, as @pavelxdd suggests, and simdjson does do this.

This can have benefits. Some language wrappers, like Python's, always do UTF-8 validation when creating the high-level str, so I end up validating twice when loading data.

ibireme commented 2 years ago

Yes, the risk is not with the reader, but with the caller who uses these strings. We should add more comments to make users aware of this risk.

With the default options, the strings are already validated by the reader, so there is no need for the writer to validate them again; additional validation would degrade serialization performance. Perhaps a new option like "validate unicode" could be added later.

@TkTech The pull request still does validation, but continues on error, so performance will not change.

pavelxdd commented 2 years ago

@ibireme my point was that the current writer behaviour is strange. You don't usually expect a function with a size constraint in its arguments to access memory outside the specified region. Imagine if the standard functions memcmp or memcpy, which take a size argument, accessed memory beyond that size? It would be a nightmare.

The issue with the writer doesn't have anything to do with reader validation. Yes, the reader currently does validation, but what if I need to construct a JSON document from some other data? I would use yyjson_mut_strn, which has a len argument, yet the writer apparently doesn't care and can access memory outside of this len.

From what I can see, write_string doesn't need to check the bounds on every iteration. You should only need to check the last 3 bytes of the string, that's all, with no performance loss.

Otherwise, do you expect the user to manually validate the input data before every call to yyjson_mut_strn? There are no public APIs in yyjson to do that, either.
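
To show what that manual validation would even involve, here is a minimal UTF-8 validator sketch (utf8_is_valid is not a yyjson API; it's an assumption of what every caller would have to write themselves):

    #include <stdbool.h>
    #include <stddef.h>

    /* Minimal RFC 3629 UTF-8 validator: rejects invalid lead bytes,
       truncated sequences, overlong encodings, surrogates, and code
       points above U+10FFFF. */
    static bool utf8_is_valid(const unsigned char *buf, size_t len) {
        size_t i = 0;
        while (i < len) {
            unsigned char c = buf[i];
            size_t n;
            unsigned char lo = 0x80, hi = 0xBF; /* valid 2nd-byte range */
            if (c < 0x80) { i++; continue; }          /* ASCII */
            else if (c >= 0xC2 && c <= 0xDF) n = 1;   /* 2-byte */
            else if (c == 0xE0) { n = 2; lo = 0xA0; } /* no overlongs */
            else if (c >= 0xE1 && c <= 0xEC) n = 2;
            else if (c == 0xED) { n = 2; hi = 0x9F; } /* no surrogates */
            else if (c == 0xEE || c == 0xEF) n = 2;
            else if (c == 0xF0) { n = 3; lo = 0x90; } /* no overlongs */
            else if (c >= 0xF1 && c <= 0xF3) n = 3;
            else if (c == 0xF4) { n = 3; hi = 0x8F; } /* <= U+10FFFF */
            else return false;                        /* invalid lead byte */
            if (len - i < n + 1) return false;        /* truncated sequence */
            if (buf[i + 1] < lo || buf[i + 1] > hi) return false;
            for (size_t k = 2; k <= n; k++)           /* remaining bytes */
                if (buf[i + k] < 0x80 || buf[i + k] > 0xBF) return false;
            i += n + 1;
        }
        return true;
    }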

pavelxdd commented 2 years ago

I've added a warning to YYJSON_READ_ALLOW_INVALID_UNICODE about security issues when dealing with invalid unicode inside JSON.

Do you mind if I open a separate pull request with a new option YYJSON_WRITE_ALLOW_INVALID_UNICODE which will add an optional unicode validation in writer when YYJSON_WRITE_ESCAPE_UNICODE is enabled?

ibireme commented 2 years ago

Checking 3 extra bytes at the end of each string will certainly cause some performance degradation. And a length check alone is not enough.

Another example:

    size_t size = 6;
    char *buf = malloc(size);
    buf[0] = '"';
    buf[1] = (char)0xF0; /* lead byte of a 4-byte UTF-8 sequence */
    buf[2] = 'a';
    buf[3] = 'b';
    buf[4] = 'c';
    buf[5] = '"';

    yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);

    size_t len;
    char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_SLASHES, &len);
    // "\uD846\uDCA3"

If you don't check each byte, one wrong byte will corrupt the UTF-8 sequences that follow it.
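
For reference, here is where that \uD846\uDCA3 comes from, as a standalone sketch (not yyjson code) that decodes 0xF0 'a' 'b' 'c' the way a non-validating writer would:

    #include <stdio.h>

    int main(void) {
        /* Treat 0xF0 as the lead byte of a 4-byte UTF-8 sequence and
           decode the next 3 bytes as continuation bytes. */
        unsigned char s[4] = { 0xF0, 'a', 'b', 'c' };
        unsigned cp = ((unsigned)(s[0] & 0x07) << 18) |
                      ((unsigned)(s[1] & 0x3F) << 12) |
                      ((unsigned)(s[2] & 0x3F) << 6)  |
                      ((unsigned)(s[3] & 0x3F));      /* cp == 0x218A3 */
        /* Encode as a UTF-16 surrogate pair for \uXXXX escaping. */
        unsigned hi = 0xD800 + ((cp - 0x10000) >> 10);   /* 0xD846 */
        unsigned lo = 0xDC00 + ((cp - 0x10000) & 0x3FF); /* 0xDCA3 */
        printf("\\u%04X\\u%04X\n", hi, lo); /* prints \uD846\uDCA3 */
        return 0;
    }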

When creating a string value, the input is controlled by the user, so I think it's enough to have this hint: https://github.com/ibireme/yyjson/blob/master/src/yyjson.h#L1655

pavelxdd commented 2 years ago

You are right about checking the last 3 bytes.

With the option YYJSON_WRITE_ALLOW_INVALID_UNICODE introduced in #79, the writer will correctly produce the output string \xF0abc.

Edit: did you mean YYJSON_WRITE_ESCAPE_UNICODE, not YYJSON_WRITE_ESCAPE_SLASHES? I agree, proper validation needs to be done on each iteration.

ibireme commented 2 years ago

I tried adding full unicode validation to the writer and ran a benchmark.

M1 Pro, minify JSON:

    github.json   (ASCII)        3.26 GB/s -> 3.11 GB/s (95%)
    twitter.json  (partial CJK)  3.56 GB/s -> 3.33 GB/s (93%)
    poem.json     (full CJK)     3.21 GB/s -> 2.63 GB/s (82%)

I think this performance drop is acceptable. I will add unicode validation to the writer later, so that the reader and writer have the same behavior.

pavelxdd commented 2 years ago

Thanks. Should I close #79 then?

ibireme commented 2 years ago

> Thanks. Should I close #79 then?

Sure, I will re-implement YYJSON_WRITE_ALLOW_INVALID_UNICODE option later.