simdjson / simdjson

Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks
https://simdjson.org
Apache License 2.0

ndjson - write overflow #1309

Closed pauldreik closed 3 years ago

pauldreik commented 3 years ago

Use the fuzzer in https://github.com/simdjson/simdjson/pull/1304

To reproduce, check out that branch and then run:

fuzz/build_fuzzer_variants.sh
mkdir -p out/ndjson
build-sanitizers-O0/fuzz/fuzz_ndjson out/ndjson

It should crash easily, within seconds.

Threads=On

base64 of non-minimized crashing input:

CQA5OAo5CgoKCiIiXyIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiJiIiIiIiIiIi
IiIiIiIiIiIiIiIiIiIiIiIiXyIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiJiIi
IiIiIiIiIiIiIiIiIiLb29vb29vb29vb29vb29vz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz
29vb29vb29vbIiIiIiIiIiIiIiIiIiIiIiIiIiIiJiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIi
IiIiIiIiIiIiIiIiIiIiIiYiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiI=

Sanitizer output:

WRITE of size 32 at 0x60600005c643 thread T0
    #0 0x7f6f8186a3dc in simdjson::haswell::(anonymous namespace)::simd::base8_numeric<unsigned char>::store(unsigned char*) const /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../include/simdjson/haswell/simd.h:97:65
    #1 0x7f6f8186a3dc in simdjson::haswell::(anonymous namespace)::backslash_and_quote::copy_and_find(unsigned char const*, unsigned char*) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../include/simdjson/haswell/stringparsing.h:35
    #2 0x7f6f8186a3dc in simdjson::haswell::(anonymous namespace)::stringparsing::parse_string(unsigned char const*, unsigned char*) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../include/simdjson/generic/stringparsing.h:83
    #3 0x7f6f8186a3dc in simdjson::haswell::(anonymous namespace)::stage2::tape_builder::visit_string(simdjson::haswell::(anonymous namespace)::stage2::json_iterator&, unsigned char const*, bool) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/generic/stage2/tape_builder.h:148
    #4 0x7f6f8186a3dc in simdjson::haswell::(anonymous namespace)::stage2::tape_builder::visit_root_string(simdjson::haswell::(anonymous namespace)::stage2::json_iterator&, unsigned char const*) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/generic/stage2/tape_builder.h:158
    #5 0x7f6f8186a3dc in simdjson::error_code simdjson::haswell::(anonymous namespace)::stage2::json_iterator::visit_root_primitive<simdjson::haswell::(anonymous namespace)::stage2::tape_builder>(simdjson::haswell::(anonymous namespace)::stage2::tape_builder&, unsigned char const*) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/generic/stage2/json_iterator.h:284
    #6 0x7f6f8186a3dc in simdjson::haswell::(anonymous namespace)::stage2::tape_builder::visit_root_primitive(simdjson::haswell::(anonymous namespace)::stage2::json_iterator&, unsigned char const*) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/generic/stage2/tape_builder.h:97
    #7 0x7f6f8186a3dc in simdjson::error_code simdjson::haswell::(anonymous namespace)::stage2::json_iterator::walk_document<true, simdjson::haswell::(anonymous namespace)::stage2::tape_builder>(simdjson::haswell::(anonymous namespace)::stage2::tape_builder&) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/generic/stage2/json_iterator.h:140
    #8 0x7f6f8186a3dc in simdjson::error_code simdjson::haswell::(anonymous namespace)::stage2::tape_builder::parse_document<true>(simdjson::haswell::dom_parser_implementation&, simdjson::dom::document&) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/generic/stage2/tape_builder.h:93
    #9 0x7f6f8186a3dc in simdjson::haswell::dom_parser_implementation::stage2_next(simdjson::dom::document&) /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../src/haswell/dom_parser_implementation.cpp:152
    #10 0x537635 in simdjson::dom::document_stream::next() /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../include/simdjson/dom/document_stream-inl.h:175:35
    #11 0x52fdcd in simdjson::dom::document_stream::iterator::operator++() /home/paul/code/delaktig/simdjson/build-sanitizers-O0/../include/simdjson/dom/document_stream-inl.h:125:10

lemire commented 3 years ago

It seems that the following returns an error condition...

     simdjson::dom::parser parser;
     simdjson::padded_string input = decode_base64("CQA5OAo5CgoKCiIiXyIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiJiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiXyIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiJiIiIiIiIiIiIiIiIiIiIiLb29vb29vb29vb29vb29vz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz8/Pz29vb29vb29vbIiIiIiIiIiIiIiIiIiIiIiIiIiIiJiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiYiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiI=");
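     print_hex(input); // print_hex and decode_base64 are local helpers, not shown here
     simdjson::dom::document_stream stream;
     auto error = parser.parse_many(input).get(stream); // appears to report an error for this input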

In hexadecimal, the base64 translates to

09 00 39 38 0A 39 0A 0A 0A 0A 22 22 5F 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 26 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 5F 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 26 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 F3 DB DB DB DB DB DB DB DB DB 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 26 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 26 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 00

See https://cryptii.com/pipes/base64-to-hex to verify. This is clearly invalid JSON.
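For anyone who prefers to check the decoding locally rather than through a website, here is a minimal sketch of the two helpers. The names `decode_base64_sketch` and `to_hex_sketch` are hypothetical and are not part of simdjson. Unlike the `decode_base64` used in the snippet above, this version returns a plain `std::string` rather than a `simdjson::padded_string`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode standard base64 (RFC 4648 alphabet). Characters outside the
// alphabet (such as newlines) are skipped, and decoding stops at the
// first '=' padding character.
std::string decode_base64_sketch(const std::string &in) {
  static const std::string alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  std::uint32_t acc = 0; // bit accumulator; only the low bits are used
  int bits = 0;          // number of not-yet-emitted bits held in acc
  for (char c : in) {
    if (c == '=') { break; }
    const std::size_t v = alphabet.find(c);
    if (v == std::string::npos) { continue; }
    acc = (acc << 6) | static_cast<std::uint32_t>(v);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(static_cast<char>((acc >> bits) & 0xFF));
    }
  }
  return out;
}

// Render bytes as space-separated, upper-case hexadecimal pairs.
std::string to_hex_sketch(const std::string &bytes) {
  static const char digits[] = "0123456789ABCDEF";
  std::string out;
  for (std::size_t i = 0; i < bytes.size(); i++) {
    const unsigned char b = static_cast<unsigned char>(bytes[i]);
    if (i != 0) { out.push_back(' '); }
    out.push_back(digits[b >> 4]);
    out.push_back(digits[b & 0x0F]);
  }
  return out;
}
```

Feeding the first eight base64 characters of the crashing input (`CQA5OAo5`) through both helpers gives `09 00 39 38 0A 39`, which matches the start of the hex dump above.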

The stack trace shows the failure in simdjson::dom::document_stream::iterator::operator++(), but execution should never reach the iterator, since parse_many already reports an error.

Running the tests with sanitizers does not appear to reveal an issue...

$ git checkout dlemire/document_stream_fuzz_issues
$ cmake -DSIMDJSON_SANITIZE=ON -Bstream_issues
$ cmake --build stream_issues --target document_stream_tests
$ ./stream_issues/tests/document_stream_tests

Reference: https://github.com/simdjson/simdjson/pull/1318/files

lemire commented 3 years ago

One could accidentally set the batch size to an unreasonable value (e.g., 0). Let us guard against it: https://github.com/simdjson/simdjson/pull/1319
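As a rough illustration of what such a guard can look like, here is a sketch that clamps a requested batch size to a lower bound. The constant `kMinimalBatchSize`, its value, and the function name `sanitize_batch_size` are assumptions made for this example. They are not taken from the linked pull request.

```cpp
#include <cstddef>

// Hypothetical lower bound, chosen for illustration only.
constexpr std::size_t kMinimalBatchSize = 32;

// Return the requested batch size, raised to the lower bound when it is
// too small. A caller passing 0 therefore never ends up with an empty batch.
constexpr std::size_t sanitize_batch_size(std::size_t requested) {
  return requested < kMinimalBatchSize ? kMinimalBatchSize : requested;
}
```

Clamping silently is one option. Returning an error code for out-of-range values would be another, at the cost of forcing every caller to handle it.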

lemire commented 3 years ago

@pauldreik

It seems that we may have had a thread-safety issue, in the sense that the buffer could be deleted before the thread was stopped.
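To make the lifetime hazard concrete, here is a generic sketch of one way to avoid it: the object that owns the buffer also owns the worker thread, and it joins the thread before the buffer is destroyed. The class `buffered_worker_sketch` is hypothetical and does not reflect simdjson's actual worker or `document_stream` code.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

class buffered_worker_sketch {
public:
  explicit buffered_worker_sketch(std::size_t size)
      : buffer_(size, 0), worker_([this] { fill(); }) {}

  // Stop and join the worker while buffer_ is still alive.
  ~buffered_worker_sketch() { stop(); }

  // Block until the worker has finished on its own.
  void wait() {
    if (worker_.joinable()) { worker_.join(); }
  }

  // Ask the worker to stop early, then wait for it to exit.
  void stop() {
    stop_requested_.store(true);
    wait();
  }

  std::size_t bytes_written() const { return written_.load(); }

private:
  // Runs on the worker thread and writes into the shared buffer.
  void fill() {
    for (std::size_t i = 0; i < buffer_.size() && !stop_requested_.load(); i++) {
      buffer_[i] = 1;
      written_.fetch_add(1);
    }
  }

  std::vector<unsigned char> buffer_;
  std::atomic<bool> stop_requested_{false};
  std::atomic<std::size_t> written_{0};
  // Declared last, so the thread starts only after the other members are
  // initialized.
  std::thread worker_;
};
```

The important part is the destructor: without the join, the buffer could be released while the worker is still writing to it, which is the kind of write-after-free a sanitizer would flag.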

lemire commented 3 years ago

Closing (assumed fixed).