simdjson / simdjson

Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks
https://simdjson.org
Apache License 2.0
19.06k stars 1.01k forks source link

spurious error when reusing a document with on demand (sticky error bit) #1601

Closed marmeladema closed 3 years ago

marmeladema commented 3 years ago

Hello!

I am working on a rust project that is using simdjson internally and in order to easily interface with it, I have created a small C shim where all simdjson classes are hidden and everything is exposed through abstract pointers and new/free like functions.

Given the current explicit warning in the documentation:

Only one iteration at a time can happen per parser, and the parser must be kept alive during iteration to ensure intermediate buffers can be accessed. Any document must be destroyed before you call parse() again or destroy the parser.

It means that I have to free and allocate a new document each time I need to call .iterate() on the parser instance.

I would like not to have to pay that allocation cost and re-use a document instance each time I need to parse something. Would doing the following be considered good practice?

document->~document();
*document = simdjson::ondemand::document{};
error = parser->iterate(...).get(*document);

Additionally, I am definitely not C++ fluent but I've read here and there that calling destructors explicitly was a bad design. Is there another API available in order to achieve this?

Thank you in advance for your help!

marmeladema commented 3 years ago

It seems that the code snippet I mentioned before does not actually work.

Here is an example of trying to re-use a document object after having encountered a parsing error:

#include <iostream>

#include "simdjson.h"

using namespace std;

struct simdjson_context {
    simdjson::ondemand::parser parser;
    simdjson::ondemand::document document;
};

int main() {
    simdjson::error_code error;
    simdjson_context *context = new simdjson_context();

    auto json = R"({"key": "value")"_padded;

    cout << "Step 1: parsing truncated document should fail" << endl;

    {
        auto error = context->parser.iterate(json.data(), json.size(), json.size() + simdjson::SIMDJSON_PADDING).get(context->document);

        if (error == simdjson::SUCCESS) {
            simdjson::ondemand::object object;
            error = context->document.get_object().get(object);

            if (error == simdjson::SUCCESS) {
                // Never reached because an error is reported
                abort();
            } else {
                cout << "get_object() error: " << simdjson::error_message(error) << endl;
            }
        } else {
            cout << "iterate() error: " << simdjson::error_message(error) << endl;
        }
    }

    json = R"({"key": "value"})"_padded;

    cout << "Step 2: parsing a valid document should succeed" << endl;

    {
        context->document.~document();
        context->document = simdjson::ondemand::document();
        auto error = context->parser.iterate(json.data(), json.size(), json.size() + simdjson::SIMDJSON_PADDING).get(context->document);

        if (error == simdjson::SUCCESS) {
            simdjson::ondemand::object object;
            error = context->document.get_object().get(object);

            assert(error == simdjson::SUCCESS);

            simdjson::ondemand::field field;
            for (auto it = object.begin(); it!=object.end(); ++it) {
                error = (*it).get(field);
                if (error == simdjson::SUCCESS) {
                    cout << "key = " << field.key() << endl;
                } else {
                    cout << "get() error: " << simdjson::error_message(error) << endl;
                }
            }
        } else {
            cout << "iterate() error: " << simdjson::error_message(error) << endl;
        }
    }

    return 0;
}

And it prints:

Step 1: parsing truncated document should fail
get_object() error: The JSON document has an improper structure: missing or superfluous commas, braces, missing keys, etc.
Step 2: parsing a valid document should succeed
get() error: The JSON document has an improper structure: missing or superfluous commas, braces, missing keys, etc.

Step 1 behaves properly, an error is reported but when trying to re-use the same document instance on new data (valid this time), and error is reported again whereas it should not. I understand that it's probably because re-using a document is not meant to work but having such a capability would be really helpful.

lemire commented 3 years ago

I would like not to have to pay that allocation cost and re-use a document instance each time I need to parse something.

A document in simdjson fits on the stack... I think it is about 40 bytes in total.

marmeladema commented 3 years ago

Sure but since I cannot expose C++ objects to a different language, I cannot put the document object in the stack since rust would have to know it's memory layout. That's why I had to hide all simdjson objects and access them only through opaque pointers.

If I understand correctly, this approach does not really fit the on-demand API?

lemire commented 3 years ago

You definitively can reuse the same document instance...

    auto json = "{\"coordinates\":[{\"x\":1.1,\"y\":2.2,\"z\":3.3}]}"_padded;
    auto jsonbad = "{\"coordinates:[{\"x\":1.1,\"y\":2.2,\"z\":3.3}]}"_padded;

    ondemand::document doc;

    ondemand::parser parser;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
    parser.iterate(jsonbad).get(doc) == UNCLOSED_STRING;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
    parser.iterate(jsonbad).get(doc) == UNCLOSED_STRING;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
lemire commented 3 years ago

I would strongly recommend you avoid such constructions in C++...

context->document.~document();
context->document = simdjson::ondemand::document();
lemire commented 3 years ago

I would strongly recommend you avoid such constructions in C++...

And it should be entirely unnecessary.

marmeladema commented 3 years ago

I would strongly recommend you avoid such constructions in C++...

context->document.~document();
context->document = simdjson::ondemand::document();

Ok, I won't do it then :+1:

You definitively can reuse the same document instance...

    auto json = "{\"coordinates\":[{\"x\":1.1,\"y\":2.2,\"z\":3.3}]}"_padded;
    auto jsonbad = "{\"coordinates:[{\"x\":1.1,\"y\":2.2,\"z\":3.3}]}"_padded;

    ondemand::document doc;

    ondemand::parser parser;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
    parser.iterate(jsonbad).get(doc) == UNCLOSED_STRING;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;
    parser.iterate(jsonbad).get(doc) == UNCLOSED_STRING;
    parser.iterate(json).get(doc) == simdjson::SUCCESS;

If you look at the code example I gave and remove the 2 lines that are supposed to reset the document object in step 2, you should be able to reproduce the problem. In certain circumstances, the document is not re-usable because it fails to parse a perfectly valid document. Is it considered a bug then?

lemire commented 3 years ago

Is it considered a bug then?

It might be, yes. I am checking things out.

Note however that reusing the document should never be an issue.

It is possible that you have uncovered an issue. Let us see.

lemire commented 3 years ago

Ok, I won't do it then

Technically it works, it is just not idiomatic. In our case, they should not cause any harm.

lemire commented 3 years ago

Yep. You found a bug I think.

jkeiser commented 3 years ago

While we sort out this bug, a Rust lifetime thing :). You should be able to convert any strings you parse using simdjson into Rust &str without allocating (they are guaranteed utf-8 and have a pointer and length, just like str). simdjson stores the strings in the parser (which is basically doing bump allocation). So they have the same limitations as the document. All this means if you manage lifetimes right, and are willing to parse just one doc at a time, you can avoid any allocation whatsoever while parsing.

The input string must be retained as well, during iteration, but once you have finished iteration/parsing you don't need it anymore.

Forgive any syntax issues, I'm working off 2-year-old memory of Rust:

impl Parser {
  // Parsed document (and any strings inside it) must be outlived by the mutable Parser ref
  // (and the input string as well, if the ParsedDocument does lazy iteration)
  Result<ParsedDocument<'a>, SimdjsonError> parse<'a>(parser: &mut 'a Self, json: &'a str);
}
lemire commented 3 years ago

Note that creating a whole new document instance solves the bug... I am not sure yet where the bug is.

jkeiser commented 3 years ago

@lemire I'm assuming the failure is if the first JSON doc has an error, and the next does not? (Just because I have trouble imagining we don't have a test that reuses the document somewhere--I recall writing them, even.)

marmeladema commented 3 years ago

Doing a memset to 0 of the document object also "fixes" the bug but I am not sure if the document is actually left in a good state.

jkeiser commented 3 years ago

@lemire the issue is that the error field doesn't get copied over when you copy a json_iterator: https://github.com/simdjson/simdjson/blob/master/include/simdjson/generic/ondemand/json_iterator-inl.h#L5-L20

jkeiser commented 3 years ago

So when you start the next iteration it still thinks there is an error (the document inside the parser object has error = SUCCESS, but we didn't copy it and left the original error around).

jkeiser commented 3 years ago

@marmel the memset is an ok workaround in this case, since it sets the error to 0 (SUCCESS) and we're overwriting everything else when you call iterate().get()--not that I would recommend the workaround to anyone else).

lemire commented 3 years ago

@jkeiser I did figure it out eventually.

lemire commented 3 years ago

@jkeiser This is a really nasty bug. I was assuming that...

doc = ondemand::document();

would clear the document... surely it would, right? Nope. One field was left unchanged (sticky). Damn it.

I have added a to_debug_string() field to the document instances so we can better figure things out in the future.

(This does not show up at all when logging.)

lemire commented 3 years ago

I will issue a patch release since it can be kind of a nasty bug.