luizperes / simdjson_nodejs

Node.js bindings for the simdjson project: "Parsing gigabytes of JSON per second"
https://arxiv.org/abs/1902.08318
Apache License 2.0
549 stars · 25 forks

Store parser directly instead of extracting document #36

Closed jkeiser closed 4 years ago

jkeiser commented 4 years ago

This gets rid of the std::move(document) by just storing the parser. It obviously requires the parser to be new'd up. It definitely doesn't make up the difference, and it's hard to tell whether it helps at all: some results get better and some get worse. But it could at least eliminate std::move as a source of confusion.

I'd love to see if other folks get more reproducible results out of this :/

This PR

| filename | filesize (MB) | JSON.parse (ms) | simdjson.lazyParse (ms) | JSON.parse (GB/s) | simdjson.lazyParse (GB/s) | X faster |
|---|---|---|---|---|---|---|
| apache_builds.json | 0.13 | 0.516 | 0.345 | 0.25 | 0.37 | 1.50 |
| canada.json | 2.25 | 26.683 | 10.003 | 0.08 | 0.23 | 2.67 |
| citm_catalog.json | 1.73 | 11.995 | 6.859 | 0.14 | 0.25 | 1.75 |
| github_events.json | 0.07 | 0.575 | 0.336 | 0.11 | 0.19 | 1.71 |
| gsoc_2018.json | 3.33 | 20.406 | 10.242 | 0.16 | 0.32 | 1.99 |
| instruments.json | 0.22 | 0.876 | 1.145 | 0.25 | 0.19 | 0.77 |
| marine_ik.json | 2.98 | 21.626 | 38.259 | 0.14 | 0.08 | 0.57 |
| mesh_pretty.json | 1.58 | 7.558 | 18.535 | 0.21 | 0.09 | 0.41 |
| mesh.json | 0.72 | 5.240 | 6.754 | 0.14 | 0.11 | 0.78 |
| numbers.json | 0.15 | 1.072 | 1.289 | 0.14 | 0.12 | 0.83 |
| random.json | 0.51 | 7.396 | 7.477 | 0.07 | 0.07 | 0.99 |
| sf_citylots.json | 189.78 | 2517.173 | 2196.714 | 0.08 | 0.09 | 1.15 |
| twitter.json | 0.63 | 7.515 | 5.221 | 0.08 | 0.12 | 1.44 |
| twitterescaped.json | 0.56 | 2.599 | 4.522 | 0.22 | 0.12 | 0.57 |
| update_center.json | 0.53 | 7.046 | 5.282 | 0.08 | 0.10 | 1.33 |

master

| filename | filesize (MB) | JSON.parse (ms) | simdjson.lazyParse (ms) | JSON.parse (GB/s) | simdjson.lazyParse (GB/s) | X faster |
|---|---|---|---|---|---|---|
| apache_builds.json | 0.13 | 0.520 | 0.326 | 0.24 | 0.39 | 1.60 |
| canada.json | 2.25 | 21.146 | 7.168 | 0.11 | 0.31 | 2.95 |
| citm_catalog.json | 1.73 | 9.960 | 6.146 | 0.17 | 0.28 | 1.62 |
| github_events.json | 0.07 | 0.480 | 0.324 | 0.14 | 0.20 | 1.48 |
| gsoc_2018.json | 3.33 | 16.266 | 9.655 | 0.20 | 0.34 | 1.68 |
| instruments.json | 0.22 | 0.707 | 1.104 | 0.31 | 0.20 | 0.64 |
| marine_ik.json | 2.98 | 20.904 | 22.975 | 0.14 | 0.13 | 0.91 |
| mesh_pretty.json | 1.58 | 6.048 | 7.506 | 0.26 | 0.21 | 0.81 |
| mesh.json | 0.72 | 4.118 | 7.391 | 0.18 | 0.10 | 0.56 |
| numbers.json | 0.15 | 0.903 | 1.210 | 0.17 | 0.12 | 0.75 |
| random.json | 0.51 | 6.169 | 5.681 | 0.08 | 0.09 | 1.09 |
| sf_citylots.json | 189.78 | 1886.632 | 1582.920 | 0.10 | 0.12 | 1.19 |
| twitter.json | 0.63 | 10.139 | 3.606 | 0.06 | 0.18 | 2.81 |
| twitterescaped.json | 0.56 | 2.237 | 2.889 | 0.25 | 0.19 | 0.77 |
| update_center.json | 0.53 | 5.976 | 4.634 | 0.09 | 0.12 | 1.29 |

luizperes commented 4 years ago

Hi @jkeiser,

I checked out your branch (locally) and the performance gets a little worse than before. Here is what the first item of the benchmark looks like:

with std::move

apache_builds.json#simdjson x 4,071 ops/sec ±11.20% (65 runs sampled) => 0.246ms

with this PR

apache_builds.json#simdjson x 2,843 ops/sec ±8.41% (60 runs sampled) => 0.352ms

what we believe we can achieve

apache_builds.json#simdjson x 8,578 ops/sec ±0.56% (93 runs sampled) => 0.117ms

It seems that using `new` on the parser is even more costly than before?

jkeiser commented 4 years ago

Yeah, it depends on which file you're using. Some get worse, some get better. It actually might be real, and I have no explanation. I ran a few more times and threw out the highest results for each, and got this:

This PR

| filename | filesize (MB) | JSON.parse (ms) | simdjson.lazyParse (ms) | JSON.parse (GB/s) | simdjson.lazyParse (GB/s) | X faster |
|---|---|---|---|---|---|---|
| apache_builds.json | 0.13 | 0.516 | 0.345 | 0.25 | 0.37 | 1.50 |
| apache_builds.json | 0.13 | 0.520 | 0.345 | 0.24 | 0.37 | 1.51 |
| canada.json | 2.25 | 25.081 | 9.492 | 0.09 | 0.24 | 2.64 |
| canada.json | 2.25 | 24.650 | 10.805 | 0.09 | 0.21 | 2.28 |
| citm_catalog.json | 1.73 | 11.940 | 6.488 | 0.14 | 0.27 | 1.84 |
| citm_catalog.json | 1.73 | 11.588 | 6.507 | 0.15 | 0.27 | 1.78 |
| github_events.json | 0.07 | 0.575 | 0.336 | 0.11 | 0.19 | 1.71 |
| github_events.json | 0.07 | 0.583 | 0.330 | 0.11 | 0.20 | 1.76 |
| gsoc_2018.json | 3.33 | 20.303 | 10.065 | 0.16 | 0.33 | 2.02 |
| gsoc_2018.json | 3.33 | 18.715 | 7.681 | 0.18 | 0.43 | 2.44 |
| instruments.json | 0.22 | 0.876 | 1.145 | 0.25 | 0.19 | 0.77 |
| instruments.json | 0.22 | 0.874 | 1.479 | 0.25 | 0.15 | 0.59 |
| marine_ik.json | 2.98 | 21.626 | 38.259 | 0.14 | 0.08 | 0.57 |
| marine_ik.json | 2.98 | 20.934 | 29.120 | 0.14 | 0.10 | 0.72 |
| mesh_pretty.json | 1.58 | 7.558 | 18.535 | 0.21 | 0.09 | 0.41 |
| mesh_pretty.json | 1.58 | 7.542 | 16.423 | 0.21 | 0.10 | 0.46 |
| mesh.json | 0.72 | 5.240 | 6.754 | 0.14 | 0.11 | 0.78 |
| mesh.json | 0.72 | 4.772 | 7.805 | 0.15 | 0.09 | 0.61 |
| numbers.json | 0.15 | 1.072 | 1.289 | 0.14 | 0.12 | 0.83 |
| numbers.json | 0.15 | 1.107 | 0.887 | 0.14 | 0.17 | 1.25 |
| random.json | 0.51 | 7.396 | 7.477 | 0.07 | 0.07 | 0.99 |
| random.json | 0.51 | 7.192 | 9.362 | 0.07 | 0.05 | 0.77 |
| sf_citylots.json | 189.78 | 2517.173 | 2196.714 | 0.08 | 0.09 | 1.15 |
| sf_citylots.json | 189.78 | 2256.772 | 2363.868 | 0.08 | 0.08 | 0.95 |
| twitter.json | 0.63 | 7.515 | 5.221 | 0.08 | 0.12 | 1.44 |
| twitter.json | 0.63 | 5.951 | 6.967 | 0.11 | 0.09 | 0.85 |
| twitterescaped.json | 0.56 | 2.599 | 4.522 | 0.22 | 0.12 | 0.57 |
| twitterescaped.json | 0.56 | 2.798 | 3.508 | 0.20 | 0.16 | 0.80 |
| update_center.json | 0.53 | 7.046 | 5.282 | 0.08 | 0.10 | 1.33 |
| update_center.json | 0.53 | 7.062 | 5.715 | 0.08 | 0.09 | 1.24 |

master

| filename | filesize (MB) | JSON.parse (ms) | simdjson.lazyParse (ms) | JSON.parse (GB/s) | simdjson.lazyParse (GB/s) | X faster |
|---|---|---|---|---|---|---|
| apache_builds.json | 0.13 | 0.525 | 0.276 | 0.24 | 0.46 | 1.91 |
| apache_builds.json | 0.13 | 0.525 | 0.276 | 0.24 | 0.46 | 1.91 |
| canada.json | 2.25 | 24.391 | 8.598 | 0.09 | 0.26 | 2.84 |
| canada.json | 2.25 | 24.651 | 8.195 | 0.09 | 0.27 | 3.01 |
| citm_catalog.json | 1.73 | 11.660 | 7.404 | 0.15 | 0.23 | 1.57 |
| citm_catalog.json | 1.73 | 11.541 | 6.114 | 0.15 | 0.28 | 1.89 |
| github_events.json | 0.07 | 0.569 | 0.287 | 0.11 | 0.23 | 1.98 |
| github_events.json | 0.07 | 0.580 | 0.283 | 0.11 | 0.23 | 2.05 |
| gsoc_2018.json | 3.33 | 22.161 | 10.910 | 0.15 | 0.31 | 2.03 |
| gsoc_2018.json | 3.33 | 17.321 | 7.124 | 0.19 | 0.47 | 2.43 |
| instruments.json | 0.22 | 0.882 | 0.801 | 0.25 | 0.27 | 1.10 |
| instruments.json | 0.22 | 0.849 | 0.825 | 0.26 | 0.27 | 1.03 |
| marine_ik.json | 2.98 | 22.045 | 29.273 | 0.14 | 0.10 | 0.75 |
| marine_ik.json | 2.98 | 22.172 | 29.875 | 0.13 | 0.10 | 0.74 |
| mesh_pretty.json | 1.58 | 6.962 | 9.258 | 0.23 | 0.17 | 0.75 |
| mesh_pretty.json | 1.58 | 7.607 | 8.452 | 0.21 | 0.19 | 0.90 |
| mesh.json | 0.72 | 5.100 | 7.846 | 0.14 | 0.09 | 0.65 |
| mesh.json | 0.72 | 4.780 | 8.469 | 0.15 | 0.09 | 0.56 |
| numbers.json | 0.15 | 1.166 | 1.028 | 0.13 | 0.15 | 1.13 |
| numbers.json | 0.15 | 1.150 | 1.052 | 0.13 | 0.14 | 1.09 |
| random.json | 0.51 | 7.382 | 5.496 | 0.07 | 0.09 | 1.34 |
| random.json | 0.51 | 8.424 | 5.623 | 0.06 | 0.09 | 1.50 |
| sf_citylots.json | 189.78 | 2365.972 | 1848.542 | 0.08 | 0.10 | 1.28 |
| sf_citylots.json | 189.78 | 2458.404 | 1846.338 | 0.08 | 0.10 | 1.33 |
| twitter.json | 0.63 | 5.979 | 3.560 | 0.11 | 0.18 | 1.68 |
| twitter.json | 0.63 | 8.324 | 3.601 | 0.08 | 0.18 | 2.31 |
| twitterescaped.json | 0.56 | 2.922 | 4.038 | 0.19 | 0.14 | 0.72 |
| twitterescaped.json | 0.56 | 2.760 | 2.307 | 0.20 | 0.24 | 1.20 |
| update_center.json | 0.53 | 6.891 | 3.342 | 0.08 | 0.16 | 2.06 |
| update_center.json | 0.53 | 7.567 | 4.585 | 0.07 | 0.12 | 1.65 |

jkeiser commented 4 years ago

Regardless, this is what simdjson_nodejs used to do, and it is the only way to avoid std::move.

I don't recommend checking it in unless we see a more consistent win from it; I really think we want to be stealing the document from simdjson, at least to reduce memory pressure.

luizperes commented 4 years ago

I agree. Won't check it in for now. What is discussed on #35 might help this approach, since the cost now seems to be related to allocating a new parser.

jkeiser commented 4 years ago

what we believe we can achieve

apache_builds.json#simdjson x 8,578 ops/sec ±0.56% (93 runs sampled) => 0.117ms

@luizperes where does this come from, BTW? Did simdjson_nodejs used to go this fast?

I went back and looked; the old code did std::move on the parser, which this change doesn't even do:

```cpp
ParsedJson *pjh = new ParsedJson(std::move(pj));
Napi::External<ParsedJson> buffer = Napi::External<ParsedJson>::New(env, pjh,
  [](Napi::Env /*env*/, ParsedJson * data) {
    delete data;
  });
```

I know I must seem like some kind of std::move partisan here, I'm really not. I just suspect we're jumping at it because it's the most obscure thing in the code (which it definitely is!) :)

luizperes commented 4 years ago

@luizperes where does this come from, BTW? Did simdjson_nodejs used to go this fast?

@jkeiser I know it can (possibly) go that fast (and maybe even a little faster than that, will try to explain below). I got that value from the code below (using new document()):

```cpp
Napi::External<dom::document> buffer = Napi::External<dom::document>::New(env, new dom::document(),
  [](Napi::Env /*env*/, dom::document * doc) {
    delete doc;
  });
```

simdjson_nodejs has three methods, as I think you know (documented here): isValid, parse and lazyParse. While parse is slow, as explained on #5, isValid and lazyParse should have nearly the same cost, because they essentially do the same thing: validate the JSON. The difference is that lazyParse also keeps a reference to the document and exposes the fn valueForKeyPath.

Also, you see that, up to here, lazyParse is equivalent to isValid. When I benchmark the isValid instead of lazyParse, I get:

isValid

apache_builds.json#simdjson x 8,835 ops/sec ±0.54% (93 runs sampled) => 0.113ms (updated; the earlier value was wrong)

Keeping a reference should be very fast, but as there is some extra work (of course), I believe that the cost of doing a new dom::document() should be close to our final expected result.

Let me know if you see something wrong in my explanation! :)

jkeiser commented 4 years ago

I know it can (possibly) go that fast (and maybe even a little faster than that, will try to explain below). I got that value from the code below (using new document()):

I see what you're saying. I don't know that we've made a small enough change to say it's the std::move though. There are big things the compiler could do if it figures out that all External instances have identical documents with nullptrs in them. In particular, it could potentially compile the destructor down to nothing. My supposition to this point has been that attaching a destructor function to Napi::External<> is the thing that makes things slow (a lot of GC languages optimize the crap out of things with no destructors). I think to compare apples to apples we need to actually fill in the document.

I pushed up a jkeiser/std-move-experiment branch of simdjson_nodejs here that removes std::move(). It also makes clear what I believe std::move is doing under the covers by removing unique_ptr. Now the document is just two plain old pointers, and construction copies the pointers and nulls the original document's pointers. Might be a good starting point for experimentation, at least.

I don't see a significant difference between this and master. Maybe you will, though!

For fun, you can also go back one commit on that branch, and see what difference std::move makes by itself (the previous commit just contains the changes to turn the document into raw pointers and get rid of the hidden work involved in unique_ptr).

While parse is slow, as explained on #5, isValid and lazyParse should have nearly the same cost, because they essentially do the same thing: validate the json, only the lazyParse keeps a reference to the document and exposes the fn valueForKeyPath.

Be careful here :) The compiler is quite capable of deleting huge amounts of code if it can figure out that you're not using its results. I once thought I'd doubled simdjson's speed with a change of mine, until I fixed a bug: I had forgotten to check the utf-8 validation's bool error when deciding whether to return an error result. At that point, the compiler simply didn't run utf-8 validation, even though it's deeply interwoven into the code! Well, I assume that's what happened; when I added if (utf8.error) { return UTF8_ERROR; } at the end of the parse, performance returned to normal.

I can't say that's what's happening here. But I can say that there's a lot of stuff the compiler could do given that it knows no one else could possibly be using the results of that parse.

luizperes commented 4 years ago

I can't say that's what's happening here. But I can say that there's a lot of stuff the compiler could do given that it knows no one else could possibly be using the results of that parse.

That is correct. It could be that after the compiler optimizations they were computationally equivalent codes. I will take a look at your experiments, thanks a lot!

luizperes commented 4 years ago

FYI @jkeiser,

when I do something like:

```cpp
Napi::External<void> buffer = Napi::External<void>::New(env, static_cast<void *>(parser.doc.tape.release()),
  [](Napi::Env /*env*/, void * obj) {
    uint64_t *o = static_cast<uint64_t *>(obj);
    delete[] o; // the tape is an array allocation, so delete[] is needed here
  });
```


I get: apache_builds.json#simdjson x 7,290 ops/sec ±0.76% (93 runs sampled) => 0.137ms

(Not a document, but it looks like the computation is bound by the copying time)

luizperes commented 4 years ago

As for your branch jkeiser/std-move-experiment, is there a way for doc to be a pointer to a dom::document? (I didn't change because I am not sure if that would be accepted in the upstream)

luizperes commented 4 years ago

Complementing my question: copying by reference would be much faster than copying by value. I think that keeping new document(parser.doc) is still copying the document by value, and that is why it is still slow. When I do &parser.doc (invalid, of course), I see said performance improvements. Even if the document were a unique_ptr, I think I would be able to work with it (as in my example above).

jkeiser commented 4 years ago

As for your branch jkeiser/std-move-experiment, is there a way for doc to be a pointer to a dom::document? (I didn't change because I am not sure if that would be accepted in the upstream)

We're talking about doing something like that. It's worth seeing what the performance is, at least!

Note: you shouldn't really make changes to the simdjson.h and simdjson.cpp you have here, except if you're just experimenting like this.

Complementing my question, it is just that copying by reference would be much faster than copying by value. I think that keeping new document(parser.doc) is still copying the document by value and that is why it is still slow.

Yes, it's copying by value. A reference copy would be a single word write; this involves a 2-word malloc and 4 word writes (2 of them to null out the old pointers).

Copying by reference will absolutely be faster. However, it seems really, REALLY unlikely that a 2-word malloc and 3 stores are dominating the runtime, especially given that they happen exactly once per parse. It seems more likely to be a caching or inlining effect. But hypotheses are cheap :) It will come down to experimenting and measuring until it's nailed down.

When I do &parser.doc (invalid, of course), I see said performance improvements. Even if the document was an unique_ptr, I think I would be able to work with it (such as my example above)

Yep. Again, this could either be because of the 2-word malloc and 3 extra word stores, or an effect on inlining / cache since the pointers now live in two different places over the life of the document, or something else we haven't thought of.

jkeiser commented 4 years ago

I'm going to close this so it doesn't accidentally get merged, but am happy to continue talking about it :) Note I haven't forgotten about bindings, but I plan to focus on making streaming parsing work for a little bit before I come back to it :)

luizperes commented 4 years ago

Sounds good, thank you @jkeiser