luizperes / simdjson_nodejs

Node.js bindings for the simdjson project: "Parsing gigabytes of JSON per second"
https://arxiv.org/abs/1902.08318
Apache License 2.0

Performance warning? #42

Open · elibarzilay opened this issue 3 years ago

elibarzilay commented 3 years ago

(Apologies in advance for this issue being kind of all over the place...)

I'm playing with code that processes a potentially large input, which is basically a single array of small-ish objects (a performance dump). I saw #28, and therefore restricted it to lazyParse only: read one line at a time, lazy-parse it, inspect one field, and dump the object back out.

The result is still significantly slower than JSON.parse. So it seems like just invoking the parser has some significant overhead, which makes this possibly related to #35 too? If that's the case, then it would be useful to note this in the README so people would know that this library is a good choice for parsing large objects, but not for lots of smaller ones.
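
For concreteness, here is roughly the loop I'm measuring (the file names and the `dur` field are just placeholders for my actual dump):

```js
const fs = require("fs");
const readline = require("readline");
const simdjson = require("simdjson");

async function filterDump(inPath, outPath) {
  const out = fs.createWriteStream(outPath);
  const rl = readline.createInterface({
    input: fs.createReadStream(inPath),
    crlfDelay: Infinity,
  });

  for await (const line of rl) {
    if (!line.trim()) continue;
    // Lazy-parse the line and inspect a single field; no full JS object is built.
    const event = simdjson.lazyParse(line);
    if (event.valueForKeyPath("dur") > 1000) {
      out.write(line + "\n"); // dump the original line back out
    }
  }
  out.end();
}

filterDump("trace.ndjson", "slow-events.ndjson").catch(console.error);
```

Swapping the lazyParse/valueForKeyPath pair for a plain `JSON.parse(line).dur` is what ends up being noticeably faster for me.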

Of course it would be even better if there was a way to parse large files without keeping the whole thing in memory. (I saw one closed issue about a streaming interface on simdjson, but unclear if it's implemented or not...)

luizperes commented 3 years ago

Hi @elibarzilay, thanks for opening this issue.

> I'm playing with code that processes a potentially large input, which is basically a single array of small-ish objects (a performance dump). I saw #28, and therefore restricted it to lazyParse only: read one line at a time, lazy-parse it, inspect one field, and dump the object back out.

I don't think it is related to #35. As explained on #28, the overhead is in the conversion from C++ objects to JS objects. The only feasible solution seems to be implementing that directly in the engines (such as V8).
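
A rough illustration of where that conversion happens (made-up keys, not your data):

```js
const simdjson = require("simdjson");

const jsonString = '{"answer": 42, "nested": {"values": [1, 2, 3]}}';

// parse() walks the whole document and eagerly builds JS objects for every
// value; that conversion is the overhead discussed in #28.
const obj = simdjson.parse(jsonString);
console.log(obj.nested.values[0]); // 1

// lazyParse() keeps the parsed document on the C++ side; only the value you
// ask for through valueForKeyPath() is converted into a JS value.
const lazy = simdjson.lazyParse(jsonString);
console.log(lazy.valueForKeyPath("nested.values[0]")); // 1
```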

> [...] If that's the case, then it would be useful to note this in the README so people would know that this library is a good choice for parsing large objects, but not for lots of smaller ones.

simdjson itself is faster for both types of objects, but simdjson_nodejs is not. Once we fix #28, that should be fixed as well, so I believe that having both things in the README would be redundant.

> Of course it would be even better if there was a way to parse large files without keeping the whole thing in memory. (I saw one closed issue about a streaming interface on simdjson, but unclear if it's implemented or not...)

Can you tell me which issue that was?

elibarzilay commented 3 years ago

> I don't think it is related to #35. As explained on #28, the overhead is in the conversion from C++ objects to JS objects. The only feasible solution seems to be implementing that directly in the engines (such as V8).

My experiment was to lazy-parse each line and test just one field. Is the cost of converting one number high enough to dominate the runtime?
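
In case it helps, this is the shape of micro-benchmark I have in mind: the same small object parsed a million times, once with JSON.parse and once with lazyParse plus a single field lookup (the object shape and iteration count are made up):

```js
const simdjson = require("simdjson");

const line = '{"name":"event","cat":"blink","ph":"X","ts":123,"dur":42}';
const N = 1000000;

let total = 0;

console.time("JSON.parse");
for (let i = 0; i < N; i++) {
  total += JSON.parse(line).dur;
}
console.timeEnd("JSON.parse");

console.time("lazyParse + valueForKeyPath");
for (let i = 0; i < N; i++) {
  total += simdjson.lazyParse(line).valueForKeyPath("dur");
}
console.timeEnd("lazyParse + valueForKeyPath");

console.log(total); // keep the loops from being optimized away
```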

> simdjson itself is faster for both types of objects, but simdjson_nodejs is not. Once we fix #28, that should be fixed as well, so I believe that having both things in the README would be redundant.

Well, regardless of whether it's the conversion or the parser initialization cost, the README in its current state misleadingly suggests that I should expect some speed benefit. (I did get the fact that constructing the full objects would delay things, which means the comparison with JSON.parse is not really relevant, but my use case is to inspect just one number, which seems like a perfect fit for getting the speed benefits.)

> Of course it would be even better if there was a way to parse large files without keeping the whole thing in memory. (I saw one closed issue about a streaming interface on simdjson, but unclear if it's implemented or not...)

> Can you tell me which issue that was?

I don't remember and can't find it in a quick search, but anyway, I didn't see a way of running it that gives you the elements of an array one by one. Is there one?
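
For concreteness, the closest I can see is indexing into the array by keypath, which at least avoids converting every element, but it still needs the whole file in memory and walks from the root on each call. (The `events` array name and the end-of-array handling below are guesses on my part; I haven't verified what an out-of-range index does.)

```js
const fs = require("fs");
const simdjson = require("simdjson");

// The whole file still has to be read into memory up front.
const doc = simdjson.lazyParse(fs.readFileSync("dump.json", "utf8"));

for (let i = 0; ; i++) {
  let dur;
  try {
    dur = doc.valueForKeyPath(`events[${i}].dur`);
  } catch (e) {
    break; // assuming an out-of-range index errors out
  }
  if (dur === undefined) break; // ...or maybe it comes back undefined instead
  // ...inspect dur...
}
```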