Closed · nemanja-tosic closed this 1 year ago
@ml054 I've updated the pull request as best I could to support JSONL. I've added a JSONL flag to the conventions so that the change is non-breaking.
Can you take a look at the TODOs I've left for the conventions? I'm not sure what to do there. Thanks!
Hi, @nemanja-tosic
Thanks for the contribution. I've scheduled a review for the beginning of next week.
@ml054 do you have any availability in the next couple of days? I'd really like to get these changes in. Thanks!
I admit we are in a busy period (end of year), but I will try to find spare time for the review. I'm sorry for the inconvenience.
@nemanja-tosic
objectKeysTransform is used to convert the casing of an object in one shot:
Under the hood it uses ObjectUtil.transformObjectKeys, which accepts an object and options (target casing, usually camel, plus paths to ignore).
I assume the performance of those might be improved by providing a reviver to JSON.parse, since we do synchronous parsing here. When we are dealing with documents and detect that we don't need to touch the casing, we can reuse the existing objects, which saves memory allocations.
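The reviver idea could look something like this. A minimal sketch only: `toCamel` and `parseWithCasingReviver` are hypothetical names for illustration; in the client the real transform lives in ObjectUtil.transformObjectKeys. Since JSON.parse calls the reviver bottom-up, each object can be rewritten (or reused, when no key needs renaming) as it is built:

```javascript
// Hypothetical helper: PascalCase -> camelCase for a single key.
function toCamel(key) {
  return key.length ? key[0].toLowerCase() + key.slice(1) : key;
}

// Sketch: rename keys during JSON.parse itself via a reviver,
// avoiding a second full pass over the parsed object graph.
function parseWithCasingReviver(json) {
  return JSON.parse(json, (key, value) => {
    if (value && typeof value === "object" && !Array.isArray(value)) {
      const out = {};
      let changed = false;
      for (const k of Object.keys(value)) {
        const ck = toCamel(k);
        if (ck !== k) changed = true;
        out[ck] = value[k];
      }
      // Reuse the existing object when no key needed renaming,
      // saving an allocation (the idea mentioned above).
      return changed ? out : value;
    }
    return value;
  });
}
```

For example, `parseWithCasingReviver('{"Name":"a","Tags":[{"Id":1}]}')` yields `{ name: "a", tags: [{ id: 1 }] }` in a single parse.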
For requests which don't operate on full objects we use TransformKeysJsonStream. Indeed, once we introduce JSONL we won't need this kind of transformation, since we can read a line, operate on a single object, and parse it synchronously. However, TransformKeysJsonStream has tightly coupled logic responsible for choosing the proper casing transformation strategy based on the conventions and the current position in the parsing stack.
A first step would be to emit one object from the stream at a time and apply the casing conventions using the objectKeysTransform approach (by converting the profile, see TransformJsonKeysProfile, into an objectKeysTransform call).
Thanks for that. Unfortunately I need some help here: the casing transforms are making it difficult for me to finish the PR, and I would appreciate it if you could do that part.
As a side note, I wasn't sure if conventions are the best way to go about this; I'm guessing that's the right place for it?
Yes, the conventions are the centralized point where you can customize entity casing. That's the reason we need them during serialization/deserialization.
Moved to: #391
(Work in progress; I'm trying to capture all the details right now and am looking for feedback on the points below.)
Scope:
Parsing JSONL
In stream-json it is possible to parse JSONL either through the default Parser with the `jsonStreaming` flag, or with a dedicated JSONL Parser. In general, as the docs say, the JSONL Parser is more performant; for me it was at least twice as fast locally. To realize the potential gain from JSONL, I believe the JSONL Parser must be used: testing locally (my results might be wrong for some unknown reason), enabling the `jsonStreaming` flag was actually slower than not enabling it.
The JSONL Parser, on the other hand, does not support the same filters and keeps an entire line in memory. I believe this is a good thing, given that clients will always want to have at least one whole document in memory, and that the default Parser is very slow when dealing with a large document set.
However, having a whole object in memory means that we need to use ObjectKeyCaseTransformStream for casing conventions. The library currently uses one of two stream Transforms to apply casing conventions: ObjectKeyCaseTransformStream (readable-stream) and TransformKeysJsonStream (stream-json). TODO: I'm not sure if the two are kept in sync, so if we need both, I'd probably figure out a way to use only one.
That being said, I'm not sure where the need for parseAsyncJson comes from when not streaming: even for a very large document, clients indicate that they will keep the entire document in memory when not using streaming, so even if we assemble the document bit by bit while parsing the JSON, the end result is still the entire object. My question would be: could we parse the whole body synchronously in that case? That would negate the need for two key transformers, because we would always be dealing with an entire document.
At any rate, we need to optimize applying the casing conventions for JSONL, because according to my tests it increases the time needed to process a request by at least 50%.
If easier, I'm always up for a Zoom call, so let me know if that would help move the process along.