Daenyth opened 5 years ago
@Daenyth Hi, Gavin!
Thanks for the question!
If you are talking about the example from the TechEmpower benchmark, then the usage of jsoniter-scala there is overkill. According to the profiler reports, it spends less than 0.5% of CPU time in the writeToArray routine.
Using async-profiler you can attach to the http4s server and see clearly what is happening under a different workload. Possibly, for streaming long lists of JSON values, your parse method will be more suitable, but in any case efficiency should be measured.
I'm less concerned with CPU usage and more concerned with memory usage. I'm not necessarily suggesting that the streaming approach should be the only one, but rather that it would be good to have it available, for cases where the input stream might be too large to fit into memory. As far as I'm aware, the TechEmpower benchmark doesn't exercise massive streaming response bodies, but in practice (at least for me) it's not uncommon, especially if processing data from non-http sources, like a stored json file somewhere.
Currently, jsoniter-scala doesn't guarantee bounded memory usage when parsing to an arbitrary data structure.
There are configuration options which can limit the values of bit sets or the size of maps, or disallow recursive data structures for derived codecs, but that is not enough to solve the problem in the general case.
The scanJsonValuesFromStream and scanJsonArrayFromStream routines were designed for cases when you need to parse data from trusted sources, while in other cases (when parsing from a buffer) the user can (and should) control the size of the input.
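For illustration, a minimal sketch of that scanning style, assuming a hypothetical Item case class and a jsoniter-scala version where JsonCodecMaker.make needs no explicit config:

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
import java.io.ByteArrayInputStream

case class Item(id: Int, name: String) // hypothetical payload type

implicit val itemCodec: JsonValueCodec[Item] = JsonCodecMaker.make

val in = new ByteArrayInputStream(
  """{"id":1,"name":"a"} {"id":2,"name":"b"}""".getBytes("UTF-8"))

// The callback runs once per parsed value, so only one value is held
// in memory at a time. Returning `true` continues the scan.
scanJsonValuesFromStream(in) { (item: Item) =>
  println(item)
  true
}
```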
I'm not so concerned about malicious input; I just want to, in the happy path, avoid loading the entire input data stream into memory at once. Are the InputStream-based methods not appropriate for that?
@Daenyth you can use the InputStream-based methods for trusted input.
For better throughput, pass a ReaderConfig/WriterConfig to them with tuned preferred sizes for the internal read/write buffers.
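A hedged sketch of such tuning (the 32 KiB size is an arbitrary illustration, and the withPreferredBufSize builders are assumed to be available in the version you use):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._

// Larger internal buffers can reduce the number of read/write calls
// on the underlying streams; measure before committing to a size.
val readerCfg = ReaderConfig.withPreferredBufSize(32 * 1024)
val writerCfg = WriterConfig.withPreferredBufSize(32 * 1024)

// Then pass them to the InputStream/OutputStream-based routines, e.g.:
// val value = readFromStream[MyType](inputStream, readerCfg)
// writeToStream(value, outputStream, writerCfg)
```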
I think I have a similar use case. I have a file (in the GB range) where each line contains a complete JSON object of the same type (e.g. the same case class), but each line can be of arbitrary length (e.g. it has a Map). I'd like to lazily read from the file, but it doesn't seem like readFromStream supports this use case, and scanJsonValuesFromStream is meant to read the entire source in one pass.
EDIT: I think we're meant to create more complicated consumers for scanJsonValuesFromStream to support the use case I described?
```scala
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false))
readFromStream(inputStream, ReaderConfig.withCheckForEndOfInput(false)) // Doesn't read the "2nd line"
```
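If the lines really are just whitespace-separated JSON values, one alternative sketch is scanJsonValuesFromStream with a callback that stops early, so laziness is expressed by returning false (the Row type, file path, and batch size below are all hypothetical):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
import java.io.FileInputStream

case class Row(id: Long, attrs: Map[String, String]) // hypothetical line shape

implicit val rowCodec: JsonValueCodec[Row] = JsonCodecMaker.make

val in = new FileInputStream("data.jsonl") // hypothetical path
try {
  var taken = 0
  scanJsonValuesFromStream(in) { (row: Row) =>
    taken += 1
    println(row)   // stand-in for real per-line processing
    taken < 1000   // returning `false` aborts the scan without reading the rest
  }
} finally in.close()
```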
@steven-lai Hi Steven! Please open a separate issue if your case is not related to fs2 integration.
I'm happy to help you find the best solution. A good starting point would be samples of your input and examples of the data structures that you use to handle them after parsing.
As an example, we can define a manually written codec that returns scala.Unit but accepts some callback function in its constructor to redirect parsed data structures to it. That would allow handling repetitive JSON structures that are nested in different ways (not just line-feed-separated values or JSON arrays of values).
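A rough sketch of that idea, assuming a hypothetical Event element type; treat the exact JsonReader method names as assumptions to verify against the jsoniter-scala version you use:

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

case class Event(name: String) // hypothetical element type

val eventCodec: JsonValueCodec[Event] = JsonCodecMaker.make

// Parses a JSON array of `Event`s but returns Unit, pushing each
// element to the callback as soon as it is decoded, so the whole
// array is never materialized in memory.
final class ArrayCallbackCodec(onEvent: Event => Unit) extends JsonValueCodec[Unit] {
  def decodeValue(in: JsonReader, default: Unit): Unit =
    if (in.isNextToken('['.toByte)) {
      if (!in.isNextToken(']'.toByte)) {
        in.rollbackToken()
        while ({
          onEvent(eventCodec.decodeValue(in, eventCodec.nullValue))
          in.isNextToken(','.toByte)
        }) ()
        if (!in.isCurrentToken(']'.toByte)) in.decodeError("expected ']' or ','")
      }
    } else in.decodeError("expected '['")

  def encodeValue(x: Unit, out: JsonWriter): Unit = out.writeNull() // encoding unused here

  def nullValue: Unit = ()
}

// Usage sketch: readFromStream(inputStream)(new ArrayCallbackCodec(e => println(e)))
```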
I looked at the http4s example, but it appears to read the entire input into memory rather than streaming it.
This appears to work so far, but it would be nice to have official support that is maintained over time (and hopefully improved!).