Closed bsless closed 2 months ago
This adds support for reading and writing large sequences without materializing them in memory. A few choices I made that I'm not sure about:
- Using .writeValuesAsArray and not writeValues, but I wanted to ensure equivalency between input and output sources.

@ikitommi do you know why the workflow failed when setting up the environment? The error is from tar, of all things. Did I do something wrong or does it need to be rerun?
Is read-values just for reading e.g. top-level Array elements?
I recently implemented the following example, where I used readTree to get one property from the top-level object and then created a lazy seq from the items in that property. Not sure if this uses streaming, but I'm quite sure it prevented storing the whole array in a vector (https://github.com/metosin/jsonista/blob/master/src/java/jsonista/jackson/PersistentVectorDeserializer.java):
;; assumes (:import (java.io Reader) (com.fasterxml.jackson.databind JsonNode))
;; and jsonista.core aliased as json
(defn read-geojson-features
  "Try reading a geojson file without loading the full features array into memory"
  [^Reader f]
  (let [^JsonNode tree (.readTree json/default-object-mapper f)
        ^JsonNode node (.get tree "features")]
    (map (fn [node]
           (.treeToValue json/default-object-mapper node ^Class Object))
         node)))
Is it possible to use streaming in such cases? If not, maybe preventing the creation of vectors for big Arrays is a separate issue.
@Deraen, thanks for giving it a look
read-values is just for reading top-level Array elements, as that's the context in which I considered streaming. I don't know if maps are "streamable" in that sense. Looks like JsonNode does expose a method for an Iterator of Map.Entry, so yes.
The main difference in your implementation is that .readTree seems to be eager; it just keeps an internal node representation instead of mapping it into an external object, and in that sense it is lazy. If you return a reducible/iterator instead of mapping over the node, it will be even lazier.
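For instance, something like this (a rough sketch reusing the snippet above, with the same json alias and imports assumed; the fn name is made up) would defer conversion until the result is actually reduced or consumed:

(defn read-geojson-features-reducible
  ;; same idea as above, but returns an eduction instead of a lazy seq,
  ;; so there is no seq caching and elements are converted one at a time
  [^Reader f]
  (let [^JsonNode tree (.readTree json/default-object-mapper f)
        ^JsonNode node (.get tree "features")]
    (eduction (map (fn [^JsonNode n]
                     (.treeToValue json/default-object-mapper n ^Class Object)))
              node)))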
The use case I was trying to solve is one where you already know you're going to be reading a large array or dealing with some stream of data.
We can divide the possible solutions into three degrees of laziness:
I think these solutions are fundamentally different. I don't think lazy streaming could be generalized beyond 3, but partial laziness like your solution is an avenue to explore. They are separate issues, with separate use cases and requirements, in my estimation.
Yeah, that makes sense.
I'll try to look a bit more into case 2, to see if there is something that could be shared with this case. Before introducing a new API here, I want to understand whether we could cover both cases with similar functions.
Maybe I'll need to read the JsonNode impl, or profile memory use with readTree.
It is possible to also use stream reading to read values from an array inside an object: https://github.com/metosin/jsonista/compare/stream-testing https://cassiomolin.com/2019/08/19/combining-jackson-streaming-api-with-objectmapper-for-parsing-json/
One just needs to navigate the parser to the array start token first.
I guess lazy-seq is doing some caching so the example is not optimal, but I didn't quickly find a better way to call .readValueAs until the END_ARRAY token is found.
I don't think we need to provide functions to move the parser, but maybe something to make it easier to efficiently read array values once the parser is in the correct position?
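Something along these lines, for example (a rough sketch, not taken from the linked branch; the helper name and the parser setup are my assumptions):

(import '(com.fasterxml.jackson.core JsonParser JsonToken)
        '(com.fasterxml.jackson.databind ObjectMapper))

;; hypothetical helper: given a parser positioned at the start of the document,
;; advance to the named array field and return a lazy seq of its elements,
;; reading with .readValueAs until END_ARRAY is reached
(defn read-array-field [^JsonParser parser field-name]
  (.nextToken parser)                                   ; consume START_OBJECT
  (loop []
    (let [token (.nextToken parser)]
      (cond
        (nil? token) nil
        (and (= token JsonToken/FIELD_NAME)
             (= (.getCurrentName parser) field-name))
        (do (.nextToken parser)                         ; consume START_ARRAY
            ((fn step []
               (lazy-seq
                 (when (not= (.nextToken parser) JsonToken/END_ARRAY)
                   (cons (.readValueAs parser ^Class Object) (step)))))))
        ;; skip the values of other fields (no-op for scalars)
        :else (do (.skipChildren parser) (recur))))))

;; usage: the parser needs a codec so .readValueAs knows how to map values
;; (let [mapper (ObjectMapper.)
;;       parser (doto (.createParser (.getFactory mapper) reader)
;;                (.setCodec mapper))]
;;   (read-array-field parser "features"))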
wrap-values is currently private, and it would be useful if a user wants to call e.g. readValuesAs themselves. Is that fn needed because the Iterators from Jackson don't implement Iterable themselves?
What's the difference between wrap-values and clojure.core/iterator-seq? Chunking? Though an Iterable is turned into a seq with the same method.
@Deraen not exposing a seq API over read-values was intentional. It returns something very similar to an Eduction. A user can always transform it to a lazy-seq and get everything associated with it, but the other way around? Not so much. Lazy seqs just create data buffers in memory. I want to be able to stream data from input to output directly. Imagine reading a byte stream with read-values and writing it out with write-values: no intermediary allocations or buffering, directly bytes to bytes (or stream to stream). This implementation returns an iterable; you can wrap it with an eduction, which is also an iterable, then write it out with write-values.
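Roughly like this (a sketch of the intended usage, assuming the read-values/write-values signatures proposed in this PR; the file names and the filter predicate are made up):

(require '[jsonista.core :as j]
         '[clojure.java.io :as io])

(with-open [in  (io/input-stream "large-input.json")
            out (io/output-stream "filtered-output.json")]
  (->> (j/read-values in)                      ; iterable over the array elements
       (eduction (filter #(get % "active")))   ; transform without buffering everything
       (j/write-values out)))                  ; stream back out as a JSON array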
read-values dispatches to the ReadValues protocol. It returns an iterator via an ObjectReader derived from the supplied mapper. The returned iterator is wrapped in a reified object, similar to Eduction, that supports reduction and sequence construction over it.
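For illustration, that wrapping could look roughly like this (a minimal sketch of the idea, not the actual PR code; wrap-values here is a stand-in for the private helper mentioned above):

(import '(java.util Iterator))

(defn- wrap-values
  ;; wraps the MappingIterator returned by ObjectReader#readValues so the
  ;; result is Iterable, reducible and seqable, much like an Eduction
  [^Iterator iter]
  (reify
    java.lang.Iterable
    (iterator [_] iter)
    clojure.lang.IReduceInit
    (reduce [_ f init]
      (loop [acc init]
        (if (.hasNext iter)
          (let [acc (f acc (.next iter))]
            (if (reduced? acc) @acc (recur acc)))
          acc)))
    clojure.lang.Seqable
    (seq [_] (iterator-seq iter))))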
write-values relies on two protocols: WriteValues for the output destination, similarly to WriteValue, and WriteAll for the type being written, which can be an array or an Iterable. It writes an array or Iterable to the destination via a SequenceWriter. Importantly, write-values disables automatic flushing on serialization to get good performance.
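The SequenceWriter part might look something like this (a rough sketch of the mechanism, not the PR code; write-values-sketch is a hypothetical stand-in):

(import '(com.fasterxml.jackson.databind ObjectMapper SequenceWriter SerializationFeature)
        '(java.io OutputStream))

(defn write-values-sketch
  ;; writes xs (an Iterable or an object array) to out as one JSON array,
  ;; with per-value flushing disabled for throughput
  [^ObjectMapper mapper ^OutputStream out xs]
  (with-open [^SequenceWriter w (-> (.writer mapper)
                                    (.without SerializationFeature/FLUSH_AFTER_WRITE_VALUE)
                                    (.writeValuesAsArray out))]
    (if (instance? Iterable xs)
      (.writeAll w ^Iterable xs)
      (.writeAll w ^objects xs))))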