metosin / jsonista

Clojure library for fast JSON encoding and decoding.
https://cljdoc.org/d/metosin/jsonista
Eclipse Public License 2.0

Add read-values and write-values #53

Closed. bsless closed this 2 months ago

bsless commented 3 years ago

read-values dispatches to the ReadValues protocol. It returns an iterator via an ObjectReader derived from the supplied mapper. The returned iterator is reified in a manner similar to Eduction to support reduction and sequence construction over it.

write-values relies on two protocols: WriteValues for the output destination (similar to WriteValue), and WriteAll for the type being written, which can be an array or an Iterable. It writes the array or iterable to the destination via a SequenceWriter. Importantly, write-values disables automatic flushing during serialization to get good performance.
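
A minimal usage sketch, assuming the arities mirror the existing read-value/write-value (source or destination first, optional mapper last) and that numbers.json is just an illustrative file name:

(require '[jsonista.core :as j]
         '[clojure.java.io :as io])

;; Write a large lazy sequence to a file without materializing it;
;; automatic flushing is disabled internally for performance.
(with-open [os (io/output-stream "numbers.json")]
  (j/write-values os (map (fn [i] {"n" i}) (range 1000000))))

;; Read it back and reduce over the iterator-backed result without
;; building a vector or lazy seq in memory.
(with-open [is (io/input-stream "numbers.json")]
  (reduce (fn [acc m] (+ acc (get m "n"))) 0 (j/read-values is)))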

bsless commented 3 years ago

This adds support for reading and writing large sequences without materializing them in memory. There are a few choices I made that I'm not sure about:

bsless commented 3 years ago

@ikitommi do you know why the workflow failed when setting up the environment? The error is from tar, of all things. Did I do something wrong or does it need to be rerun?

Deraen commented 3 years ago

Is read-values just for reading e.g. top-level Array elements?

I recently implemented the following example, where I used readTree to get one property from the top-level object and then created a lazy seq from the items in that property. I'm not sure if this uses streaming, but I'm quite sure it prevented storing the whole array in a vector (https://github.com/metosin/jsonista/blob/master/src/java/jsonista/jackson/PersistentVectorDeserializer.java):

;; assumes (:require [jsonista.core :as json])
;;         (:import (com.fasterxml.jackson.databind JsonNode) (java.io Reader))
(defn read-geojson-features
  "Try reading a geojson file without loading the full features array into memory"
  [^Reader f]
  (let [^JsonNode tree (.readTree json/default-object-mapper f)
        ^JsonNode node (.get tree "features")]
    (map (fn [^JsonNode item]
           (.treeToValue json/default-object-mapper item ^Class Object))
         node)))

Is it possible to use streaming in such cases? If not, maybe preventing the creation of vectors for big Arrays is a separate issue.

bsless commented 3 years ago

@Deraen, thanks for giving it a look. read-values is just for reading top-level Array elements, as that's the context in which I considered streaming. I wasn't sure if maps are "streamable" in that sense, but it looks like JsonNode does expose a method for an Iterator of Map.Entry, so yes. The main difference in your implementation is that .readTree seems to be eager, but it only builds an internal node representation instead of mapping into external objects, and in that sense it is lazy. If you return a reducible/iterator instead of mapping over the node, it will be even lazier. The use case I was trying to solve is one where you already know you're going to be reading a large array or dealing with some stream of data. We can divide the possible solutions into three degrees of laziness:

  1. Zero laziness: this is what jsonista currently supports.
  2. Partial laziness: this is your solution. It is slightly more general in that it allows querying the entire JSON structure and mapping over map entry pairs. Its downside is that it reads all the data into memory and builds the JsonNode tree, which may not be desirable when the tree is very large. It still does not create the intermediary Clojure objects. Writing an EQL parser which compiles to it could be interesting.
  3. Full top-level laziness: my implementation. It assumes the top-level node is an array and exposes an iterator/reducible over it; each element is fully deserialized.

I think these solutions are fundamentally different. I don't think lazy streaming can be generalized beyond case 3, but partial laziness like your solution is an avenue to explore. In my estimation these are separate issues, with separate use cases and requirements.
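
For reference, a small sketch of the Map.Entry iterator mentioned above (presumably Jackson's JsonNode.fields(); the input string is illustrative), walking the fields of an already-parsed tree entry by entry and decoding each value with the mapper:

(require '[jsonista.core :as json])
(import '(com.fasterxml.jackson.databind JsonNode ObjectMapper))

;; Iterate the fields of a parsed tree one Map.Entry at a time and decode
;; each value via the mapper, yielding a map of field name to decoded value.
(let [^ObjectMapper mapper json/default-object-mapper
      ^JsonNode tree (.readTree mapper "{\"a\": 1, \"b\": [1, 2, 3]}")]
  (->> (iterator-seq (.fields tree))
       (map (fn [^java.util.Map$Entry e]
              [(.getKey e)
               (.treeToValue mapper ^JsonNode (.getValue e) ^Class Object)]))
       (into {})))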

Deraen commented 3 years ago

Yeah, that makes sense.

I'll try to look a bit more into case 2 to see if there is something that could be shared with this case. Before introducing a new API here, I want to understand whether we could cover both cases with similar functions.

Maybe I'll need to read the JsonNode implementation, or profile memory use with readTree.

Deraen commented 3 years ago

It is also possible to use stream reading to read values from an array inside an object: https://github.com/metosin/jsonista/compare/stream-testing https://cassiomolin.com/2019/08/19/combining-jackson-streaming-api-with-objectmapper-for-parsing-json/

One just needs to navigate the parser to the array start token first.

I guess lazy-seq does some caching, so the example is not optimal, but I didn't quickly find a better way to call .readValueAs until the END_ARRAY token is found.

I don't think we need to provide functions to move the parser, but maybe something to make it easier to efficiently read array values once the parser is in the correct position?
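
A rough sketch of that idea (not the linked branch; the mapper, helper name, and error handling are illustrative only): create a parser from the mapper's factory, skip tokens until the target field's START_ARRAY, then read one value per element until END_ARRAY.

(require '[jsonista.core :as json])
(import '(com.fasterxml.jackson.core JsonParser JsonToken)
        '(com.fasterxml.jackson.databind ObjectMapper)
        '(java.io Reader))

(defn read-array-field
  "Skip ahead to the array under `field-name`, then read its elements one
  value at a time until END_ARRAY. Eager, and without nesting checks."
  [^Reader r field-name]
  (let [factory (.getFactory ^ObjectMapper json/default-object-mapper)
        ^JsonParser parser (.createParser factory r)]
    ;; advance the parser until it sits on the field, then step onto START_ARRAY
    (loop []
      (let [t (.nextToken parser)]
        (cond
          (nil? t) (throw (ex-info "field not found" {:field field-name}))
          (and (= JsonToken/FIELD_NAME t)
               (= field-name (.getCurrentName parser))) (.nextToken parser)
          :else (recur))))
    ;; one .readValueAs call per element; the parser's codec comes from the mapper
    (loop [acc []]
      (if (= JsonToken/END_ARRAY (.nextToken parser))
        acc
        (recur (conj acc (.readValueAs parser ^Class Object)))))))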

Deraen commented 3 years ago

wrap-values is currently private, but it would be useful if a user wants to call e.g. readValuesAs themselves. Is that fn needed because the Iterators from Jackson don't implement Iterable themselves?

What's the difference between wrap-values and clojure.core/iterator-seq? Chunking? Though an Iterable is turned into a seq with the same method.

bsless commented 3 years ago

@Deraen not exposing a seq API over read-values was intentional. It returns something very similar to an Eduction. A user can always transform it to a lazy seq and get everything associated with it, but the other way around? Not so much. Lazy seqs just create data buffers in memory. I want to be able to stream data from input to output directly. Imagine reading a byte stream with read-values and writing it out with write-values: no intermediary allocations or buffering, directly bytes to bytes (or stream to stream). This implementation returns an iterable; you can wrap it in an Eduction, which is also an iterable, then write it out with write-values.
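
For example, a sketch of that input-to-output flow, assuming the functions land as described (the file names and the per-element transformation are illustrative, and elements are assumed to decode to maps):

(require '[jsonista.core :as j]
         '[clojure.java.io :as io])

;; Stream a large JSON array from one file to another, transforming each
;; element as it passes through, with no intermediate collection in memory.
(with-open [in  (io/input-stream "in.json")
            out (io/output-stream "out.json")]
  (->> (j/read-values in)                            ; iterator-backed, reducible/iterable
       (eduction (map #(assoc % "processed" true)))  ; transform element by element
       (j/write-values out)))                        ; written out via a SequenceWriter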