taoensso / nippy

The fastest serialization library for Clojure
https://www.taoensso.com/nippy
Eclipse Public License 1.0

freeze/thaw backed by a bytebuffer #140

Closed huahaiy closed 4 months ago

huahaiy commented 3 years ago

Unlike #95, which discusses using a bytebuffer as a custom data type, I would like to see an option for the freeze/thaw functions to be backed by a user-supplied bytebuffer, rather than being limited to the internally hardcoded byte array.

In many cases, using a bytebuffer directly is much faster than going through the DataOutput/DataInput interface via a byte-array stream, which is not a particularly optimized code path on the JVM; see e.g. https://github.com/RoaringBitmap/RoaringBitmap/issues/319
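To make the two code paths being compared concrete, here is a minimal Java sketch (not Nippy's internals; the helper names are invented for illustration): the stream-based route layers DataOutputStream over ByteArrayOutputStream and ends with a defensive copy, while the buffer route writes straight into a caller-supplied ByteBuffer. Both produce identical big-endian bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class WritePaths {
    // Path 1: the stream-based route a byte-array-backed freeze goes through.
    static byte[] viaStreams(long x, int y) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(baos);
            out.writeLong(x);          // each write funnels through the stream abstraction
            out.writeInt(y);
            return baos.toByteArray(); // plus a final defensive copy of the whole array
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Path 2: writing into a ByteBuffer directly (big-endian by default, like DataOutput).
    static byte[] viaBuffer(long x, int y) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.putLong(x).putInt(y);
        return buf.array();            // the buffer's own backing array, no extra copy
    }

    public static void main(String[] args) {
        byte[] a = viaStreams(42L, 7);
        byte[] b = viaBuffer(42L, 7);
        System.out.println(Arrays.equals(a, b)); // true: the encodings are byte-identical
    }
}
```

The point of the linked RoaringBitmap issue is exactly this: the stream path buys nothing for in-memory serialization, while the buffer path allows preallocated (and off-heap) storage.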

In my use case (Datalevin), working with a bytebuffer directly would speed up datom storage and retrieval.

ptaoussanis commented 3 years ago

@huahaiy Hi Huahai Yang, thanks for bringing this to my attention - sounds promising!

Would be happy to see a PR for this 👍

refset commented 1 year ago

Hi 🙂

We have been looking at this for XTDB recently, in support of speeding up the ingestion pipeline and reducing unnecessary allocations. Specifically, we want to avoid the current need to thaw documents returned by the 'document-store', which then get immediately re-encoded/frozen into KV bytebuffers for the 'index-store' (backed by RocksDB / LMDB etc.).

Instead the document-store could return a bytebuffer per document and from this XT should be able to construct the necessary KV bytebuffers by simply slicing and merging wrapped buffers (i.e. views with defined offsets and lengths) without any duplication or thawing at all.
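The slicing-without-copying idea above can be sketched in plain Java NIO (a toy illustration, not XT or Nippy code; the `view` helper is invented for the example). A duplicate with a narrowed position/limit, then a slice, yields a view with a defined offset and length that shares storage with the original buffer:

```java
import java.nio.ByteBuffer;

public class SliceDemo {
    /** Returns a zero-copy view of len bytes of doc starting at offset. */
    static ByteBuffer view(ByteBuffer doc, int offset, int len) {
        ByteBuffer dup = doc.duplicate();  // independent position/limit, shared storage
        dup.position(offset);
        dup.limit(offset + len);
        return dup.slice();                // still no bytes copied
    }

    public static void main(String[] args) {
        // Pretend bytes 4..7 of this "document" hold one still-frozen value.
        byte[] backing = {0, 1, 2, 3, 40, 50, 60, 70, 8, 9};
        ByteBuffer doc = ByteBuffer.wrap(backing);

        ByteBuffer value = view(doc, 4, 4);
        System.out.println(value.remaining()); // 4
        System.out.println(value.get(0));      // 40

        // The view shares storage with the original: writes are visible both ways.
        doc.put(4, (byte) 99);
        System.out.println(value.get(0));      // 99
    }
}
```

Agrona's DirectBuffer wrapping works along the same lines, with the added option of off-heap storage.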

In this branch, I have already extracted the necessary Nippy-internal codec information and created a get-len function that can satisfy our immediate requirements to avoid any thawing or copying: https://github.com/refset/xtdb/blob/df210146d1744b14c31fa29e994ac3932c54e8d5/core/src/xtdb/nippy_utils.clj

Note that we use Agrona extensively across XT already.

Do you have any feedback or thoughts on how this approach could perhaps evolve into a PR?

The capability to freeze to bytebuffers would also be useful but is not a current focus.

ptaoussanis commented 1 year ago

@refset Hi Jeremy,

I'm not expecting to have significant time this week to dig into this in detail. And a heads-up that I'm not familiar with XTDB or Agrona off-hand.

Would it be possible to try give a simplified high-level (/ ELI5) explanation of:

The easier you can make this for me to follow, the likelier I'll be able to get you a quick response.

Cheers!

refset commented 1 year ago

> What your objective is

Given an already-frozen Nippy serialization held in an existing bytebuffer, I'm looking for a capability to parse through and extract some of the inner contents without actually thawing any of the values, to avoid allocations for objects that aren't strictly needed for anything. The aim is to work with additional bytebuffers that hold wrapped slices of still-frozen nested values (i.e. without copying the underlying bytes either). For example, given an already-frozen map, I would like to be able to locate and (potentially later) thaw only a specific value under a known key, if that key exists in that map, as demonstrated here.

> How this relates to the current issue re: support for freezing to a user-supplied bytebuffer

I can't speak for the Datalevin project but I believe the overall goals are somewhat similar: a bytebuffer API would allow for memory to be re-used in tight loops and avoid creating unnecessary garbage. I can imagine that the initial scope of this issue for thawing might only require thawing from an entire buffer at a time, but I need something slightly more specific in addition, which is to be able to parse without thawing, so that I can later decide exactly which inner values I would like to thaw, if any. I am not currently looking for support to freeze to a user-specified bytebuffer.
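As a rough illustration of the reuse-in-tight-loops point, here is a hedged Java sketch (the `encodeAll` helper is invented, standing in for a real freeze loop): one buffer serves every iteration, where a byte-array API would allocate, and later garbage-collect, a fresh array each time around.

```java
import java.nio.ByteBuffer;

public class ReuseDemo {
    /** Encodes count records through ONE reusable buffer, folding them into a checksum. */
    static long encodeAll(int count) {
        ByteBuffer scratch = ByteBuffer.allocate(16); // single allocation for the whole loop
        long checksum = 0;
        for (int i = 0; i < count; i++) {
            scratch.clear();                  // reset position/limit; no reallocation
            scratch.putLong(i).putInt(i * 2); // "freeze" one record into the buffer
            scratch.flip();                   // switch to reading what was just written
            checksum += scratch.getLong() + scratch.getInt();
        }
        return checksum;
    }

    public static void main(String[] args) {
        System.out.println(encodeAll(1000)); // 1498500
    }
}
```

With `allocateDirect` the same pattern also keeps the scratch space off-heap entirely.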

> What kind of API/functionality would you ideally want Nippy to expose

An API similar to the get-len function in my commit, which, given a buf and offset, could return the type and length. Note that the current implementation doesn't return the type, but on reflection I've realised that I need it as well.
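For readers following along, a minimal sketch of what such a get-len-style function could look like, in Java for illustration. The type ids and lengths below are invented stand-ins, NOT Nippy's actual schema; the point is only the shape of the API: id byte in, type and total length out, with nothing decoded.

```java
import java.nio.ByteBuffer;
import java.util.Map;

public class GetLen {
    // Hypothetical type ids -- invented for this sketch, not Nippy's real schema.
    static final int T_LONG = 1; // 1 id byte + 8 payload bytes
    static final int T_BYTE = 2; // 1 id byte + 1 payload byte
    static final int T_STR  = 3; // 1 id byte + 4-byte length prefix + payload

    // Fixed payload length per id; -1 marks a length-prefixed type.
    static final Map<Integer, Integer> FIXED_LEN =
        Map.of(T_LONG, 8, T_BYTE, 1, T_STR, -1);

    /** Returns {typeId, totalLength} for the value at offset, without decoding it. */
    static int[] getLen(ByteBuffer buf, int offset) {
        int id = buf.get(offset) & 0xFF;
        int fixed = FIXED_LEN.get(id);
        int total = (fixed >= 0)
            ? 1 + fixed                       // id byte + fixed payload
            : 1 + 4 + buf.getInt(offset + 1); // id byte + length prefix + payload
        return new int[]{id, total};
    }
}
```

From here a caller can hop value to value (offset += total) and hand back wrapped slices of still-frozen values, without thawing anything.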

> I'm not familiar with [...] Agrona off-hand

The only reason I brought Agrona up was because I used it in the code I linked to. Specifically, Agrona provides an ergonomic API for working with on-heap and off-heap bytebuffers.

Thank you for the fast response 🙂

refset commented 1 year ago

I suppose there's quite a lot of overlap here with https://github.com/ptaoussanis/nippy/issues/147 - much (all?) of this could happily be done in userspace if the codec definitions in nippy.clj were introspectable/exposed somehow. Again, see the branch I mentioned for the ~small sections of nippy.clj I needed to copy across so that I could write my own get-len function - essentially just the type-id mappings and all the implied lengths (calculated by hand).

ptaoussanis commented 1 year ago

Hi Jeremy, thanks for the clarifications - that's helpful 👍

> I suppose there's quite a lot of overlap here with https://github.com/ptaoussanis/nippy/issues/147 - much (all?) of this could happily be done in userspace if the codec definitions in nippy.clj were introspectable/exposed somehow.

To be clear, I'd make a distinction between:

  1. Nippy's internal schema: mostly just the set of [byte-id type length] tuples.
  2. The encoding of base types as per java.io.DataOutput and optional compression/encryption.

Exposing a public view of the internal schema (1) should in principle be relatively straightforward. As I understood it, #147 also concerns itself with (2), which isn't Nippy-specific, and is potentially more of an undertaking depending on what the target platform offers.

For your particular use case: how far would it get you if Nippy core included something like a public nippy/type-ids, maybe with explicit length in bytes?

Seems that'd allow you to cut out ~90% of your branch code, and not depend on any fragile implementation details?

refset commented 1 year ago

> For your particular use case: how far would it get you if Nippy core included something like a public nippy/type-ids, maybe with explicit length in bytes? Seems that'd allow you to cut out ~90% of your branch code, and not depend on any fragile implementation details?

Agreed - I think that would work great 🙂

ptaoussanis commented 1 year ago

@refset 👍 Created https://github.com/ptaoussanis/nippy/issues/151 for next steps on public nippy/type-ids.

Leaving this issue open specifically for custom bytebuffer support.

ptaoussanis commented 1 year ago

Just to summarise current status re: support for user-supplied bytebuffers:

ptaoussanis commented 4 months ago

Closing for inactivity as part of issue triage