taoensso / nippy

The fastest serialization library for Clojure
https://www.taoensso.com/nippy
Eclipse Public License 1.0

Cache (references) #154

Closed RokLenarcic closed 11 months ago

RokLenarcic commented 1 year ago

I see that nippy/cache is being deprecated... As an avid user of it, I question why this change is being made and what I should do with my code that uses it...

ptaoussanis commented 1 year ago

@RokLenarcic Hi Rok! Motivation is described at the relevant commit.

Basically, it's a complex feature with generally limited value and (as far as I was aware) little use outside of my own applications. The deprecation isn't set in stone, though; I'd be open to keeping the feature around if there are folks who find it helpful.

Could you describe your use case a little? What impact does the cache have in your case?

What I've found is that with compression enabled (the default), the benefits of caching tend to be small in most of the use cases I've observed.
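To illustrate the point (using the JDK's general-purpose DEFLATE via java.util.zip rather than Nippy's actual LZ4, so this is only a rough stand-in): highly repetitive data compresses to almost nothing even without any caching/deduplication step.

```clojure
(import '(java.util.zip Deflater))

(defn deflated-size
  "Compressed size of bs under plain DEFLATE."
  [^bytes bs]
  (let [d   (doto (Deflater.) (.setInput bs) (.finish))
        buf (byte-array 4096)]
    (loop [total 0]
      (if (.finished d)
        (do (.end d) total)
        (recur (+ total (.deflate d buf)))))))

(let [s   (apply str (repeat 100 \A))           ; a 100-char string
      raw (.getBytes (apply str (repeat 100 s)) ; repeated 100x = 10,000 bytes
                     "UTF-8")]
  [(alength raw) (deflated-size raw)])
;; the compressed size is a tiny fraction of the 10,000 raw bytes
```

Since the compressor already collapses the duplication, an explicit cache reference tends to shrink the stored payload only modestly once compression is on.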

ptaoussanis commented 1 year ago

I should add a detail not present in the commit message above: part of the motivation for removing caching was to help simplify Nippy's data format for folks who may want to implement parts of Nippy themselves, e.g. https://github.com/ptaoussanis/nippy/issues/151

It's become increasingly common over the years for folks to want to decode/parse/analyse Nippy payloads in some custom way. Anything I can do to simplify the format makes these kinds of tasks easier. I noticed that most people were just skipping support for cached vals since they're complex to implement and rarely used in practice.

The same point applies though: I'm definitely keen to hear from users of the feature. The deprecation can be abandoned, and/or some other alternative provided.

RokLenarcic commented 1 year ago

I use it to compress the representation of large datasets where a certain column has a very limited number of distinct values, e.g. data that is the git history of a whole repository, commit by commit. I use it on authors and sometimes file paths: a repo might have ~50 authors and 10,000 commits, so the same author string appears many times.

Now, I haven't really tested how large the saving is after compression; you're probably right that LZ4 should already reduce the cost of duplication.

But there's another consideration: there are two metrics here. One is the size of the byte array of serialized (frozen) data, i.e. the size when it's stored on persistent storage. Enabling compression helps here.

The other metric is the size of the data on the heap when I thaw it: I want all those references to point to the same object on the heap rather than to a bunch of different objects. See:

;; mm here is clj-memory-meter.core (measures retained object-graph size)
(let [x (apply str (repeat 100 \A))]
  (mm/measure
   (nippy/thaw
     (nippy/freeze (repeat 100 x)))))
=> "14,8 KiB"

vs

(let [x (apply str (repeat 100 \A))]
  (mm/measure
   (nippy/thaw
     (nippy/freeze (repeat 100 (nippy/cache x))))))
=> "944 B"

This second metric is even more important to me: I don't care as much about persistent-storage space as I do about size on the heap. And compression doesn't help there.

ptaoussanis commented 1 year ago

Thanks for the detailed info, that's very helpful 👍

For completeness, would you mind comparing the size of serialized (frozen) data in your case with and without the cache?

If your main motivation is object identity of strings, I wonder whether interning thawed strings might be sufficient in your case [*]. Will experiment a bit when I'm next on Nippy work. (Am currently on a potential new open-source project, but it'd impact Nippy so will have the opportunity soon).

[*] Edit to clarify: there's a few options I have in mind, one would be something like (nippy/intern x) which'd act as a drop-in replacement for (nippy/cache x) to indicate that thawed strings should be interned, etc. Or this might be a reasonable default, and could offer a way to opt-out. Need to experiment.

RokLenarcic commented 1 year ago

(let [x (apply str (repeat 100 \A))]
  (alength
    (nippy/freeze (repeat 100 (nippy/cache x)))))
=> 208

(let [x (apply str (repeat 100 \A))]
  (alength
    (nippy/freeze (repeat 100 x))))
=> 66

ptaoussanis commented 1 year ago

Thanks Rok - but I'm looking for a more real-world comparison using actual data.

The case with caching will of course be smaller - the question is whether the difference in practice warrants the costs (in complexity, maintenance, etc.). Since you're concerned about heap size in your case, I'm supposing that you have a very large dataset - and it may be informative to know how much of a difference the caching makes there as a proportion of the total data size.

A ballpark estimate (e.g. sample) would be sufficient, e.g.: more or less than 1%? Actually, for the same reasons it would be nice to likewise have a memory comparison for the same (real-world) data.

RokLenarcic commented 1 year ago

I wouldn't recommend actually calling String/intern. A long-running process like a server doesn't want to intern strings, and you cannot unintern them, so they take up memory forever. A single large dataset might have the email "rok.lenarcic@email.com" repeated 1000s of times; the next dataset is from a different customer account and has a different email repeated 1000s of times. If I use String/intern, then eventually the server will hold in memory bits and pieces of data from every customer account it has ever seen, which it really doesn't need, because they never appear in the same object graph, yet most of the heap will be spent on interned strings.

No, looking at this, it would be easy to implement using the xform you have in another issue:

;; assumes (import '(java.util HashMap) '(java.util.function Function));
;; `intern?` and `unwrap` are the hypothetical wrapper-type helpers mentioned below
(let [m (HashMap.)]
  (fn xform [x]
    (if (intern? x)
      (.computeIfAbsent m (unwrap x)
        (reify Function (apply [_ k] k))) ; identity fn: first instance seen wins
      x)))

I'd just have to either roll my own intern wrapper type, or do it simply by class (e.g. apply it to all strings).

The neat thing here is that I'm free to reuse the same HashMap for multiple thaws, and I'm generally in control of how far the interning scope extends. You could write up a variant of this for the common case (interning within the scope of a single thaw) and put it in a namespace of its own so it doesn't pollute the Nippy core.
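A minimal, Nippy-independent sketch of that common case (make-interner is a hypothetical name, not a Nippy API). The HashMap's lifetime defines the interning scope: use a fresh interner per thaw, or share one across several thaws.

```clojure
(import '(java.util HashMap))

(defn make-interner
  "Returns a stateful function that maps each value to one canonical
  instance. Unlike String/intern, nothing outlives the interner itself:
  drop the returned fn and the whole table becomes garbage."
  []
  (let [m (HashMap.)]
    (fn [x]
      (if-some [canon (.get m x)]
        canon
        (do (.put m x x) x)))))

;; Two equal-but-distinct strings collapse to one instance:
(let [intern! (make-interner)
      a (String. "abc")
      b (String. "abc")]
  (identical? (intern! a) (intern! b)))
;; => true
```

Plugging something like this into a thaw-side transform would give the heap sharing of nippy/cache without baking references into the serialized format.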

P.S. It's a shame clojure.core doesn't have a trivial wrapper for adding a meta map to an arbitrary object (e.g. wrapping a string or long), because all library authors end up inventing their own. This is super annoying for ser/deser, because you need custom support for each one.

RokLenarcic commented 1 year ago

> Thanks Rok - but I'm looking for a more real-world comparison using actual data.
>
> A ballpark estimate (e.g. sample) would be sufficient, e.g.: more or less than 1%? Actually, for the same reasons it would be nice to likewise have a memory comparison for the same (real-world) data.

I took a medium sized example of a git repository dataset. By using cache references the sizes are:

After disabling cache references:

While this isn't spectacularly big, I've seen 100 MB+ files from some customers, so the heap is likely a couple of GB for those datasets. As you can see, the difference can be anywhere from 15% to 100% more heap use.

RokLenarcic commented 1 year ago

A bigger example:

Files: 9.6 MB, 25.8 MB; Heap: 77 MB, 221 MB

Disabled cache:

Files: 10.6 MB, 33.5 MB; Heap: 96.2 MB, 514.1 MB

If you provide some sort of hook so I can roll my own interning, I'm happy with that.

RokLenarcic commented 1 year ago

Frankly, having #153 implemented would allow me to implement this myself. I think #153 doesn't need to be a dynamic binding; it could just be an ordinary option passed to the thaw function.

ptaoussanis commented 11 months ago

Update: the cache feature will be retained for now, I'm reverting the removal.

Thanks again for pinging about this.