taoensso / nippy

The fastest serialization library for Clojure
https://www.taoensso.com/nippy
Eclipse Public License 1.0

GC overhead limit exceeded when thawing after a freeze #136

Closed · eihli closed this issue 3 years ago

eihli commented 3 years ago

I have a ~140mb edn file containing a hashmap of the following data. I read it in with (read-string (slurp ...)), then froze it with (nippy/freeze-to-file ...). The freeze took about 10 seconds. The thaw then churns at 100% CPU indefinitely and sometimes crashes with a "GC overhead limit exceeded" error. It's intermittent, though: with no changes to the environment, I've sometimes gotten the GC error and other times let it run for ~10+ minutes before killing the process.

([("profanity" "unholy") {"its" 2}]
 [("ants" "triumph") {nil 1}]
 [("hiding" "our") {"of" 1, "expose" 3, "above" 1}]
 [("won't" "intervention") {"divine" 1, "an" 1}]
 [("pines" "weeping") {"the" 1}]
 [("let" "give") {"to" 1}]
 [("memory" "undead") {"an" 1}]
 [("waters" "one") {"as" 1}]
 [("that" "palms") {"the" 1, "these" 1}]
 [("you" "tonite") {"but" 1, "volume" 1}]
 ,,,)

Here is a link to the edn file https://www.dropbox.com/s/urby2ahcwp58l4f/dark-corpus-2.edn?dl=0

It takes about a minute to write/read it to disk as edn using pr-str / read-string.

With nippy, it writes to disk in ~10-15 seconds and takes up about 80mb.
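For concreteness, this is roughly the round trip being described (a sketch only; the output filename is illustrative):

(require '[taoensso.nippy :as nippy])

;; Read the edn (~1 min), freeze it (~10-15 s), then attempt to thaw it.
(def darkov (read-string (slurp "resources/dark-corpus-2.edn")))
(nippy/freeze-to-file "dark-corpus-2.npy" darkov)
(nippy/thaw-from-file "dark-corpus-2.npy") ; churns at 100% CPU / GC overhead limit exceeded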

Rather than upload the entire 140mb edn file, I tried to write a generator script that would reproduce the issue.

(require '[taoensso.nippy :as nippy]) ; for the nippy/* calls below

(defn rand-word []
  (let [num-letters (+ 4 (rand-int 4))
        rand-letter (fn []
                      (nth (map char (range 97 123)) (rand-int 26)))]
    (if (< 0.95 (rand))
      nil
      (apply str (take num-letters (repeatedly rand-letter))))))

(defn rand-kv []
  [(list (rand-word) (rand-word))
   (into {} (map (fn [_] (vector (rand-word) (rand-int 5))) (range (rand-int 3))))])

(defn nip-roundtrip []
  (let [data (into {} (take 5000000 (repeatedly rand-kv)))]
    (time (nippy/freeze-to-file "test.bin" data))
    (time (nippy/thaw-from-file "test.bin"))))

That generates roughly 125mb of text data when pr-str'd, so it's very similar to what I'm working with. But with that data, I'm not getting any errors: nippy writes and reads it in about 15 seconds each.

So now I'm thinking there's something in my 140mb of edn that doesn't play nicely with nippy. Any idea what that might be?

The EDN was generated from text scraped from books and the web. Each key is a list of strings or nil. Each value is a map of strings (or nil) to numbers. It was written with pr-str to an edn file and read back with (read-string (slurp ...)). Is there any reasonable way for that type of data to cause problems for nippy? Some particular encoded character or something?

ptaoussanis commented 3 years ago

@eihli Hi Eric! That's strange, there's no obvious reason I can think of that you'd see this behaviour. Will try to take a look at your edn file this week and come back to you.

eihli commented 3 years ago

I'll provide the results of my troubleshooting in hopes they'll help narrow things down.

Each individual block of 10,000 key/value pairs can be frozen/thawed without issue. The code below succeeds.

  (def darkov (read-string (slurp "resources/dark-corpus-2.edn")))
  (run!
   (fn [[index cur-map]]
     (let [cur-map (into {} cur-map)]
       (println)
       (time (nippy/freeze-to-file "test.bin" cur-map))
       (time (nippy/thaw-from-file "test.bin"))
       (println)
       (println (format "%f to %f are A-OK" (* index 1e4) (* (inc index) 1e4)))))
   (map vector (range) (partition 10000 darkov)))

There is a long delay roughly every ~300,000 kv pairs when freezing/thawing in chunks of 10,000. I assume this has to do with GC?

950000.000000 to 960000.000000 are A-OK
"Elapsed time: 25.711573 msecs"
"Elapsed time: 18.573112 msecs"
960000.000000 to 970000.000000 are A-OK
"Elapsed time: 2993.877507 msecs"
"Elapsed time: 18.416242 msecs"
970000.000000 to 980000.000000 are A-OK
...
"Elapsed time: 19.725587 msecs"
1290000.000000 to 1300000.000000 are A-OK
"Elapsed time: 25.213969 msecs"
"Elapsed time: 3036.215616 msecs"
1300000.000000 to 1310000.000000 are A-OK

Splitting the map in half results in the same infinite churn on thaw. Splitting it into quarters results in a ~8-second thaw.

  (def darkov-2-one-quarter
    (-> (count darkov)
        (/ 4)
        (take darkov)
        (#(into {} %))))
  (time (nippy/freeze-to-file "test.bin" darkov-2-one-quarter))
  (time (def foo (nippy/thaw-from-file "test.bin")))

eihli commented 3 years ago

And here is an overview from VisualVM. Not sure what I was looking for with this, but here's the result anyway. The spikes directly under 12:38 are from a thaw of 1/4 of the edn file. Then there's a gap in activity while I updated the code to use 1/2 of the file. The rest of the graph is the infinite churn from trying to thaw that half.

[VisualVM screenshot]

eihli commented 3 years ago

I think this was all simply a heap size issue. I ran my REPL with -Xmx6g (default was 4g) and I was able to read from the file that was causing the issue before.
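For reference, one way to set that heap limit (a sketch assuming a Leiningen project; with tools.deps you could instead put the flag in an alias's :jvm-opts or start with clj -J-Xmx6g):

;; project.clj: illustrative minimal project raising the REPL JVM's max heap
;; (project name and dependency versions here are only examples)
(defproject dark-corpus "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [com.taoensso/nippy "3.1.1"]]
  :jvm-opts ["-Xmx6g"])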

Confusing, because I never saw my used heap approach the max heap, it would usually churn indefinitely without ever throwing an OOM exception, and I was only reading a few files of < 150mb into memory, so I didn't expect (incorrectly, I now presume) that process to require 4+ gigs of heap. As for why the similar-looking generated data wasn't causing issues: perhaps my REPL was in a different state and had fewer objects in memory at the time.

ptaoussanis commented 3 years ago

Thanks for the updates 👍

A couple of thoughts/notes so far. Basically, I'd characterise this as two independent issues:

  1. Your data inherently consumes a lot of memory.
  2. Freezing/thawing adds additional memory overhead on top of that.

To help with 2., you can disable compression, and/or see whether you can do a streaming thaw/freeze. freeze-to-file currently freezes the whole payload up-front before writing; you could instead try using freeze-to-out! to freeze directly to a file stream.
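For example, a rough sketch of the streaming approach (not from the original comment; the helper names and path are illustrative):

(require '[clojure.java.io :as io]
         '[taoensso.nippy :as nippy])

;; Stream directly to/from a file instead of building the whole byte
;; payload in memory first.
(defn freeze-streaming! [path data]
  (with-open [out (java.io.DataOutputStream. (io/output-stream path))]
    (nippy/freeze-to-out! out data)))

(defn thaw-streaming [path]
  (with-open [in (java.io.DataInputStream. (io/input-stream path))]
    (nippy/thaw-from-in! in)))

;; Compression can be disabled via the opts map on the freeze fns,
;; e.g. (nippy/freeze data {:compressor nil}).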

To help with 1., I'd try restructuring your data if possible to reduce the very large number of persistent data structures. Since each structure adds overhead, it'd probably be better to have fewer, larger structures than many small ones. And/or consider non-persistent data structures, and/or a more compact/dense/memory-efficient representation in general.
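As an illustration only (not from the original comment), one way to trade the many small inner maps for flatter structures, assuming data shaped like the sample at the top of the issue:

;; Flatten each inner {word count} map into a plain vector before freezing,
;; and rebuild the maps after thawing.
(defn pack-vals [m]
  (into {} (map (fn [[k v]] [k (vec (mapcat identity v))])) m))

(defn unpack-vals [m]
  (into {} (map (fn [[k v]] [k (apply hash-map v)])) m))

;; (pack-vals {'("hiding" "our") {"of" 1, "expose" 3}})
;; => {("hiding" "our") ["of" 1 "expose" 3]}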

Hope that's some help?

Cheers! :-)

henrishm commented 3 years ago

Thank you for the suggestions @ptaoussanis. I was able to apply them by converting to Java record types before freezing and back to Clojure persistent types after thawing. The improvement was absolutely amazing: freezing now takes less than a second, down from more than 10 minutes!
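For readers landing here later, a rough sketch of what such a conversion could look like (not the commenter's actual code; Entry and the helpers are hypothetical):

;; Hypothetical record standing in for one key/value pair.
(defrecord Entry [words counts])

(defn map->records [m]
  (mapv (fn [[k v]] (->Entry (vec k) (vec (mapcat identity v)))) m))

(defn records->map [entries]
  (into {}
        (map (fn [{:keys [words counts]}]
               [(seq words) (apply hash-map counts)]))
        entries))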