greglook / puget

Canonical Colorizing Clojure Printer
The Unlicense

Puget requires a large amount of memory when printing large collections #60

Open gfredericks opened 5 months ago

gfredericks commented 5 months ago

Minimal reproduction:

```clojure
(puget.printer/pprint (repeatedly 500000 #(rand-int 1000000000)))
```

When I run this with `-Xmx200m` and watch the GC logs, I can see memory usage gradually growing, and the program slows down as it approaches the heap limit; with `-Xmx500m` it has enough headroom to finish in a more reasonable amount of time.

I'm fairly confident this is due to how this line interacts with the four-element vector produced here. In particular, because the seq over that vector is chunked, the `map` inside the `mapcat` will retain a reference to the third element (the large collection) for as long as any element of the chunk is still being processed.
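For context on the retention mechanism: seqs over Clojure vectors are chunked, and chunk-aware functions like `map` realize (and hold) a whole 32-element chunk even when only one element has been consumed. An illustrative REPL check, not from the original thread:

```clojure
;; map over a chunked seq applies f to the entire first chunk (32
;; elements) as soon as the first element is requested, so every
;; element of that chunk stays strongly reachable in the meantime.
(def touched (atom 0))

(first (map (fn [x] (swap! touched inc) x)
            (vec (range 100))))

@touched
;; => 32, not 1
```

In the four-element doc vector the whole thing fits in one chunk, so realizing the first element pins the large third element too.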

As evidence for this explanation, this change to fipp (unchunking the sequence) seems to fix it:

```diff
diff --git a/src/fipp/engine.cljc b/src/fipp/engine.cljc
index 8e6266d..9904a0e 100644
--- a/src/fipp/engine.cljc
+++ b/src/fipp/engine.cljc
@@ -12,7 +12,7 @@
 (defn serialize [doc]
   (cond
     (nil? doc) nil
-    (seq? doc) (mapcat serialize doc)
+    (seq? doc) (mapcat serialize (take 1e100 doc))
     (string? doc) [{:op :text, :text doc}]
     (keyword? doc) (serialize-node [doc])
```

I do not know what a clean fix for this would be. I'm not sure we can make the above change to fipp without potentially sacrificing performance in the farther-down-the-stack case where it's processing an actual large sequence, rather than a vector containing a large sequence as an element. And I can't think of anything that puget could do to cause fipp to behave differently.
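For reference, the unchunking idiom commonly seen in the Clojure community rebuilds the seq one element at a time rather than relying on `take`. This is a sketch of that idiom, not code from puget or fipp, and it carries the same per-element laziness cost the paragraph above worries about:

```clojure
(defn unchunk
  "Return a lazy seq over s that yields elements strictly one at a
  time, defeating the 32-element chunking of the underlying seq."
  [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

;; e.g. fipp's serialize could call (mapcat serialize (unchunk doc))
;; in place of the (take 1e100 doc) hack above.
```

The resulting seq is a plain `LazySeq`, so chunk-aware consumers like `map` fall back to element-at-a-time realization and release each element as soon as it is consumed.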

cichli commented 4 months ago

It's from a while ago so I can't remember all the details, but this fipp PR and the comments therein are related. I found that unchunking that exact same sequence prevented heap exhaustion on JDK 8+.