threatgrid / ctia

Cisco Threat Intelligence API
Eclipse Public License 1.0

switch lazyseq to iteration #1443

Closed ereteog closed 1 month ago

ereteog commented 2 months ago

Switch the migration task to iteration instead of the custom crawl approach.

§ QA

No QA is needed.
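For illustration, the switch described above can be sketched roughly as follows: iteration drives paginated reads and the consumer reduces over the result directly. Note that fetch-page!, :docs, and :next-token are hypothetical names for this sketch, not CTIA's actual migration API.

```clojure
;; Hypothetical sketch: `iteration` drives paginated reads, and the
;; consumer reduces over it directly. `fetch-page!`, `:docs` and
;; `:next-token` are illustrative, not CTIA's actual API.
(defn fetch-page! [token]
  ;; pretend remote call: returns {:docs [...], :next-token ...},
  ;; or nil when there are no more pages
  (when (< token 3)
    {:docs (range (* 10 token) (* 10 (inc token)))
     :next-token (inc token)}))

(def pages
  (iteration fetch-page!
             :initk 0           ; first page token
             :vf :docs          ; value emitted for each step
             :kf :next-token))  ; key for the next step; nil stops

(reduce (fn [n docs] (+ n (count docs))) 0 pages)
;; => 30
```

Because iteration is directly reducible, no intermediate lazy seq needs to be realized between the reads and the reduce.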

ereteog commented 1 month ago

> Does this have any memory implications? This gets passed to seque which seems to behave the same based on my reading.

I tried to generate a memory problem and it didn't trigger any. However, a very similar approach did generate a memory issue in IROH for reports, and it actually appeared to have a memory leak during the last migration. I did not rework it since I could not reproduce the memory issue, but it still bothers me :-).

frenchy64 commented 1 month ago

I think they are similarly lazy, I'm guessing iteration is slightly leaner since it doesn't need to cache its own seq.
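As a hedged illustration of the "leaner" point: reducing over an iteration consumes values one at a time without allocating cached seq cells, while a lazy-seq producer caches every realized cell. The step function and counter below are illustrative only.

```clojure
;; `iteration` returns something both seqable and directly reducible.
;; Reducing it pulls values straight from the step function, with no
;; cached seq in between.
(def step-calls (atom 0))

(defn step [k]
  (swap! step-calls inc)
  (when (< k 3) (inc k)))  ; returning nil terminates the iteration

(reduce + 0 (iteration step :initk 0))
;; => 6  (values 1, 2, 3)

@step-calls
;; => 4  (three emitted values plus the terminating nil)
```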

I ended up finding a bug which might be relevant: seque forces n+2 elements of the producing seq.

https://ask.clojure.org/index.php/14178/seque-forces-n-2-items-ahead-of-consumer-instead-of-n

(let [producer (fn [i]
                 (prn "producer" i)
                 (inc i))
      s (seque 1 (iteration producer :initk 0))]
  (Thread/sleep 1000))
;"producer" 0
;"producer" 1
;"producer" 2
nil

(let [producer (fn step [i]
                 (lazy-seq
                   (prn "producer" i)
                   (cons i (step (inc i)))))
      s (seque 1 (producer 0))]
  (Thread/sleep 1000)
  nil)
;"producer" 0
;"producer" 1
;"producer" 2
nil

frenchy64 commented 1 month ago

In addition, seque has a memory leak where it will hold onto the n+1 and n+2 elements even after the seque is garbage collected. This is because it sends off to an agent, and agents have a known memory leak since each action conveys the agent itself in an *agent* thread binding https://clojure.atlassian.net/browse/CLJ-2619
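For illustration, here is a minimal sketch of the conveyance mechanics referenced above: inside an agent action, *agent* is bound to the agent being acted on, and send/send-off convey binding frames to the action thread. This only demonstrates the binding, not the leak itself.

```clojure
;; Inside an action, *agent* is dynamically bound to the agent being
;; acted on. send-off also conveys the caller's binding frame to the
;; action thread; per CLJ-2619, a retained frame like this is what can
;; keep references alive after the seque is otherwise unreachable.
(def a (agent nil))

(send-off a (fn [_] *agent*))  ; store the value of *agent* seen inside the action
(await a)

(identical? a @a)
;; => true
```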

ereteog commented 1 month ago

> In addition, seque has a memory leak where it will hold onto the n+1 and n+2 elements even after the seque is garbage collected.

That would explain a lot! I could not reproduce the issue with a similar test I wrote for another implementation that I rewrote with iteration, and now I think I know what the main difference is, thx! The seque is in migrate-query, which is called with a reduce here. What you said would explain what happened last time for the events store, whose migration showed a memory leak and had to be restarted several times (fortunately the process can restart where it stopped :-) ). The migration is split per week because we need to sort, and sorting on weeks is fast while sorting the whole index is slow (of course :-) ). Since we migrated about 5 years of ~8 billion documents for events, it means that ~250 seques were not garbage collected :-)!

frenchy64 commented 1 month ago

Reported memory leak: https://ask.clojure.org/index.php/14185/memory-leak-in-seque-via-agents

frenchy64 commented 1 month ago

On reflection, we consume each seque entirely via reduce (right?). This means the agents that are leaking memory will not contain any migration data. The migration might have failed because our buffer size was too large. Now that it's effectively 1, if we still find memory issues we might need to reduce the size of the queries (and/or pagination?).

ereteog commented 1 month ago

We have two reduces:

a reduce on each query (1 week of data); we have a seque on the lazy seq in migrate-query:

 data-queue (seque buffer-size
                   (read-source read-params))
 new-migrated-count (reduce #(write-target %1 %2 services)
                            migrated-count
                            data-queue)]

and a reduce over the queries:

(reduce #(migrate-query %1 %2 services)
        base-params
        queries)

We did observe memory increasing over hours, which spans several queries.
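The nesting described above can be sketched with iteration in the inner loop; run-query, write-batch!, and the query maps are illustrative names for this sketch, not CTIA's actual code.

```clojure
;; Hypothetical sketch of the nesting: an outer reduce over the weekly
;; queries, and per query an inner reduce over that week's pages.
(defn run-query [query]
  ;; one page of the week per step; a nil return stops the iteration
  (iteration (fn [page]
               (when (< page (:pages query))
                 {:batch [(:week query) page] :next (inc page)}))
             :initk 0
             :vf :batch
             :kf :next))

(defn write-batch! [state batch]
  ;; stand-in for writing a batch to the target store
  (update state :written inc))

(reduce (fn [state query]
          (reduce write-batch! state (run-query query)))
        {:written 0}
        [{:week 1 :pages 2} {:week 2 :pages 3}])
;; => {:written 5}
```

With this shape there is no seque left between the reads and the writes, so nothing buffered can outlive a query.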

In any case that's worth migrating to an idiomatic abstraction, right?

frenchy64 commented 1 month ago

> In any case that's worth migrating to an idiomatic abstraction, right?

Yes, I prefer reducing over an iteration. It will at least reduce the possible explanations if we continue to see memory issues.