We have a basic little sharding setup in the cloud, with several edge VMs that have 2GB of memory each.
I noticed that as more requests were handled, memory kept growing until the VMs crashed and restarted.
This was traced back to the following:
We ended up testing with 55MB spread across 45 files (each split into 5 data / 2 parity shards) to simulate a specific load. So each call to ReconstructData only receives ~1.2 to 1.4MB of data, in 5 shards (which can be a mix of data and parity).
Disabling the actual reconstruction shows a small spike to 600MB while it verifies, after which the memory is released. So the verify path is perfectly fine.
BUT if we do any reconstruction, we get the 600MB spike, it drops back down (verify done), and then as it reconstructs, memory spikes to 1.2GB ... stays at 1.2GB ... client receives the files ... still 1.2GB.
So we tried doing only the reconstruction: memory spikes to 1.2GB and just stays there.
If we run the code without any reconstruction and simply stream the data straight out, memory spikes to 600MB (verify) and instantly drops back to normal.
If we run the code without reconstruction or verify and simply stream the data straight out, memory stays stable at its idle level.
The Go GC needs to be triggered manually to clear out the memory; otherwise a second request with the same load pushes memory past 2GB and crashes the VMs.
We disabled the InversionCache; no effect beyond a small reduction in memory usage.
It takes roughly 5+ minutes of idling before the Go GC finally starts to release the memory. And yes, we are 100% sure the issue is Reconstruct / ReconstructData.
We even tried setting the buffer values and the buffer itself to nil to give the GC the best possible chance to collect; no dice. For some reason, something seems to hold on to the memory.
We also tried not using a shared encoder and instead creating one inside the download goroutine; same issue.
As I was writing this, the memory finally dropped back down on this idle VM, i.e. after around 5+ minutes. Will do more testing tomorrow, but we are sure it is ReconstructData.