snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

engine/ptree.worker: stop gc in engine, trigger gc steps proactively #1519

Closed. eugeneia closed this 4 months ago

eugeneia commented 9 months ago

This PR changes how we use the GC in order to more tightly control GC pauses:

Before:

[Plots: memory_gc_heap_bytes (this will devolve into a sawtooth pattern) and rxdrop]

After:

[Plots: memory_gc_heap_bytes (this will end up looking flat) and rxdrop]
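A minimal sketch of the idea named in the PR title (stop the automatic GC, then trigger small GC steps explicitly), using only the standard `collectgarbage` API. It is not the code from this PR: `breathe` and `STEP_SIZE` are hypothetical names, and the step size is an untuned placeholder. The sketch also assumes that on Lua 5.1/LuaJIT an explicit step can re-arm the automatic GC threshold, so it calls `"stop"` again after every step.

```lua
-- Hypothetical stand-in for one engine breath; it only produces garbage.
local function breathe ()
   local t = {}
   for i = 1, 100 do t[i] = {i} end
   return #t
end

local STEP_SIZE = 1   -- assumed value, would need tuning per workload
local cycles = 0      -- number of GC cycles completed by explicit steps

collectgarbage("stop")   -- no automatic GC work while "packets" are processed
for breath = 1, 10000 do
   breathe()
   -- One small, explicit increment of GC work per breath.
   -- The call returns true when the step finished a collection cycle.
   if collectgarbage("step", STEP_SIZE) then
      cycles = cycles + 1
   end
   -- Assumption: an explicit step may reset the GC threshold, so stop again.
   collectgarbage("stop")
end

local heap_kb = collectgarbage("count")
print(("heap: %.0f KB, completed cycles: %d"):format(heap_kb, cycles))

collectgarbage("restart")   -- hand control back to the automatic GC
```

The point of the pattern is that GC work happens at a chosen place in the loop and in bounded increments, rather than whenever an allocation happens to cross the collector's threshold.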

Backstory:

We never had any problems with the GC, but recently when testing lwaftr I noticed some instances of packet drops within the first hour of runtime, and correlated them with some larger deltas in heap size. My hypothesis is that as long as the GC works in small steps (as is typical in steady-state workloads) all is fine, but when a single step does too much GC work the pause becomes excessive and leads to drops.

Since these drops/pauses/deltas occurred only relatively shortly after startup, I am assuming that the cause is the garbage produced by configuration, combined with the GC's inability to split up its work into smaller steps.

I've tried messing around with various GC configuration knobs (LUAI_GCPAUSE, LUAI_GCMUL, GCSWEEPCOST, GCSWEEPMAX, GCSTEPSIZE) but have not managed to improve things this way.
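For orientation, `LUAI_GCPAUSE` and `LUAI_GCMUL` are the compile-time defaults for the GC pause and step multiplier, and those two can also be changed at runtime through `collectgarbage`. The other three knobs are, as far as can be told, compile-time constants with no runtime setter. The sketch below only shows the runtime calls; the values 100 and 400 are arbitrary placeholders, not settings tried in this PR.

```lua
-- "setpause" and "setstepmul" each return the previous value of the knob.
local old_pause   = collectgarbage("setpause", 100)    -- placeholder value
local old_stepmul = collectgarbage("setstepmul", 400)  -- placeholder value

print("previous pause:", old_pause, "previous stepmul:", old_stepmul)

-- Restore the previous settings so the rest of the program is unaffected.
collectgarbage("setpause", old_pause)
collectgarbage("setstepmul", old_stepmul)
```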

I've also tried to do a "manual" full GC cycle after engine configuration, and while that does move a lot of "one-time" GC work out of the engine loop it didn't resolve the packet drops.
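A "manual" full cycle of this kind can be sketched as follows. `configure` is a hypothetical stand-in for engine configuration, not a Snabb function; it simply builds and discards many short-lived objects so that there is something to collect.

```lua
-- Hypothetical stand-in for engine configuration: allocates many tables
-- and strings that are no longer referenced once it returns.
local function configure ()
   local scratch = {}
   for i = 1, 50000 do scratch[i] = {name = "link" .. i} end
   return #scratch   -- only the count survives; the tables become garbage
end

local links = configure()

local before = collectgarbage("count")   -- heap size in KB
collectgarbage("collect")                -- one full, blocking GC cycle
local after = collectgarbage("count")

print(("configured %d links; heap %.0f KB -> %.0f KB"):format(links, before, after))
```

Running the full cycle here pays for the configuration garbage once, up front, before the packet-processing loop starts.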

As for regressions, I've ruled out any changes from the last release (I get the same behavior with Snabb Davion from earlier this year), so it's not due to the LuaJIT changes we recently pulled. Another untested idea is that #1490 changed our GC usage patterns enough to surface these drops.

eugeneia commented 4 months ago

This approach masks where allocations actually come from, among other issues. Closing for now.