This PR changes how we use the GC in order to more tightly control GC pauses:
- perform a full GC cycle after engine (re)configurations
- stop GC during engine breathe loop
- perform a single GC step every 100 breaths
- add a timeline event that measures the latency of GC steps
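
The changes listed above could be sketched roughly as follows. This is a minimal illustration, not the actual engine code: `configure`, `breathe`, `apply_config`, `do_work`, and the counters are hypothetical stand-ins, and `os.clock()` stands in for the timeline event. Only the `collectgarbage` calls are standard Lua.

```lua
-- Hypothetical sketch of the GC policy described above (not Snabb's engine code).
local GC_STEP_INTERVAL = 100  -- breaths between explicit GC steps

local breaths = 0   -- number of breaths taken so far
local gc_steps = 0  -- number of explicit GC steps performed

-- Apply a configuration, then collect the garbage it produced up front.
local function configure (apply_config)
   apply_config()
   collectgarbage("collect")  -- full GC cycle after (re)configuration
   collectgarbage("stop")     -- no automatic GC during the breathe loop
end

-- Run one breath; every GC_STEP_INTERVAL breaths, do one timed GC step.
-- Returns the measured step latency in seconds, or nil if no step ran.
local function breathe (do_work)
   do_work()
   breaths = breaths + 1
   if breaths % GC_STEP_INTERVAL == 0 then
      local start = os.clock()
      collectgarbage("step")
      local latency = os.clock() - start
      gc_steps = gc_steps + 1
      -- Assumption: re-issue "stop" in case an explicit step re-arms
      -- the automatic collector on this Lua implementation.
      collectgarbage("stop")
      return latency
   end
   return nil
end

-- Usage: configure once, then breathe in a loop.
configure(function () end)
for _ = 1, 250 do
   breathe(function () local t = {} end)
end
```

With 250 breaths and a step every 100 breaths, the loop above performs two explicit GC steps. The returned latency is what a timeline event would record.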
Before:
(This will devolve into a sawtooth pattern)
After:
(This will end up looking flat)
Backstory:
We never had any problems with the GC, but recently, when testing lwaftr, I noticed some instances of packet drops within the first hour of runtime and correlated them with larger deltas in heap size. My hypothesis is that as long as the GC works in small steps (as is typical in steady-state workloads) all is fine, but when a single step does too much GC work the pause becomes excessive and leads to drops.
Since these drops/pauses/deltas occurred only relatively shortly after startup, I am assuming that the cause is the garbage produced by configuration combined with the GC's inability to split up its work into smaller steps.
I've tried messing around with various GC configuration knobs (LUAI_GCPAUSE, LUAI_GCMUL, GCSWEEPCOST, GCSWEEPMAX, GCSTEPSIZE) but have not managed to improve things this way.
I've also tried to do a "manual" full GC cycle after engine configuration, and while that does move a lot of "one-time" GC work out of the engine loop it didn't resolve the packet drops.
As for regressions, I've ruled out any changes from the last release (I get the same behavior with Snabb Davion from earlier this year). So it's not due to the LuaJIT changes we recently pulled. Another untested idea is that #1490 changed our GC usage patterns enough to surface these drops.