OOMs on new application package deployments

rahul-muttineni-okcupid commented 2 years ago

Describe the bug I was running experiments in which I included enough ML models to increase the application package size from 1.5MB to 40MB. The deployment of the 40MB package failed, but the update still seems to have been picked up, because I saw the generation number bumped in the config server logs. The call to the activate application endpoint failed because the config server OOM'd, and the update took so long to propagate that the call timed out. Since then, we've reverted the change (the ML models are no longer included), but we still get OOMs on every deploy. Each one produces a massive heap dump (up to 2GB in size), which eventually fills the file system and forces us to delete the heap dumps. We've deployed the config server with a max heap size of 2GB.

To Reproduce Perhaps deploy a large application package over and over again for 115 generations (fewer generations would probably do it) with a max heap size of 2GB configured.

Expected behavior I expect the config server's memory usage not to increase with each deploy, especially when the newer packages are smaller.

Screenshots java_pid274_Leak_Suspects.zip I ran Eclipse's MAT tool against the heap profile and attached the generated report in case it helps.

Environment (please complete the following information):

OS: Linux
Infrastructure: AWS / self-hosted
Versions: 7.329.19

Vespa version 7.329.19

Additional context In the leak suspects report I see mentions of ZooKeeper, which potentially indicates that ZooKeeper is keeping the full history of all application packages (in this case we're on generation 115). Is this expected behavior? And if so, is the solution to just bump up the max heap size, or is there a way to discard old generations if we have no intention of going back to old versions?

kkraune commented 2 years ago

A 2GB heap is a bit small for config servers, yes. We run with a larger heap, and as you observed, this matters more with larger packages. I am assigning this to @hmusum, the config server expert, for details, and we will update https://docs.vespa.ai/en/operations/configuration-server.html accordingly. Thanks for reporting!

rahul-muttineni-okcupid commented 2 years ago

@kkraune Thank you for the confirmation. To my other question - is there a way to delete / garbage collect some of the older generations?

hmusum commented 2 years ago

Garbage collection is based on time: generations older than sessionLifeTime (see https://github.com/vespa-engine/vespa/blob/master/configdefinitions/src/vespa/configserver.def#L28) will be deleted by a maintainer that runs every 30 seconds.
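To make the time-based rule concrete, here is a minimal Python sketch of what such a periodic cleanup pass could look like. It is not the config server's actual implementation: the `Session` type, the `expire_sessions` function, the assumption that sessionLifeTime is measured in seconds, and the choice to always keep the active generation are all hypothetical and used only for illustration.

```python
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Session:
    """Hypothetical stand-in for one deployed generation."""
    generation: int
    created_at: float  # seconds since the epoch


def expire_sessions(
    sessions: List[Session],
    session_lifetime_s: float,
    now: Optional[float] = None,
    active_generation: Optional[int] = None,
) -> Tuple[List[Session], List[Session]]:
    """Split sessions into (kept, expired) using a purely time-based rule.

    A session expires once it is older than session_lifetime_s.
    Assumption for this sketch: the active generation is never expired.
    """
    now = time.time() if now is None else now
    kept, expired = [], []
    for session in sessions:
        too_old = (now - session.created_at) > session_lifetime_s
        if too_old and session.generation != active_generation:
            expired.append(session)
        else:
            kept.append(session)
    return kept, expired
```

Under this rule the number of retained generations depends on how often you deploy within one lifetime window, not on a fixed count, so frequent deploys of large packages can still accumulate until the maintainer catches up.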

sessionLifeTime can be configured by adding a file /opt/vespa/conf/configserver-app/configserver-config.xml containing, e.g.:

<config name="cloud.config.configserver">
  <sessionLifeTime>1800</sessionLifeTime>
</config>
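If you manage config server hosts with scripts, the override file can be generated rather than written by hand. The following Python sketch renders the same XML as above. The function names `render_override` and `write_override` are hypothetical, and the sketch only produces the file; it assumes nothing about when the config server picks the file up.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def render_override(session_lifetime: int) -> str:
    """Render the configserver-config.xml override as a string."""
    config = ET.Element("config", {"name": "cloud.config.configserver"})
    ET.SubElement(config, "sessionLifeTime").text = str(session_lifetime)
    ET.indent(config)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(config, encoding="unicode") + "\n"


def write_override(path: str, session_lifetime: int) -> None:
    """Write the override to the given path (path taken from the comment above)."""
    Path(path).write_text(render_override(session_lifetime), encoding="utf-8")


if __name__ == "__main__":
    # Print instead of writing, so the sketch is safe to run anywhere.
    print(render_override(1800), end="")
```

To write the file in place, call `write_override("/opt/vespa/conf/configserver-app/configserver-config.xml", 1800)` as a user with write access to that directory.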

vespa-engine / vespa

OOMs on new application package deployments #22853