Closed by spearl 6 years ago
@spearl Thanks very much for reporting this. This bug has survived because in production systems the IP addresses are managed by external network controllers, so few cases exercise this leak.
To solve this, I prefer the option to "add pod interface cleanup to the synchronous shutdown code path when the interface is deleted from leveldb".
Would you like to contribute a fix for this bug? Otherwise we will fix it.
Thanks for getting back so quickly, @gnawux. I put something together quickly (^) and would love any feedback on it.
Thank you @spearl, we will push it forward ASAP.
To replicate: run, then stop and/or remove, as many pods as there are addresses in the local address space hyperd is using. The address space hyper uses on CentOS is 192.168.123.1/24, so 253 pods will cause this error. After that many pods have been launched, hyperd is incapable of running new pods. The error message is:
hyperctl ERROR: Error from daemon's response: no available ip addresses on network
When pods are stopped or deleted, their IPs should be freed so they can be used again, but this does not happen.
The error message comes from runv's IPAllocator here, when it traverses all of its IP allocations and cannot find a free one, which means the IPs are not being properly freed. The IPs (and all other resources) are freed here, but I believe a race condition keeps execution from ever reaching this point.
Right before this, here, the pod cleanup() that is triggered by the VM shutdown channel event checks whether the pod is already stopped and, if so, returns on the assumption that there is nothing left to do. I believe this is where the error originates. The stop or remove command usually finishes shutting down the VM and moves the pod into the STOPPED status before this code even runs. The synchronous shutdown code here does nothing to free the IP reserved by the pod; it simply deletes everything in hyper's DB and marks the pod as stopped.
The solution is likely either to add pod interface cleanup to the synchronous shutdown code path when the interface is deleted from leveldb, or to allow the async pod cleanup to carry on with its job even if the pod is already shut down.
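The two options can be compared with a small Go model of the suspected sequence. This is a sketch under assumptions, not hyperd code: the pod struct, stopSync, cleanupAsync, and leaks are hypothetical names, and the race is modelled by always running the synchronous path first, which is the ordering I believe happens in practice.

```go
package main

import "fmt"

// pod is a hypothetical, stripped-down model of a pod's state.
type pod struct {
	stopped bool
	ipHeld  bool
}

// stopSync models the synchronous stop/remove path. In the reported
// behaviour it only marks the pod stopped; option 1 also releases the
// interface here.
func (p *pod) stopSync(releaseInterfaces bool) {
	if releaseInterfaces {
		p.ipHeld = false
	}
	p.stopped = true
}

// cleanupAsync models the cleanup triggered by the VM shutdown event. In the
// reported behaviour it returns early for a stopped pod; option 2 lets it
// continue regardless.
func (p *pod) cleanupAsync(skipIfStopped bool) {
	if skipIfStopped && p.stopped {
		return
	}
	p.ipHeld = false
	p.stopped = true
}

// leaks reports whether the IP is still held after the synchronous path
// runs first and the async cleanup runs second.
func leaks(syncReleases, asyncSkipsStopped bool) bool {
	p := &pod{ipHeld: true}
	p.stopSync(syncReleases)
	p.cleanupAsync(asyncSkipsStopped)
	return p.ipHeld
}

func main() {
	fmt.Println("reported behaviour leaks:", leaks(false, true))
	fmt.Println("option 1 (sync path releases) leaks:", leaks(true, true))
	fmt.Println("option 2 (async cleanup continues) leaks:", leaks(false, false))
}
```

In this model either option alone stops the leak. One thing the model glosses over is that option 2 needs the release to be safe to run on a pod whose DB entries are already gone, which is part of why releasing in the synchronous path looks like the simpler change.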