hyperhq / hyperd

HyperContainer Daemon
http://www.hypercontainer.io
Apache License 2.0

Pods cannot launch due to no available IP addresses #733

Closed spearl closed 6 years ago

spearl commented 6 years ago

To reproduce: run, then stop and/or remove, as many pods as there are addresses in the local address space hyperd is using. On CentOS, hyperd uses 192.168.123.1/24, so 253 pods will trigger this error.
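A rough sketch of where the figure 253 comes from. It assumes that the network address, the broadcast address, and one address held by the bridge itself are unavailable to pods; that breakdown is my assumption, not something taken from hyperd's code, and `usablePodAddrs` is a hypothetical helper name.

```go
package main

import "net"

// usablePodAddrs estimates how many addresses in cidr are left for pods.
// Assumption: three addresses are reserved (network, broadcast, and one
// for the bridge/gateway). This is not taken from hyperd itself.
func usablePodAddrs(cidr string) (int, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return 0, err
	}
	ones, bits := ipnet.Mask.Size()
	total := 1 << uint(bits-ones) // 256 for a /24
	if total < 4 {
		return 0, nil // too small to hold the reserved addresses plus a pod
	}
	return total - 3, nil
}
```

Under those assumptions, calling `usablePodAddrs("192.168.123.1/24")` gives 253, which matches the number of pods needed to hit the error.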

After that many pods have been launched, hyperd cannot run new pods. The error message is: hyperctl ERROR: Error from daemon's response: no available ip addresses on network

When pods are stopped or deleted, their IPs should be freed for reuse, but this does not happen.

The error message comes from runv's IPAllocator here, when it traverses all of its IP allocations and cannot find a free one, which means the IPs are not being properly freed.
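To illustrate the failure mode, here is a toy allocator that scans for a free slot and fails once every slot is marked in use. It is only a sketch of the general pattern; `toyAllocator`, `request`, and `release` are hypothetical names and this is not runv's actual IPAllocator code.

```go
package main

import "errors"

// errNoAvailableIP reuses the wording of the message seen in the report.
var errNoAvailableIP = errors.New("no available ip addresses on network")

// toyAllocator tracks which slots (standing in for IPs) are in use.
type toyAllocator struct {
	inUse []bool
}

func newToyAllocator(size int) *toyAllocator {
	return &toyAllocator{inUse: make([]bool, size)}
}

// request scans every slot and hands out the first free one.
func (a *toyAllocator) request() (int, error) {
	for i, used := range a.inUse {
		if !used {
			a.inUse[i] = true
			return i, nil
		}
	}
	return 0, errNoAvailableIP
}

// release marks a slot as free again. If it is never called,
// request eventually fails even though no pod is running.
func (a *toyAllocator) release(i int) {
	if i >= 0 && i < len(a.inUse) {
		a.inUse[i] = false
	}
}
```

If `release` is skipped on every stop or remove, the allocator fills up after exactly as many requests as it has slots, regardless of how many pods are still alive.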

The IPs (and all other resources) are freed here, but I believe a race condition prevents execution from ever reaching this point.

Right before this, here, the pod cleanup() that is triggered by the VM shutdown channel event checks whether the pod is already stopped and, if so, returns on the assumption that there is nothing left to do.

I believe this is where the error originates. The stop or remove command usually finishes shutting down the VM and moving the pod into the STOPPED status before this code even runs.

The synchronous shutdown code here does nothing to free the IP reserved by the pod; it simply deletes everything in hyper's DB and marks the pod as stopped.

The solution is likely either to add pod interface cleanup to the synchronous shutdown code path, at the point where the interface is deleted from leveldb, or to allow the async pod cleanup to carry on with its job even if the pod is already stopped.
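The ordering problem and the first proposed fix can be modelled in a few lines. This is a simplified, single-threaded sketch that just replays the order described above (synchronous stop first, async cleanup second); every name in it (`ipPool`, `pod`, `syncStop`, `asyncCleanup`, `leakedAfterStop`) is hypothetical and none of it is hyperd's real code.

```go
package main

// ipPool stands in for the allocator's record of leased addresses.
type ipPool struct {
	leased map[string]bool
}

func (p *ipPool) release(ip string) { delete(p.leased, ip) }

type pod struct {
	ip      string
	stopped bool
}

// syncStop models the stop/remove command path. With releaseIP == false it
// only marks the pod stopped, mirroring the behaviour described above.
// With releaseIP == true it also frees the address (the proposed fix).
func syncStop(p *pod, pool *ipPool, releaseIP bool) {
	if releaseIP {
		pool.release(p.ip)
	}
	p.stopped = true
}

// asyncCleanup models the handler triggered by the VM shutdown event.
func asyncCleanup(p *pod, pool *ipPool) {
	if p.stopped {
		return // early exit: assumes nothing is left to clean up
	}
	pool.release(p.ip)
	p.stopped = true
}

// leakedAfterStop runs the sync path, then the async path, and reports
// whether the pod's address is still leased afterwards.
func leakedAfterStop(releaseIP bool) bool {
	pool := &ipPool{leased: map[string]bool{"192.168.123.2": true}}
	p := &pod{ip: "192.168.123.2"}
	syncStop(p, pool, releaseIP)
	asyncCleanup(p, pool)
	return pool.leased[p.ip]
}
```

In this model `leakedAfterStop(false)` reports a leak and `leakedAfterStop(true)` does not, which is why releasing the interface in the synchronous path closes the gap without depending on the async handler.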

gnawux commented 6 years ago

@spearl Thanks very much for reporting this. This bug has gone unnoticed because, in production systems, the IP addresses are managed by external network controllers, so few cases exercise this leak.

To solve this, I prefer:

add pod interface cleanup to the synchronous shutdown code path when the interface is deleted from leveldb

Would you like to contribute a fix for this bug? Otherwise we will fix it.

spearl commented 6 years ago

Thanks for getting back so quickly @gnawux. I quickly put something together ^ and would love any feedback on it.

gnawux commented 6 years ago

Thank you @spearl, we will push it forward ASAP.