Open jmf0526 opened 4 years ago
@jmf0526 this is an interesting one, sorry for missing it for so long!
It sounds like what you're suggesting is that etcd has returned an error to Calico even though it successfully handled the request? As a result, we think it failed and attempt to clean up, leaving ourselves in an unexpected state where the allocation exists but the handle has been removed. Is that right?
If so, I'll need to think about whether there's anything we can do here. I don't know if we can just handle transport errors separately, since we don't know for sure whether the request succeeded or not.
If you do hit this and want to get past it, you can clean up the allocation by hand using calicoctl ipam.
Yes, that's exactly what I was trying to say.
I also feel this can't be handled easily. We have already recovered the environment using calicoctl ipam, thanks for the tip.
Expected Behavior
Calico IPAM should tolerate a network fault while updating an IP block in the etcd store. An assigned IP should never be leaked (it should always be releasable).
Current Behavior
I'm using annotation cni.projectcalico.org/ipAddrs to assign static ip to a pod, like
Everything worked fine. After the environment had been running stably for a long time, the pod was evicted (because of a disk shortage on that node), and the new pod got stuck in ContainerCreating status. 'kubectl describe po' suggests that the specified IP is already assigned:
Possible Solution
I checked the Calico etcd data: IP 1771.177.198.2 is occupied by container fd42c6f630ec6d6e9fc6597d981ede59c5df245cebc0a27c676adeb4f9b15556, pod licc2-6755b8ffbd-89v67. The etcd IP block:
but there is no corresponding IPAM handle for k8s-pod-network.fd42c6f630ec6d6e9fc6597d981ede59c5df245cebc0a27c676adeb4f9b15556 in etcd.
The following log was printed in kubelet:
"transport is closing" suggests that there was a network problem while updating the IP block in the etcdv3 store, but the data was actually saved successfully.
The following log shows that calico-ipam didn't release the IP because it couldn't find the corresponding IPAM handle:
After checking the AssignIP function in ipam.go, I think the cause is this: when c.blockReaderWriter.updateBlock returns an error, the IPAM handle gets deleted, but in this scenario the IP block had actually been updated. So when IPAM later tries to release the IP, it can't find the corresponding handle, and the assigned IP never gets released.
https://github.com/projectcalico/libcalico-go/blob/d414dc7c1c754fd3fabf58d769e4c426df3c106f/lib/ipam/ipam.go#L795
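To make the sequence above concrete, here is a minimal, self-contained Go sketch of the suspected failure mode. It is not the libcalico-go code: fakeStore, assignIP, releaseByHandle and leaked are hypothetical names invented for illustration, and the in-memory maps only stand in for the etcd block and handle records. The one thing it takes from the analysis above is the order of operations: create the handle, write the block, and delete the handle if the block write reports an error.

```go
package main

import "errors"

// errTransportClosing stands in for the "transport is closing" error seen in the kubelet log.
var errTransportClosing = errors.New("transport is closing")

// fakeStore is a hypothetical in-memory stand-in for the etcd datastore.
type fakeStore struct {
	blocks       map[string]string // IP -> handle ID recorded in the block
	handles      map[string]bool   // handle IDs that currently exist
	ambiguousErr bool              // when true, block writes persist but still report an error
}

func newFakeStore(ambiguousErr bool) *fakeStore {
	return &fakeStore{
		blocks:       map[string]string{},
		handles:      map[string]bool{},
		ambiguousErr: ambiguousErr,
	}
}

// updateBlock always persists the allocation, then optionally reports a
// transport error anyway, as if the connection dropped after the commit.
func (s *fakeStore) updateBlock(ip, handleID string) error {
	s.blocks[ip] = handleID
	if s.ambiguousErr {
		return errTransportClosing
	}
	return nil
}

// assignIP follows the order described above: create the handle, write the
// block, and roll the handle back on any error from the block write.
func assignIP(s *fakeStore, ip, handleID string) error {
	s.handles[handleID] = true
	if err := s.updateBlock(ip, handleID); err != nil {
		// The rollback assumes the block write did not land.
		delete(s.handles, handleID)
		return err
	}
	return nil
}

// releaseByHandle frees every IP recorded under handleID, but only if the
// handle itself still exists.
func releaseByHandle(s *fakeStore, handleID string) error {
	if !s.handles[handleID] {
		return errors.New("handle not found: " + handleID)
	}
	for ip, h := range s.blocks {
		if h == handleID {
			delete(s.blocks, ip)
		}
	}
	delete(s.handles, handleID)
	return nil
}

// leaked reports whether ip is still allocated in a block whose handle is gone.
func leaked(s *fakeStore, ip string) bool {
	h, ok := s.blocks[ip]
	return ok && !s.handles[h]
}
```

With newFakeStore(true), assignIP returns the transport error, leaked reports true for the address, and releaseByHandle fails with "handle not found", which matches the state we found in etcd. With newFakeStore(false) the same calls assign and release cleanly.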
Steps to Reproduce (for bugs)
I don't know how to reproduce this reliably, but the same IP leak can be triggered by deleting the IPAM handle manually.
Context
Your Environment