cmu-rgrempel opened this issue 5 months ago
I did some more investigating today.
First, I'm no longer sure that the problem has anything to do with the order in which pods are deleted. At one point today I saw the problem even when deleting pods in the expected order, so now I think the problem may occur somewhat randomly when pods are deleted.
Today, there were several occasions where a deleted pod would come back up, but winbind was clearly not actually working. (For instance, running `id` inside the pod would produce odd, incomplete results.)
I ran `tdbdump /var/lib/ctdb/persistent/secrets.tdb.X`, and noticed that in cases where pods came back up fine, the expected handful of key/value pairs were present there. In cases where the pods came up problematically, there were only two key/value pairs: one was `__db_sequence_number__\00`, and the other was `SECRETS/SID/CMU-FILESHARE-0`.
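For reference, both checks can be run from outside the pod with `kubectl exec`. This is just a sketch: the pod name is from my deployment, `wb` is the winbind sidecar container, `someuser` stands in for any domain account, it assumes `tdbdump` is available in that container, and the `.X` suffix should be whatever node number actually appears in the persistent directory.

```sh
# Check whether winbind is resolving users properly inside the pod
kubectl exec cmu-fileshare-0 -c wb -- id someuser

# Dump the CTDB copy of secrets.tdb and compare the keys between a pod
# that came back healthy and one that came back broken
kubectl exec cmu-fileshare-0 -c wb -- tdbdump /var/lib/ctdb/persistent/secrets.tdb.X
```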
Reflecting on that, I noticed that while `secrets.tdb` is said to be "persistent" (and lives in a directory called "persistent"), it was actually located in an "emptyDir" in the pod configuration, so that directory was being deleted along with the pod. I wondered whether actually making the directory persistent would make a difference. Initial results suggest that it fixes the problem for me; at least, I've since gone through several rounds of deleting pods, and so far they come back up consistently.
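For anyone who wants to confirm the same thing in their own cluster, the volume backing /var/lib/ctdb/persistent shows up in the generated pod spec. A quick sketch of how to check (the pod name is from my deployment; the volume names will vary):

```sh
# Dump the generated pod spec and check which volume is mounted at
# /var/lib/ctdb/persistent ...
kubectl get pod cmu-fileshare-0 -o yaml

# ... and whether that volume is an emptyDir, which is wiped every
# time the pod is deleted
kubectl get pod cmu-fileshare-0 -o yaml | grep -B 2 -A 2 emptyDir
```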
In any event, my current theory is that the problem is related to the `secrets.tdb` somehow being incompletely restored by ctdb when a pod comes back up. The fact that it sometimes occurred and sometimes didn't suggests some kind of race condition. But I don't have enough actual knowledge of samba and ctdb to know whether those are sensible thoughts.
I'm trying clustered nodes with ctdb for the first time, using version 0.5. I have been able to get it to work nicely with a cephfs backend. It's very pleasing!
I've been experimenting with failover etc. by randomly deleting the pods that the operator creates (to simulate evictions or node failures or whatever). What I'm experiencing is that if I delete the second pod (in my case, `cmu-fileshare-1`), then it comes back up in the expected way. However, if I delete the pods "out of order", that is, delete the first pod (in my case, `cmu-fileshare-0`), then that pod doesn't come back up successfully.
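To make the test concrete, here is roughly what I'm doing (a sketch; the pod names are from my deployment, and each deletion is a separate experiment):

```sh
# Simulate an eviction / node failure by deleting one of the pods the
# operator created, then wait for it to be recreated.

# Deleting the second pod: it comes back up as expected.
kubectl delete pod cmu-fileshare-1

# Deleting the first pod (in a separate run): it does not come back up
# successfully (see the logs below).
kubectl delete pod cmu-fileshare-0

# Watch the replacement pod come back up
kubectl get pods -w
```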
What I see from `kubectl get pods` is this:

And what I see from `kubectl logs cmu-fileshare-0 -c wb` is this:

I'm wondering whether this might be related to #262, another issue that may have something to do with the exact order in which nodes are brought up, and with whether certain initialization steps are performed or skipped.
I'll dive into this further if I have time -- just thought I'd jot down the experience in case it is helpful to anyone.