rook-agent formatted an in use volume

jbw976 commented 6 years ago

Users @chaosaffe and @apinnecke reported their mongoDB data completely wiped out and it appears the rook-agent caused this to happen. The best guess at the repro steps right now are to delete a pod with a rook-block volume. The environment is also on very slow infrastructure which may be causing a timing issue to be surfaced.

Log and related objects can be found in https://gist.github.com/jbw976/3f354a51b80ff3fdea735d263b10d40a

The log file has a few notes of interest:

There appears to be an attach operation that gets called in the middle of on ongoing detach operation. I'm not sure why the Kubelet would make this call, but our fencing doesn't appear to be blocking it appropriately:
```
2018-03-06 22:26:25.047877 I | flexdriver: unmounting mount dir
...
2018-03-06 22:26:25.240220 I | flexdriver: calling agent to attach
...
2018-03-06 22:26:25.482528 I | ceph-volumeattacher: detaching volume
```

The unmap operation during detach fails because the device is busy, the device never gets unmapped:

failed to unmap image generic/pvc-5333da36-fcf5-11e7-9f4a-0201fa246d34: Failed to complete rbd: exit status 16. output: rbd: run_cmd(udevadm): exited with status 127
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

The 2nd attach operation thinks the device is not formatted:
```
22:26:25.865635   51246 mount_linux.go:404] Disk "/dev/rbd4" appears to be unformatted, attempting to format as type: "ext4" with options: [-F /dev/rbd4]
```
The Kubernetes mounter code that does this check uses lsblk -n -o FSTYPE <disk>, which is not clear to me how that would behave for a RBD device that failed to unmap and doesn't have any mounts on it. We should look further into that.

Since the k8s mounter believes the disk isn't formatted, it goes ahead and performs the format again, wiping out all the data, which appears to take almost 3 minutes to complete:

22:26:25.865635   51246 mount_linux.go:404] Disk "/dev/rbd4" appears to be unformatted, attempting to format...
22:29:24.347948   51246 mount_linux.go:408] Disk successfully formatted

We remove the attachment object from the CRD before we actually perform the detach. This may not be safe and could lead us to try to attach again when the detach hasn't actually happened: https://github.com/rook/rook/blob/master/cmd/rookflex/cmd/unmount.go#L89

jbw976 commented 6 years ago

@kokhang it would be good to get your take on this issue if you can find the time.

kokhang commented 6 years ago

This is an odd issue. Does the format happens every time an attach occurs? As mentioned by @jbw976 , the agent checks whether the disk is formatted by runing lsblk. Look here https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/util/mount/mount_linux.go#L540.

If the result is empty, then it means it doesnt have a filesystem. It would be interesting what the output of that command is.

Another thing that im thinking is do you have raw block volume feature on? https://kubernetes.io/docs/concepts/storage/persistent-volumes/#pod-specification-adding-raw-block-device-path-in-container

I dont think this could be an issue because if you dont specify the volumeMode, it defaults to Filesystem, where it formats your filesystem.

jbw976 commented 6 years ago

@travisn could you please take a look into this issue as well? The standard k8s mounter check to determine if a volume is already formatted does not have understanding of RBD at all. would ceph-volume perhaps be a better way for the rook-agent to make this check?

travisn commented 6 years ago

@jbw976 @kokhang independent of whether ceph-volume would be able to do better at detecting whether to format, why is the volumeattach deleted before the detach is actually performed as @jbw976 pointed out here? Shouldn't that happen after the successful detach to prevent another attach until it's actually detached?

amiga23 commented 6 years ago

K8s: 1.9.3 Rook: alpha/0.7.1

If I delete a pod of a Deployment, a new pod is created immediately. This new pod shows error because the volume is still mounted to the old pod. After some seconds the old pod is gone and the new pod enters running state ; with a freshly formatted volume.

This is reproducable.

If I scale the deployment to 0, wait till the old pod is deleted and then scale up to 1 again, the volume will not be formatted.

jbw976 commented 6 years ago

I could not reproduce this just now using by deleting the pod from the mysql deployment defined in https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/mysql.yaml. I created a hello file on the volume before deleting the pod and when the new pod came up with the volume attached, the hello file was still there.

This doesn't appear to be 100% reproducible, I was not able to reproduce with the following:

Rook v0.7.1 and Kubernetes v1.9.3
a recent master build of Rook v0.7.0-11.g43352df and Kubernetes v1.9.1

@travisn I just reviewed the attachment removal logic again, called here, implemented here. I think the tricky part here is:

the underlying volume can be mounted multiple times on the same node
when unmounting, it's not safe to detach the volume until all mounts are gone
the current implementation doesn't know, from a locking perspective, that it is safe to detach the volume until after it successfully removed the final mount.

We have a window in between removing the final mount and performing the detach successfully. This can be exacerbated by the detach failing and having to retry. When that happens, we have an volumeattachment object, but containing no attachment records. The agent thinks its perfectly safe to try to attach again because this looks the same as the case where the very first attachment attempt ever fails and a retry occurs, as noted in this comment.

Distinguishing between the following two cases needs to occur in order to deal with this issue:

A volume attachment attempt fails and gets retried: in this case there is a volumeattachment CRD that tracks the node/volume/pod, but there are no child attachment records yet.
The final mount on a volume gets unmounted OK but the detachment of the entire volume has not occurred yet: in this case, there is also a volumeattachment CRD that tracks node/volume/pod, but all child attachment records have been removed.

EDIT: actually, it may indeed be perfectly safe for the attachment to happen in this window, as long as the attachment was idempotent and respected the existing filesystem (i.e. it did not perform a format)! So perhaps having knowledge of the filesystem on an RBD volume is the right way to go here.

travisn commented 6 years ago

@jbw976 ok, to summarize this means that before we format and mount, we need to check if the volume already has a filesystem. The critical time to check would be here?

jbw976 commented 6 years ago

yes @travisn that block is where more knowledge of RBD would help reliably determine if its safe to mount. note that the k8s FormatAndMount function will do the format for you if it thinks it's safe, so we'd need to circumvent that some way.

also the logging in https://github.com/rook/rook/blob/master/cmd/rookflex/cmd/mount.go#L146 is misleading because that's saying a format always happens when that isn't true.

as @christianhuening pointed out, sometimes the k8s mounter recognizes the RBD filesystem and sometimes it doesn't. It's not clear yet why it behaves inconsistently.

christianhuening commented 6 years ago

I did some futher testing of the involved code in kubernetes and found that the code indeed recognizes the disk to be formatted on a node where the lsblk -o FSTYPE /dev/rbd? command doesnt actually return an actual FSTYPE. Here's what's happening:

First I ran lsblk on the node we're using:

icc-mars-node-03 ~ # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0    7:0    0 295.7M  0 loop /usr
sda      8:0    0 233.8G  0 disk
|-sda1   8:1    0   200M  0 part
|-sda2   8:2    0  46.6G  0 part
|-sda3   8:3    0 619.9M  0 part
-sda4   8:4    0 186.3G  0 part /
rbd0   253:0    0    50G  0 disk /var/lib/kubelet/pods/39a56677-26dc-11e8-abf6-005056b706ed/volumes/rook.io~rook/pvc-d45c77ad-26da-11e8-abf6-005056b706ed
rbd1   253:16   0    15G  0 disk /var/lib/kubelet/pods/76675ee1-2867-11e8-b626-005056b793c9/volumes/rook.io~rook/pvc-22146104-170b-11e8-b0ba-005056b78eff
rbd2   253:32   0    10G  0 disk /var/lib/kubelet/pods/76b3fea4-2867-11e8-b626-005056b793c9/volumes/rook.io~rook/pvc-ad46adb9-2125-11e8-b0bb-005056b78eff
rbd3   253:48   0   250G  0 disk /var/lib/kubelet/pods/76e475fa-2867-11e8-b626-005056b793c9/volumes/rook.io~rook/pvc-2368ab7c-170b-11e8-b0ba-005056b78eff
icc-mars-node-03 ~ # lsblk -o FSTYPE
FSTYPE
squashfs

vfat
hfsplus
hfsplus
ext4

icc-mars-node-03 ~ # lsblk -o FSTYPE /dev/rbd3
FSTYPE

icc-mars-node-03 ~ #

So for a mounted and used volume at /dev/rbd3 lsblk doesn't see an acrtual FSTYPE. Yet the following code snippet returns a result, which is interpreted as the drive being formatted:

package main

import (
    "strings"
    "fmt"
)

func main() {
    outputFromANode := `FSTYPE
`
    result, err := GetDiskFormat(outputFromANode)
    if err != nil {
        panic(err)
    }
    fmt.Println("result: " + result)
}

// Gode from https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/util/mount/mount_linux.go#L546
func GetDiskFormat(output string) (string, error) {

    // Split lsblk output into lines. Unformatted devices should contain only
    // "\n". Beware of "\n\n", that's a device with one empty partition.
    output = strings.TrimSuffix(output, "\n") // Avoid last empty line
    lines := strings.Split(output, "\n")
    if lines[0] != "" {
        // The device is formatted
        fmt.Println("the device is formatted!")
        return lines[0], nil
    }

    if len(lines) == 1 {
        // The device is unformatted and has no dependent devices
        fmt.Println("the device is NOT formatted!")
        return "", nil
    }

    // The device has dependent devices, most probably partitions (LVM, LUKS
    // and MD RAID are reported as FSTYPE and caught above).
    return "unknown data, probably partitions", nil
}

Output

the device is formatted!
result: FSTYPE

Process finished with exit code 0

So essentially an output with just FSTYPE\n but empty actual types returns true for that code.

EDIT: As was already mentioned in K8s 1.10 the usage of lsblk was dumped in favor of blkid(https://github.com/kubernetes/kubernetes/blob/release-1.10/pkg/util/mount/mount_linux.go#L562). The question remains though, when under the above circumstances a drive would be formatted?!

travisn commented 6 years ago

@christianhuening did you truncate the variable outputFromANode in the post? It looks like GetDiskFormat would simply return the string FSTYPE since it is the first line of the text passed to the method.

christianhuening commented 6 years ago

@travisn no, that is what lsblk outputs on my ContainerLinux nodes with mounted and working /dev/rbd? devices. Just

FSTYPE
<newline>

rootfs commented 6 years ago

@christianhuening per https://github.com/kubernetes/kubernetes/pull/59050#issue-166035953, you need to install udev rules https://github.com/ceph/ceph/blob/v13.0.0/udev/60-ceph-by-parttypeuuid.rules

jbw976 commented 6 years ago

@christianhuening this looks like this won't reproduce anymore on Kubernetes 1.10, have you guys upgraded yet to 1.10? If you have not yet, is it feasible to install the udev rules as @rootfs suggested? Thanks!

christianhuening commented 6 years ago

@jbw976 Yes we updated to 1.10 by now. It's working and it was working with 1.9.x as well. What i was wondering is WHY it was working with the output generated from 'lsblk' as there was no real information in it and the code basically just checked for the line with "FSTYPE" being present.

christianhuening commented 6 years ago

hmm so I guess we've got to put that one into the mystery section? Or are there anymore ideas?

mickymiek commented 6 years ago

It seems that we encountered a similar issue. It happened a couple time in a span of ~1 month. Deleted a pod, lost all data that was supposed to be mounted. It usually works fine, and when we retried deleting the same pod later it worked as usual, data was kept. This is a bit scary TBH. Here's the relevant logs from rook-agent: https://gist.github.com/mickymiek/9875aab127ffdd9fbf2a1fef47b021e2 We're using k8s v1.10.0 and rook v0.7.1. If anyone has an idea... Thanks in advance!

Edit: I did reproduce the issue on another pod. Each time we lose our data, we get this error: 2018-04-17 13:41:35.745120 E | flexdriver: Detach volume from /var/lib/kubelet/pods/6071ef5b-3bd1-11e8-b502-0cc47ab4037e/volumes/rook.io~rook/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e failed: Failed to detach volume block/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e: failed to detach volume block/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e cluster rook. failed to unmap image block/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e: Failed to complete rbd: exit status 16. output: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

travisn commented 6 years ago

After spending more time on this, I haven't been able to repro it or find the solution. When a pod is deleted, the fencing is preventing the pod from being mounted again until it is ready. I did observe that lsblk is not as reliable at detecting whether the volume has been formatted. At least from the toolbox pod, lsblk shows blank info for the FSTYPE, while blkid shows the correct file system.

I would have expected K8s 1.10 to improve this situation, but since it has been hit on K8s 1.10 by @mickymiek, we see that the error condition is also hitting with blkid. @christianhuening did you repro or try your code snippet on 1.10?

christianhuening commented 6 years ago

@travisn No, the code in 1.10.x has changed and hence i cannot repro it anymore. Another suggestion by @chaosaffe was that maybe the volumes in the error cases don't get fully mounted in the first place but the system thinks so. All writes would then go to the local host and thus essentially into thin air upon pod restart. But I haven't tested his theory yet.

travisn commented 6 years ago

@mickymiek do you still have this cluster running by chance? Or can you let me know if you hit this again? I'm looking for a repro on 1.10. Thanks

mickymiek commented 6 years ago

Hi @travisn , sorry to keep you waiting. We now have a cluster running on 1.10.1, we didn't encountered this problem since my last comment, but I haven't actively tried to reproduce. Will do tests tomorrow, I'll let you know about the results.

travisn commented 6 years ago

thanks @mickymiek for working on the repro!

mickymiek commented 6 years ago

I tried to reproduce many times without success. We're now running a new cluster (different machines). So I ran tests to see if disks change could have impacted the results, but it doesn't seem like it. So what changed since the last time we lost data is our k8s upgrade (1.10.0 -> 1.10.1) and the coreOS version we're using (1745.0.0 -> 1758.0.0).

I'll keep in touch if I happen to reproduce, but everything seems stable for now. @travisn

travisn commented 6 years ago

@mickymiek thanks for looking into it, we will keep this issue active for now to confirm that it is stable on k8s 1.10.

travisn commented 6 years ago

@christianhuening can you confirm if you are seeing this issue on 1.10?

christianhuening commented 6 years ago

@travisn I did not see that issue on 1.10 again.

travisn commented 6 years ago

Closing this for now since we have not had a report of it since 1.10. Please reopen if we get a repro on 1.10 or newer

cffan commented 5 years ago

I'm using rook 0.7.1. The same issue happened on a node with kubelet 1.10.3:

2018-06-22 09:42:23.406885 E | flexdriver: Detach volume from /var/lib/kubelet/pods/7f14e117-7531-11e8-8ab9-74d4358cc8ef/volumes/rook.io~rook/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef failed: Failed to detach volume block-storage-pool/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef: failed to detach volume block-storage-pool/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef cluster rook-block. failed to unmap image block-storage-pool/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef: Failed to complete rbd: exit status 16. output: rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

The cluster's master is running kubelet 1.9.4. The pod using this volume restarted somehow. However, I cannot reproduce it by by deleting pod or scaling deployment...

Entire log is here

sinqinc commented 5 years ago

Yesterday i got that issue with an rbd volume with a lot of small files. If the pod Is rescheduled docker completly freeze and the node become « unavailable » for minutes and then the volume got formated . It happen every Time with only that volume. It’s a 300gb volume filled with ~50kb files (probably 4 Millions files)

christianhuening commented 5 years ago

@sinqinc which version of rook are you using?

sinqinc commented 5 years ago

@christianhuening 0.8.1

4 Nodes CoresOS (stable channel) Rancher 2.0.7

cluster: id: 8157e138-3969-4fbb-bed3-6a74c77432c6 health: HEALTH_OK

services: mon: 3 daemons, quorum rook-ceph-mon15,rook-ceph-mon17,rook-ceph-mon11 mgr: a(active) mds: share-1/1/1 up {0=share-564b7c4664-jprdn=up:active}, 1 up:standby-replay osd: 16 osds: 16 up, 16 in

data: pools: 4 pools, 740 pgs objects: 170k objects, 415 GB usage: 458 GB used, 476 GB / 934 GB avail pgs: 740 active+clean

io: client: 21330 kB/s rd, 135 mB/s wr, 1293 op/s rd, 2954 op/s wr

jbw976 commented 5 years ago

We need to reproduce this behavior, potentially following the scenario described in https://github.com/rook/rook/issues/1553#issuecomment-415428583:

It’s a 300gb volume filled with ~50kb files (probably 4 Millions files)

christianhuening commented 5 years ago

@sinqinc are you still having that problem? What did you do to overcome / work around it?

jlaw90 commented 5 years ago

I just wanted to chime in and say that I've had this issue multiple times and lost data - I'm going to be moving back to longhorn for a while until this has been resolved.

I've always managed to do it the same way (and still haven't learnt):

Create a deployment with a pod and large volume provisioned by rook, set the policy to delete all pod replicas before recreating (otherwise it will hang)
Wait until it is running and all is good
Delete the pod (not the deployment)

Sadly I don't have any current logs, but what happens last is:

rbd returns an error rbd: unmap failed: (16) Device or resource busy
rook then formats the device

I'm currently debugging with the rook-tools pod and it looks like some of the old volumes are still in rbd, so it may be possible to retrieve the data - I'll let you know my findings.

sinqinc commented 5 years ago

@christianhuening We still have this issue but it doesn't happen every time like I thought. I opened another issue some time ago and it seems to a relation between them #1828

christianhuening commented 5 years ago

@jlaw90 thanks! can you also provide your versions of Kubernetes, Operating System and rook?

jlaw90 commented 5 years ago

@christianhuening sorry, should have included that.

Kubernetes version is v1.10.3 Operating System is Ubuntu 16.04 (xenial) rook version is v0.7.1

I've spent a bit of time investigating and there are quite a few items showing in rbd ls that I'm unable to map and see what's inside - also the sizes reported by rbd for mapped volumes don't seem to match the sizes reported by df on the host node. I'll keep digging and let you know anything else I find...

travisn commented 5 years ago

We believe this will be fixed by #2121 since Rook was not picking up the 1.10 version of the k8s client packages. Updating the client libraries is a rather large change so we will need to let it stabilize in master before we look at backporting.

christianhuening commented 5 years ago

@travisn oh that would be great! any help needed?

rook / rook

rook-agent formatted an in use volume #1553