Closed jbw976 closed 5 years ago
@kokhang it would be good to get your take on this issue if you can find the time.
This is an odd issue. Does the format happens every time an attach occurs? As mentioned by @jbw976 , the agent checks whether the disk is formatted by runing lsblk
. Look here https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/util/mount/mount_linux.go#L540.
If the result is empty, then it means it doesnt have a filesystem. It would be interesting what the output of that command is.
Another thing that im thinking is do you have raw block volume feature on? https://kubernetes.io/docs/concepts/storage/persistent-volumes/#pod-specification-adding-raw-block-device-path-in-container
I dont think this could be an issue because if you dont specify the volumeMode
, it defaults to Filesystem, where it formats your filesystem.
@travisn could you please take a look into this issue as well? The standard k8s mounter check to determine if a volume is already formatted does not have understanding of RBD at all. would ceph-volume perhaps be a better way for the rook-agent to make this check?
@jbw976 @kokhang independent of whether ceph-volume would be able to do better at detecting whether to format, why is the volumeattach deleted before the detach is actually performed as @jbw976 pointed out here? Shouldn't that happen after the successful detach to prevent another attach until it's actually detached?
K8s: 1.9.3 Rook: alpha/0.7.1
If I delete a pod of a Deployment, a new pod is created immediately. This new pod shows error because the volume is still mounted to the old pod. After some seconds the old pod is gone and the new pod enters running state ; with a freshly formatted volume.
This is reproducable.
If I scale the deployment to 0, wait till the old pod is deleted and then scale up to 1 again, the volume will not be formatted.
I could not reproduce this just now using by deleting the pod from the mysql deployment defined in https://raw.githubusercontent.com/rook/rook/master/cluster/examples/kubernetes/mysql.yaml. I created a hello file on the volume before deleting the pod and when the new pod came up with the volume attached, the hello file was still there.
This doesn't appear to be 100% reproducible, I was not able to reproduce with the following:
v0.7.1
and Kubernetes v1.9.3
v0.7.0-11.g43352df
and Kubernetes v1.9.1
@travisn I just reviewed the attachment removal logic again, called here, implemented here. I think the tricky part here is:
We have a window in between removing the final mount and performing the detach successfully. This can be exacerbated by the detach failing and having to retry. When that happens, we have an volumeattachment object, but containing no attachment records. The agent thinks its perfectly safe to try to attach again because this looks the same as the case where the very first attachment attempt ever fails and a retry occurs, as noted in this comment.
Distinguishing between the following two cases needs to occur in order to deal with this issue:
EDIT: actually, it may indeed be perfectly safe for the attachment to happen in this window, as long as the attachment was idempotent and respected the existing filesystem (i.e. it did not perform a format)! So perhaps having knowledge of the filesystem on an RBD volume is the right way to go here.
@jbw976 ok, to summarize this means that before we format and mount, we need to check if the volume already has a filesystem. The critical time to check would be here?
yes @travisn that block is where more knowledge of RBD would help reliably determine if its safe to mount. note that the k8s FormatAndMount
function will do the format for you if it thinks it's safe, so we'd need to circumvent that some way.
also the logging in https://github.com/rook/rook/blob/master/cmd/rookflex/cmd/mount.go#L146 is misleading because that's saying a format always happens when that isn't true.
as @christianhuening pointed out, sometimes the k8s mounter recognizes the RBD filesystem and sometimes it doesn't. It's not clear yet why it behaves inconsistently.
I did some futher testing of the involved code in kubernetes and found that the code indeed recognizes the disk to be formatted on a node where the lsblk -o FSTYPE /dev/rbd?
command doesnt actually return an actual FSTYPE. Here's what's happening:
First I ran lsblk on the node we're using:
icc-mars-node-03 ~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 295.7M 0 loop /usr
sda 8:0 0 233.8G 0 disk
|-sda1 8:1 0 200M 0 part
|-sda2 8:2 0 46.6G 0 part
|-sda3 8:3 0 619.9M 0 part
-sda4 8:4 0 186.3G 0 part /
rbd0 253:0 0 50G 0 disk /var/lib/kubelet/pods/39a56677-26dc-11e8-abf6-005056b706ed/volumes/rook.io~rook/pvc-d45c77ad-26da-11e8-abf6-005056b706ed
rbd1 253:16 0 15G 0 disk /var/lib/kubelet/pods/76675ee1-2867-11e8-b626-005056b793c9/volumes/rook.io~rook/pvc-22146104-170b-11e8-b0ba-005056b78eff
rbd2 253:32 0 10G 0 disk /var/lib/kubelet/pods/76b3fea4-2867-11e8-b626-005056b793c9/volumes/rook.io~rook/pvc-ad46adb9-2125-11e8-b0bb-005056b78eff
rbd3 253:48 0 250G 0 disk /var/lib/kubelet/pods/76e475fa-2867-11e8-b626-005056b793c9/volumes/rook.io~rook/pvc-2368ab7c-170b-11e8-b0ba-005056b78eff
icc-mars-node-03 ~ # lsblk -o FSTYPE
FSTYPE
squashfs
vfat
hfsplus
hfsplus
ext4
icc-mars-node-03 ~ # lsblk -o FSTYPE /dev/rbd3
FSTYPE
icc-mars-node-03 ~ #
So for a mounted and used volume at /dev/rbd3
lsblk
doesn't see an acrtual FSTYPE.
Yet the following code snippet returns a result, which is interpreted as the drive being formatted:
package main
import (
"strings"
"fmt"
)
func main() {
outputFromANode := `FSTYPE
`
result, err := GetDiskFormat(outputFromANode)
if err != nil {
panic(err)
}
fmt.Println("result: " + result)
}
// Gode from https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/util/mount/mount_linux.go#L546
func GetDiskFormat(output string) (string, error) {
// Split lsblk output into lines. Unformatted devices should contain only
// "\n". Beware of "\n\n", that's a device with one empty partition.
output = strings.TrimSuffix(output, "\n") // Avoid last empty line
lines := strings.Split(output, "\n")
if lines[0] != "" {
// The device is formatted
fmt.Println("the device is formatted!")
return lines[0], nil
}
if len(lines) == 1 {
// The device is unformatted and has no dependent devices
fmt.Println("the device is NOT formatted!")
return "", nil
}
// The device has dependent devices, most probably partitions (LVM, LUKS
// and MD RAID are reported as FSTYPE and caught above).
return "unknown data, probably partitions", nil
}
Output
the device is formatted!
result: FSTYPE
Process finished with exit code 0
So essentially an output with just FSTYPE\n
but empty actual types returns true for that code.
EDIT: As was already mentioned in K8s 1.10 the usage of lsblk
was dumped in favor of blkid
(https://github.com/kubernetes/kubernetes/blob/release-1.10/pkg/util/mount/mount_linux.go#L562). The question remains though, when under the above circumstances a drive would be formatted?!
@christianhuening did you truncate the variable outputFromANode
in the post? It looks like GetDiskFormat
would simply return the string FSTYPE
since it is the first line of the text passed to the method.
@travisn no, that is what lsblk
outputs on my ContainerLinux nodes with mounted and working /dev/rbd? devices. Just
FSTYPE
<newline>
@christianhuening per https://github.com/kubernetes/kubernetes/pull/59050#issue-166035953, you need to install udev rules https://github.com/ceph/ceph/blob/v13.0.0/udev/60-ceph-by-parttypeuuid.rules
@christianhuening this looks like this won't reproduce anymore on Kubernetes 1.10, have you guys upgraded yet to 1.10? If you have not yet, is it feasible to install the udev rules as @rootfs suggested? Thanks!
@jbw976 Yes we updated to 1.10 by now. It's working and it was working with 1.9.x as well. What i was wondering is WHY it was working with the output generated from 'lsblk' as there was no real information in it and the code basically just checked for the line with "FSTYPE" being present.
hmm so I guess we've got to put that one into the mystery section? Or are there anymore ideas?
It seems that we encountered a similar issue. It happened a couple time in a span of ~1 month.
Deleted a pod, lost all data that was supposed to be mounted. It usually works fine, and when we retried deleting the same pod later it worked as usual, data was kept. This is a bit scary TBH.
Here's the relevant logs from rook-agent: https://gist.github.com/mickymiek/9875aab127ffdd9fbf2a1fef47b021e2
We're using k8s v1.10.0
and rook v0.7.1
.
If anyone has an idea... Thanks in advance!
Edit: I did reproduce the issue on another pod. Each time we lose our data, we get this error:
2018-04-17 13:41:35.745120 E | flexdriver: Detach volume from /var/lib/kubelet/pods/6071ef5b-3bd1-11e8-b502-0cc47ab4037e/volumes/rook.io~rook/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e failed: Failed to detach volume block/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e: failed to detach volume block/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e cluster rook. failed to unmap image block/pvc-606ca30e-3bd1-11e8-b502-0cc47ab4037e: Failed to complete rbd: exit status 16. output: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy
After spending more time on this, I haven't been able to repro it or find the solution. When a pod is deleted, the fencing is preventing the pod from being mounted again until it is ready. I did observe that lsblk is not as reliable at detecting whether the volume has been formatted. At least from the toolbox pod, lsblk
shows blank info for the FSTYPE, while blkid
shows the correct file system.
I would have expected K8s 1.10 to improve this situation, but since it has been hit on K8s 1.10 by @mickymiek, we see that the error condition is also hitting with blkid
. @christianhuening did you repro or try your code snippet on 1.10?
@travisn No, the code in 1.10.x has changed and hence i cannot repro it anymore. Another suggestion by @chaosaffe was that maybe the volumes in the error cases don't get fully mounted in the first place but the system thinks so. All writes would then go to the local host and thus essentially into thin air upon pod restart. But I haven't tested his theory yet.
@mickymiek do you still have this cluster running by chance? Or can you let me know if you hit this again? I'm looking for a repro on 1.10. Thanks
Hi @travisn , sorry to keep you waiting. We now have a cluster running on 1.10.1, we didn't encountered this problem since my last comment, but I haven't actively tried to reproduce. Will do tests tomorrow, I'll let you know about the results.
thanks @mickymiek for working on the repro!
I tried to reproduce many times without success. We're now running a new cluster (different machines). So I ran tests to see if disks change could have impacted the results, but it doesn't seem like it. So what changed since the last time we lost data is our k8s upgrade (1.10.0 -> 1.10.1) and the coreOS version we're using (1745.0.0 -> 1758.0.0).
I'll keep in touch if I happen to reproduce, but everything seems stable for now. @travisn
@mickymiek thanks for looking into it, we will keep this issue active for now to confirm that it is stable on k8s 1.10.
@christianhuening can you confirm if you are seeing this issue on 1.10?
@travisn I did not see that issue on 1.10 again.
Closing this for now since we have not had a report of it since 1.10. Please reopen if we get a repro on 1.10 or newer
I'm using rook 0.7.1. The same issue happened on a node with kubelet 1.10.3:
2018-06-22 09:42:23.406885 E | flexdriver: Detach volume from /var/lib/kubelet/pods/7f14e117-7531-11e8-8ab9-74d4358cc8ef/volumes/rook.io~rook/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef failed: Failed to detach volume block-storage-pool/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef: failed to detach volume block-storage-pool/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef cluster rook-block. failed to unmap image block-storage-pool/pvc-7f13c65c-7531-11e8-8ab9-74d4358cc8ef: Failed to complete rbd: exit status 16. output: rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy
The cluster's master is running kubelet 1.9.4. The pod using this volume restarted somehow. However, I cannot reproduce it by by deleting pod or scaling deployment...
Entire log is here
Yesterday i got that issue with an rbd volume with a lot of small files. If the pod Is rescheduled docker completly freeze and the node become « unavailable » for minutes and then the volume got formated . It happen every Time with only that volume. It’s a 300gb volume filled with ~50kb files (probably 4 Millions files)
@sinqinc which version of rook are you using?
@christianhuening 0.8.1
4 Nodes CoresOS (stable channel) Rancher 2.0.7
cluster: id: 8157e138-3969-4fbb-bed3-6a74c77432c6 health: HEALTH_OK
services: mon: 3 daemons, quorum rook-ceph-mon15,rook-ceph-mon17,rook-ceph-mon11 mgr: a(active) mds: share-1/1/1 up {0=share-564b7c4664-jprdn=up:active}, 1 up:standby-replay osd: 16 osds: 16 up, 16 in
data: pools: 4 pools, 740 pgs objects: 170k objects, 415 GB usage: 458 GB used, 476 GB / 934 GB avail pgs: 740 active+clean
io: client: 21330 kB/s rd, 135 mB/s wr, 1293 op/s rd, 2954 op/s wr
We need to reproduce this behavior, potentially following the scenario described in https://github.com/rook/rook/issues/1553#issuecomment-415428583:
It’s a 300gb volume filled with ~50kb files (probably 4 Millions files)
@sinqinc are you still having that problem? What did you do to overcome / work around it?
I just wanted to chime in and say that I've had this issue multiple times and lost data - I'm going to be moving back to longhorn for a while until this has been resolved.
I've always managed to do it the same way (and still haven't learnt):
Sadly I don't have any current logs, but what happens last is:
I'm currently debugging with the rook-tools pod and it looks like some of the old volumes are still in rbd, so it may be possible to retrieve the data - I'll let you know my findings.
@christianhuening We still have this issue but it doesn't happen every time like I thought. I opened another issue some time ago and it seems to a relation between them #1828
@jlaw90 thanks! can you also provide your versions of Kubernetes, Operating System and rook?
@christianhuening sorry, should have included that.
Kubernetes version is v1.10.3 Operating System is Ubuntu 16.04 (xenial) rook version is v0.7.1
I've spent a bit of time investigating and there are quite a few items showing in rbd ls
that I'm unable to map and see what's inside - also the sizes reported by rbd for mapped volumes don't seem to match the sizes reported by df on the host node. I'll keep digging and let you know anything else I find...
We believe this will be fixed by #2121 since Rook was not picking up the 1.10 version of the k8s client packages. Updating the client libraries is a rather large change so we will need to let it stabilize in master before we look at backporting.
@travisn oh that would be great! any help needed?
Users @chaosaffe and @apinnecke reported their mongoDB data completely wiped out and it appears the rook-agent caused this to happen. The best guess at the repro steps right now are to delete a pod with a rook-block volume. The environment is also on very slow infrastructure which may be causing a timing issue to be surfaced.
Log and related objects can be found in https://gist.github.com/jbw976/3f354a51b80ff3fdea735d263b10d40a
The log file has a few notes of interest:
There appears to be an attach operation that gets called in the middle of on ongoing detach operation. I'm not sure why the Kubelet would make this call, but our fencing doesn't appear to be blocking it appropriately:
The unmap operation during detach fails because the device is busy, the device never gets unmapped:
The 2nd attach operation thinks the device is not formatted:
The Kubernetes mounter code that does this check uses
lsblk -n -o FSTYPE <disk>
, which is not clear to me how that would behave for a RBD device that failed to unmap and doesn't have any mounts on it. We should look further into that.Since the k8s mounter believes the disk isn't formatted, it goes ahead and performs the format again, wiping out all the data, which appears to take almost 3 minutes to complete: