juicedata / juicefs-csi-driver

JuiceFS CSI Driver
https://github.com/juicedata/juicefs
Apache License 2.0

[BUG] Restarting the jfs-mount container breaks existing mounts on the node #343

Closed chudyandrej closed 2 years ago

chudyandrej commented 2 years ago

I would like to highlight unexpected behavior which I found while testing the JuiceFS CSI driver.

What happened: During a stability test, I created ~100 pods with light FS activity. After a few hours, some of the jfs-mount containers, which provide the JuiceFS mounts on the node, started restarting due to memory spikes of more than 5 GB. After jfs-mount restarted successfully, all existing mounts on the given node remained in a broken state.

What you expected to happen: I would expect some failure-recovery behavior, meaning that if jfs-mount is restarted, it should be able to recover the existing mounts on the node.

How to reproduce it (as minimally and precisely as possible): I'm able to reproduce this behavior by manually restarting jfs-mount.

Anything else we need to know? I can identify two issues here.

  1. After restarting jfs-mount, all existing mounts stay in a broken state (screenshot attached, 2022-05-27 15:38).
  2. The root cause of the memory spike in jfs-mount needs to be identified. I have no idea how to debug this and would greatly appreciate help with it.

I'm not sure whether this is a technical limitation or a bug. However, I'm happy to fix this issue if someone can point me to the source of the problem.

As a workaround, I was thinking about running the mount as a sidecar of the worker pod, isolating each worker in its own independent bubble. This approach could have an advantage: the lifecycles of the worker pod and jfs-mount would be coupled, so after an unexpected restart the pod should be able to recover on its own. Is it feasible to go this way, or do you think it is a bad idea? A rough sketch of what I have in mind is below.
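This is untested, and the mount image, metadata URL, and resource names are placeholders: the sidecar runs `juicefs mount` in the foreground over a shared emptyDir with Bidirectional propagation, and the worker mounts the same directory with HostToContainer so it can pick the mount back up after a sidecar restart.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-with-jfs-sidecar        # placeholder name
spec:
  containers:
    - name: jfs-mount                  # sidecar owning the FUSE mount
      image: juicedata/mount:latest    # placeholder mount image/tag
      securityContext:
        privileged: true               # FUSE device access and Bidirectional propagation need this
      # running the mount in the foreground ties its lifetime to the container
      command: ["juicefs", "mount", "redis://redis:6379/1", "/jfs"]  # placeholder metadata URL
      volumeMounts:
        - name: jfs-dir
          mountPath: /jfs
          mountPropagation: Bidirectional    # propagate the FUSE mount out of this container
    - name: worker
      image: busybox                   # placeholder workload image
      command: ["sh", "-c", "sleep infinity"]
      volumeMounts:
        - name: jfs-dir
          mountPath: /data
          mountPropagation: HostToContainer  # re-pick up the mount after a sidecar restart
  volumes:
    - name: jfs-dir
      emptyDir: {}                     # shared directory the mount propagates through
```

If the sidecar crashes, the kubelet restarts just that container, the mount command runs again, and the worker should see the fresh mount through the propagated mountpoint.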

Environment:

zwwhdls commented 2 years ago

Hi @chudyandrej, the mountpoint breaks after the FUSE process restarts. The JuiceFS CSI driver supports recovering mountpoints automatically; please refer to this doc: https://juicefs.com/docs/csi/recover-failed-mountpoint/
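Roughly, the doc boils down to declaring mount propagation on the application container's JuiceFS volume mount, so the container can see the re-created mount after the mount pod restarts. A minimal sketch (pod, image, and PVC names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                        # placeholder application pod
spec:
  containers:
    - name: app
      image: busybox               # placeholder image
      command: ["sh", "-c", "sleep infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
          # lets this container pick the mountpoint back up
          # after the jfs-mount pod is restarted
          mountPropagation: HostToContainer
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: juicefs-pvc     # placeholder PVC backed by the JuiceFS CSI driver
```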