I would like to highlight unexpected behavior which I found while testing the JuiceFS CSI driver.
What happened:
Within a stability test, I created ~100 pods with little FS activity. After a few hours, some of the jfs-mount containers, which provide the JuiceFS mount on a node, started restarting due to memory spikes of more than 5 GB. After jfs-mount restarts successfully, all existing mounts on the given node stay in a broken state.
What you expected to happen:
I would expect some recovery behavior after a failure: if jfs-mount is restarted, it should be able to take over the existing mounts on the node.
How to reproduce it (as minimally and precisely as possible):
I'm able to reproduce this behavior by manually restarting jfs-mount.
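For reference, this is roughly how I restart it by hand (namespace and pod name are placeholders; adjust them to whatever hosts the jfs-mount container on the node under test):

```sh
# Find the pod that hosts the jfs-mount container on the affected node.
kubectl -n kube-system get pods -o wide | grep juicefs

# Deleting that pod (its controller recreates it) restarts jfs-mount and is enough
# to break every existing mount on the node.
kubectl -n kube-system delete pod <pod-with-jfs-mount>

# Application pods using the volume then typically fail with a FUSE error such as
# "Transport endpoint is not connected".
```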
Anything else we need to know?
I can identify two issues here:

1. After jfs-mount restarts, all existing mounts on the node stay in a broken state.
2. The root cause of the memory spike in jfs-mount needs to be identified. I have no idea how to debug this and would greatly appreciate help (see the sketch below).
I'm not sure whether this is a technical limitation or a bug. However, I'm happy to fix it if someone can point me to the source of the problem.
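For the memory spike itself, my current plan (untested, and assuming the JuiceFS client exposes Go pprof on localhost:6060 as its diagnosis docs suggest; I have not verified this for the client bundled with v0.14.0) is to capture a heap profile the next time usage climbs:

```sh
# Assumption: the juicefs client inside the jfs-mount container listens on
# localhost:6060 with Go pprof handlers enabled (not verified on this version).
kubectl -n kube-system port-forward <pod-with-jfs-mount> 6060:6060

# In a second terminal, capture and summarize the heap profile.
go tool pprof -top http://localhost:6060/debug/pprof/heap
```

If that is a sensible approach, I can attach profiles here once I catch another spike.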
As a workaround, I was thinking about running the mount as a sidecar of the worker pod, isolating each worker in its own bubble. This approach could have an advantage: the lifecycles of the worker pod and jfs-mount would be coupled, so after an unexpected restart the pod should be able to recover on its own.
Is it possible to go this way, or do you think it is a bad idea?
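A minimal sketch of what I have in mind (untested; the image, metadata URL, and mount path are placeholders, and the Bidirectional propagation requires the sidecar to run privileged):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-with-jfs-sidecar
spec:
  containers:
    - name: jfs-mount                        # sidecar that owns the FUSE mount
      image: juicedata/mount:latest          # placeholder image/tag
      command: ["juicefs", "mount", "redis://<metadata-endpoint>:6379/1", "/jfs"]
      securityContext:
        privileged: true                     # needed for /dev/fuse and Bidirectional propagation
      volumeMounts:
        - name: jfs-dir
          mountPath: /jfs
          mountPropagation: Bidirectional    # propagate the FUSE mount back to the host
    - name: worker
      image: busybox                         # placeholder for the real workload
      command: ["sh", "-c", "sleep infinity"]
      volumeMounts:
        - name: jfs-dir
          mountPath: /jfs
          mountPropagation: HostToContainer  # receive the mount created by the sidecar
  volumes:
    - name: jfs-dir
      emptyDir: {}
```

With this layout, the lifecycle of the mount is tied to one worker pod, so a crash of jfs-mount would affect only that pod instead of every mount on the node.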
Environment:
JuiceFS CSI Driver version (which image tag did your CSI Driver use): v0.14.0
Kubernetes version (e.g. kubectl version): v1.19.16-eks-25803e
Object storage (cloud provider and region): S3, AWS, US East (N. Virginia) us-east-1
Metadata engine info (version, cloud provider managed or self maintained): Amazon ElastiCache, Redis engine, 6.2.5
Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage):