bwjoh opened this issue 1 month ago
I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It is using Lustre client version 2.12.8.
Was this custom AMI built using an FSx for Lustre vended 2.12.8 Lustre client?
Realized my initial post is a bit unclear - I haven't tried to reproduce this issue with a generic AWS AMI - I have only tested with a custom AMI.
The AMI we are using has the Lustre client installed following https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html
/kind bug
What happened?
Ran the following job on a cluster with aws-fsx-csi-driver:
OBD devices created when mounting the file system were removed on unmount only ~59% of the time. This was monitored using `lctl dl | wc -l`. There is some documentation about monitoring devices (the Lustre client has a limit of 8192 devices) here: https://aws.amazon.com/blogs/storage/best-practices-for-monitoring-amazon-fsx-for-lustre-clients-and-file-systems/
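For reference, a minimal sketch of the kind of device-count check described in that post, assuming `lctl` is available on the node; the warning threshold is an arbitrary example, not a value from this report:

```bash
#!/usr/bin/env bash
# Sketch: count Lustre OBD devices on a client node and warn as the count
# approaches the Lustre client's 8192-device limit.
# THRESHOLD is an illustrative placeholder.
LIMIT=8192
THRESHOLD=7000

count=$(sudo lctl dl 2>/dev/null | wc -l)
echo "Current OBD device count: ${count} (limit ${LIMIT})"

if [ "${count}" -ge "${THRESHOLD}" ]; then
  echo "WARNING: OBD device count is approaching the limit; consider recycling this node" >&2
fi
```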
What you expected to happen?
After running the above job, `lctl dl | wc -l` would show 0.

How to reproduce it (as minimally and precisely as possible)?
I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It is using Lustre client version 2.12.8.
On the host instance I haven't been able to reproduce this behaviour with `mount` and `umount` directly. There are no obvious errors in syslog on the host when devices are not removed (the fsx-driver logs related to unmounting are all successful). I am not sure if this is an issue with the Lustre client version, something specific to the CSI workflow, or something else.
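For context, this is roughly the kind of host-level check meant above; the filesystem DNS name, mount name, and mount point below are placeholders, not the actual values used:

```bash
# Sketch of the manual host-level test: mount the FSx for Lustre filesystem
# directly, unmount it, and compare OBD device counts before and after.
# FSX_DNS, MOUNT_NAME, and MOUNT_POINT are placeholder values.
FSX_DNS="fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"
MOUNT_NAME="abcdefgh"
MOUNT_POINT="/mnt/fsx"

sudo mkdir -p "${MOUNT_POINT}"
echo "OBD devices before mount:  $(sudo lctl dl | wc -l)"

sudo mount -t lustre -o relatime,flock "${FSX_DNS}@tcp:/${MOUNT_NAME}" "${MOUNT_POINT}"
echo "OBD devices while mounted: $(sudo lctl dl | wc -l)"

sudo umount "${MOUNT_POINT}"
echo "OBD devices after unmount: $(sudo lctl dl | wc -l)"  # should drop back to the pre-mount count
```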
Anything else we need to know?
This has been problematic as we have workflows with short-lived pods, and nodes need to be recycled frequently to avoid hitting the Lustre client's 8192-device limit.
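A rough sketch of that recycling workaround, assuming `kubectl` access from wherever the check runs; the node name and threshold are illustrative placeholders:

```bash
# Sketch: cordon and drain a node once its OBD device count gets close to the
# 8192-device limit, so new short-lived pods land on fresher nodes.
# NODE and THRESHOLD are illustrative placeholders.
NODE="ip-10-0-0-1.ec2.internal"
THRESHOLD=7000

count=$(sudo lctl dl | wc -l)
if [ "${count}" -ge "${THRESHOLD}" ]; then
  kubectl cordon "${NODE}"
  kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data
fi
```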
This may also be related to some memory issues we have had on nodes - `/proc/vmallocinfo` ends up with many `cfs_hash_buckets_realloc` entries (which look related to the Lustre client) from the leftover devices. We have not found a way to remove these leftover devices besides recycling nodes. Any confirmation that others are hitting this issue, or guidance on how to avoid it, would be appreciated!
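A small sketch of how those entries can be quantified on a node, assuming root access:

```bash
# Sketch: count cfs_hash_buckets_realloc entries in /proc/vmallocinfo and sum
# their allocation sizes (the second field of each line is the size in bytes).
sudo grep -c cfs_hash_buckets_realloc /proc/vmallocinfo
sudo awk '/cfs_hash_buckets_realloc/ { total += $2 } END { print total, "bytes" }' /proc/vmallocinfo
```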
Environment
Kubernetes version (use `kubectl version`): 1.30.3