kubernetes-sigs / aws-fsx-csi-driver

CSI Driver of Amazon FSx for Lustre https://aws.amazon.com/fsx/lustre/
Apache License 2.0

OBD devices are not always removed on umount #395

Open bwjoh opened 1 month ago

bwjoh commented 1 month ago

/kind bug

What happened? I ran the following job on a cluster with the aws-fsx-csi-driver installed:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: mount-stress-
spec:
  parallelism: 1
  completions: 100
  ttlSecondsAfterFinished: 10
  template:
    spec:
      containers:
      - name: busybox-mount
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ['sh', '-c', 'echo "Test Job Start" && sleep 15 && echo "Test Job End" && exit 0']
        resources:
          limits:
            memory: "2048Mi"
            cpu: "500m"
          requests:
            memory: "2048Mi"
            cpu: "500m"
        volumeMounts:
          - mountPath: /mnt/fsx/test
            name: fsx-mount
      restartPolicy: Never
      volumes:
        - name: fsx-mount
          persistentVolumeClaim:
            claimName: lustre-test
  backoffLimit: 4

OBD devices created when mounting the file system were removed on unmount only ~59% of the time (monitored using lctl dl | wc -l).

There is some documentation about monitoring devices here (this matters because the Lustre client has a limit of 8192 devices): https://aws.amazon.com/blogs/storage/best-practices-for-monitoring-amazon-fsx-for-lustre-clients-and-file-systems/
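To watch the count on a node while the job runs, something like the following works (the 10-second interval is arbitrary, and lctl typically needs root):

# count of Lustre OBD devices on this node; should return to 0 once all mounts are gone
watch -n 10 'sudo lctl dl | wc -l'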

What you expected to happen? After the above job finishes, lctl dl | wc -l should show 0.

How to reproduce it (as minimally and precisely as possible)? I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It uses Lustre client version 1.12.8.

On the host instance I have not been able to reproduce this behaviour by running mount and umount directly. There are no obvious errors in syslog on the host when devices are left behind (the fsx-csi-driver's unmount logs all report success).
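A host-side loop along these lines does not leak devices for us (the file system DNS name and mount name below are placeholders, not our actual values):

sudo mkdir -p /mnt/fsx/test
for i in $(seq 1 100); do
  sudo mount -t lustre <fs-dns-name>@tcp:/<mountname> /mnt/fsx/test
  sudo umount /mnt/fsx/test
  sudo lctl dl | wc -l   # stays flat across iterations here, unlike the CSI-driven mounts
done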

I am not sure if this is an issue with Lustre client version, something specific to the CSI workflow, or something else.

Anything else we need to know?: This has been problematic because we have workflows with many short-lived pods, so nodes have to be recycled frequently to avoid hitting the Lustre client's 8192-device limit.

This may also be related to memory issues we have had on nodes: /proc/vmallocinfo ends up with many cfs_hash_buckets_realloc entries (which appear to come from the Lustre client) for the leftover devices. We have not found a way to remove these leftover devices other than recycling nodes.
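For anyone checking whether their nodes show the same pattern, the entries can be counted with:

sudo grep -c cfs_hash_buckets_realloc /proc/vmallocinfo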

Any confirmation that others are hitting this issue, or guidance on how to avoid it, would be appreciated!

Environment

jacobwolfaws commented 1 month ago

I have not reproduced this with a generic AWS AMI, only with a customized Ubuntu 20 AMI. It uses Lustre client version 1.12.8.

Was this custom AMI built using an FSx for Lustre vended 2.12.8 Lustre client?

bwjoh commented 1 month ago

Realized my initial post is a bit unclear: I haven't tried to reproduce this issue with a generic AWS AMI; I have only tested with the custom AMI.

The AMI we are using has the Lustre client installed following https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html