jon-rei opened 4 months ago
I wrote a small script to check the creation time of the filesystems and found that as they get bigger they take longer to create. I did my tests in eu-west-1 and I'm not sure if this is something specific to that region.
- FSx for Lustre file system with 4800 GiB was available after 6 minutes 44 seconds
- FSx for Lustre file system with 9600 GiB was available after 9 minutes 6 seconds
- FSx for Lustre file system with 24000 GiB was available after 8 minutes 32 seconds
- FSx for Lustre file system with 50400 GiB was available after 9 minutes 5 seconds
- FSx for Lustre file system with 100800 GiB was available after 11 minutes 19 seconds
So I think it would be great if the timeout could either grow dynamically with storage capacity or be available for manual configuration.
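For reference, a rough sketch of how such a timing check could look (this is not the original script; it assumes the AWS SDK for Go v1, credentials and region already configured, and takes an existing filesystem ID as its argument):

```go
// Not the original script — a rough equivalent: poll DescribeFileSystems for a
// just-created filesystem and print how long it takes to reach AVAILABLE.
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
	// Filesystem ID captured right after calling CreateFileSystem, e.g. "fs-1234".
	id := os.Args[1]
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-1")))
	svc := fsx.New(sess)

	start := time.Now()
	for {
		out, err := svc.DescribeFileSystems(&fsx.DescribeFileSystemsInput{
			FileSystemIds: []*string{aws.String(id)},
		})
		if err != nil {
			log.Fatal(err)
		}
		if aws.StringValue(out.FileSystems[0].Lifecycle) == "AVAILABLE" {
			fmt.Printf("FSx for Lustre file system %s was available after %s\n",
				id, time.Since(start).Round(time.Second))
			return
		}
		time.Sleep(30 * time.Second) // still CREATING, poll again
	}
}
```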
@jonded94 and I dug a little deeper to figure out where the RequestCanceled (above) was coming from and found that it's actually happening on the DescribeFileSystemsWithContext call, probably around this line.
From our logs, which we enhanced a bit, we saw that the request that ends in a RequestCanceled takes just under 1 millisecond, while the successful requests take ~200ms. The successful query happens about 9 times in one of our tests, and the 10th time it fails.
Faulty request:
{"ts":1720456920534.6946,"caller":"cloud/cloud.go:392","msg":"DescribeFileSystemsWithContext","v":2,"filesystem":"fs-1234"}
{"ts":1720456920534.7668,"caller":"cloud/cloud.go:394","msg":"DescribeFileSystemsWithContextDone","v":2,"filesystem":"fs-1234","output":"{\n\n}","err":"RequestCanceled: request context canceled\ncaused by: context canceled"}
Working request:
{"ts":1720456860534.6147,"caller":"cloud/cloud.go:392","msg":"DescribeFileSystemsWithContext","v":2,"filesystem":"fs-1234"}
{"ts":1720456860741.3599,"caller":"cloud/cloud.go:394","msg":"DescribeFileSystemsWithContextDone","v":2,"filesystem":"fs-1234","output":"{\n FileSystems: [{\n CreationTime: 2024-07-08 16:37:00.357 +0000 UTC,\n DNSName: \"fs-051ce4079e5c77f58.fsx.eu-west-1.amazonaws.com\",\n FileSystemId: \"fs-1234\",\n FileSystemType: \"LUSTRE\",\n FileSystemTypeVersion: \"2.15\",\n Lifecycle: \"CREATING\",\n LustreConfiguration: {\n CopyTagsToBackups: false,\n DataCompressionType: \"NONE\",\n DeploymentType: \"SCRATCH_2\",\n LogConfiguration: {\n Level: \"DISABLED\"\n },\n MountName: \"wcsrlbev\",\n WeeklyMaintenanceStartTime: \"2:02:00\"\n },\n NetworkInterfaceIds: [\n \"...a lot of enis ~60..."\n ],\n OwnerId: \"1234\",\n ResourceARN: \"arn:aws:fsx:eu-west-1:1234:file-system/fs-1234\",\n StorageCapacity: 124800,\n StorageType: \"SSD\",\n SubnetIds: [\"subnet-1234\"],\n Tags: [{\n Key: \"CSIVolumeName\",\n Value: \"pvc-1234\"\n },{\n Key: \"team\",\n Value: \"abc\"\n }],\n VpcId: \"vpc-1234\"\n }]\n}","err":null}
We also found that we only get these errors for filesystems larger than 100TB. We have successfully created several <100TB FSx volumes. It seems that the large volumes always fail after 5 minutes of creation and the smaller ones do not run into this RequestCanceled.
Editing comment: ~Might be related to: https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L822~
Looking at the readme (https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file), the timeout in question seems to be:
--timeout <duration>: Timeout of all calls to CSI driver. It should be set to value that accommodates majority of ControllerCreateVolume and ControllerDeleteVolume calls. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file#csi-error-and-timeout-handling) for details. 15 seconds is used by default.
which wouldn't be related here.
It seems more likely that the describe call is catching the context cancellation rather than being the cause itself. The context is probably being canceled on the CSI side, during either the csiClient.CreateVolume or Provision calls.
Potentially this? The 5 minutes would line up with your observations:
--retry-interval-max <duration>: Maximum retry interval of failed provisioning or deletion. Default value is 5 minutes. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file#csi-error-and-timeout-handling) for details.
We wanted to do a deeper dive into what the problem is and test when exactly the filesystem creation fails and when it succeeds. For context, this is what our account looks like: we currently have a quota of 300TB for Lustre FSx and have manually created a 110TB filesystem that is always present during our tests.
For our first test we created a 150TB volume:
The csi-provisioner logs Retrying syncing claim, CreateVolume: called shows up again, and it calls CreateFileSystem, which seems to check our quota usage; since we are using 150+110TB out of 300TB, it fails because the quota would be exceeded with "another" 150TB filesystem.
Next we tested with a smaller volume and created a 90TB volume (the first 4 steps are exactly the same): CreateVolume: called shows up again, but instead of failing it just logs Create volume request for pvc-1234 is already in progress. We think the quota check is successful here, since 90+110TB out of 300TB still leaves room for a 90TB filesystem.
From what we observed in our tests, after the csi provisioner timed out, the CreateVolume function is called again and, instead of checking whether a filesystem is already being created, it first checks whether the quota would be reached with a "new" filesystem.
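To make that concrete, here is a hypothetical sketch (not the driver's actual implementation; all names are made up) of the ordering we would expect, where the check for an in-progress filesystem happens before a new CreateFileSystem call and therefore before any quota evaluation:

```go
// Hypothetical sketch — not the driver's actual code; Cloud,
// FindFileSystemByVolumeName etc. are made up for illustration.
package sketch

import (
	"context"
	"errors"
)

// FileSystem is a minimal stand-in for the driver's filesystem struct.
type FileSystem struct {
	ID        string
	Lifecycle string // "CREATING", "AVAILABLE", ...
}

// Cloud is a stand-in for the driver's FSx client wrapper.
type Cloud interface {
	// FindFileSystemByVolumeName would look up an existing filesystem,
	// e.g. via the CSIVolumeName tag seen in the DescribeFileSystems output.
	FindFileSystemByVolumeName(ctx context.Context, name string) (*FileSystem, error)
	CreateFileSystem(ctx context.Context, name string, capacityGiB int64) (*FileSystem, error)
	WaitForFileSystemAvailable(ctx context.Context, id string) error
}

var ErrNotFound = errors.New("filesystem not found")

// createVolume checks for an in-flight filesystem *before* calling
// CreateFileSystem, so the quota only matters for genuinely new requests.
func createVolume(ctx context.Context, c Cloud, name string, capacityGiB int64) (*FileSystem, error) {
	// 1) Re-use a filesystem created by an earlier, timed-out attempt.
	fs, err := c.FindFileSystemByVolumeName(ctx, name)
	if err == nil {
		// Already exists (possibly still CREATING): wait for it, don't re-create.
		return fs, c.WaitForFileSystemAvailable(ctx, fs.ID)
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err
	}
	// 2) Nothing in flight: only now issue CreateFileSystem, which is where
	//    the service-quota check can fail.
	fs, err = c.CreateFileSystem(ctx, name, capacityGiB)
	if err != nil {
		return nil, err
	}
	return fs, c.WaitForFileSystemAvailable(ctx, fs.ID)
}
```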
In a next test, we tried to increase all configurable timeouts, mainly for the csi-provisioner (timeout and retry-interval-max), to 15 minutes, using a local build of our PR #383.
This allowed us to create the 150TB filesystem that previously failed, in about 13 minutes. It was only successful because none of the components involved had timeouts that would have triggered another quota check. We also had a test run where the currently active csi driver changed its leader, which led to another crash loop of the fsx-plugin.
In the end, increasing these timeouts is probably not necessary, as there is logic for detecting volumes that are already in progress, but this will cause problems if the quotas are set too tight.
A temporary fix for us now would be to increase the quota for Lustre FSx, and in the long term it would be helpful to change the logic of how an ongoing volume request is detected so that it happens before a new volume is created.
Note: It seems like the quota check is being triggered by a new create API call. Needs further investigation to confirm.
Hello, we are trying to create a StorageClass with the following conf:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-123
  securityGroupIds: sg-123
  autoImportPolicy: NONE
  s3ImportPath: s3://s3-bucket
  deploymentType: SCRATCH_2
  fileSystemTypeVersion: "2.15"
mountOptions:
  - flock
And it fails. If I try to create a FS with the same parameters via the UI, it gets created in no less than 6 minutes, so the 5-minute timeout is a bit problematic. Any ideas when this would be solved? Thanks 🙏
@JoseAlvarezSonos have you tried creating it using the dynamic provisioning mechanism? The 5 minute timeout shouldn't impact being able to provision a new filesystem unless an additional call would exceed the quota.
@jacobwolfaws we are using the Dynamic Provisioning with Data Repository example from this repo. That's what you mean by "dynamic provisioning mechanism" right?
yes
@jacobwolfaws in that case yes, but we see the exact same problem:
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin I0715 14:26:14.476895 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-123" status="CREATING"
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin I0715 14:26:44.447783 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-123" status="CREATING"
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner I0715 14:27:14.122762 1 controller.go:1082] Temporary error received, adding PVC 123456asdf to claims in progress
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner W0715 14:27:14.122810 1 controller.go:934] Retrying syncing claim "123456asdf", failure 5
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner E0715 14:27:14.122837 1 controller.go:957] error syncing claim "123456asdf": failed to provision volume with StorageClass "fsx-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner I0715 14:27:14.122866 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"123456asdf", APIVersion:"v1", ResourceVersion:"240051348", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "fsx-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin E0715 14:27:14.292413 1 driver.go:104] "GRPC error" err=<
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin rpc error: code = Internal desc = Filesystem is not ready: RequestCanceled: request context canceled
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin caused by: context canceled
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin >
BTW I obfuscated some values.
I recommend configuring the csi-provisioner to have a retry-interval-max of 15 minutes. I'm not sure if @jon-rei tested it with the driver as is, but it's worth a shot to see if it resolves your issues.
@jacobwolfaws I read that, but the Helm chart doesn't provide a way to configure that. The args, where I assume I would need to add this, are hardcoded and limited to this. Any recommendation on that front?
In the short term, you could use kustomize or maintain your own version of the helm chart? This is a gap we'll close before the next csi driver release
@jacobwolfaws several things were involved in @jon-rei's and my tests to make dynamic volume creation of sizes >80TiB possible through this driver, among them the hardcoded timeout in the WaitForFileSystemAvailable call, which will otherwise error out.
Have you been testing our PR, or could you debug the double creation call?
@jacobwolfaws I tried your suggestion but it didn't work. To try "an extreme" to see if that was the issue, I increased the timeout and retry-interval-max to 60m, but it seems that around 10m it fails, which is somehow consistent with the original problem statement from this issue. Any other suggestions?
args:
- --csi-address=$(ADDRESS)
- --v={{ .Values.sidecars.provisioner.logLevel }}
- --timeout=60m
- --retry-interval-max=60m
- --extra-create-metadata
- --leader-election=true
Logs:
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:10:43.104217 1 controller.go:1366] provision "dev/fsx-claim" class "fsx-sc": started
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:10:43.105202 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"1234asdf", APIVersion:"v1", ResourceVersion:"240151670", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dev/fsx-claim"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:11:14.844059 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:11:44.761550 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:12:14.762342 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:12:44.797206 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:13:14.756297 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:13:44.761706 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:14:14.833685 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:14:44.756280 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:15:14.771227 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:15:44.753475 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:16:14.763102 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:16:44.780487 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:17:14.751876 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:17:44.779109 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:18:14.754597 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:18:44.750220 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:19:14.745984 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:19:44.755963 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:14.751274 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:44.722817 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:44.932801 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin E0715 16:20:44.932894 1 driver.go:104] "GRPC error" err="rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition"
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:20:44.933487 1 controller.go:1075] Final error received, removing PVC 1234asdf from claims in progress
fsx-csi-controller-56b4dbc96-679rb csi-provisioner W0715 16:20:44.933529 1 controller.go:934] Retrying syncing claim "1234asdf", failure 0
fsx-csi-controller-56b4dbc96-679rb csi-provisioner E0715 16:20:44.933575 1 controller.go:957] error syncing claim "1234asdf": failed to provision volume with StorageClass "fsx-sc": rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:20:44.933619 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"1234asdf", APIVersion:"v1", ResourceVersion:"240151670", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "fsx-sc": rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition
Not an expert here, but the 10min timeout in WaitForFileSystemAvailable sounds pretty much like the issue that could be solved by trying out our forked version: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/pull/383
Can you try building that and testing whether it works?
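For context, the wait is essentially a poll loop against DescribeFileSystems with a fixed deadline; roughly sketched below (not the driver's exact code, and the wrapper interface is made up), with the timeout passed in as a parameter, which is what our fork changes:

```go
// Rough sketch of what a WaitForFileSystemAvailable-style helper boils down to —
// not the driver's exact code. The point of our PR is that pollTimeout becomes
// configurable instead of being a hardcoded constant.
package sketch

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// lifecycleGetter is a made-up interface wrapping DescribeFileSystemsWithContext.
type lifecycleGetter interface {
	getLifecycle(ctx context.Context, fileSystemID string) (string, error)
}

func waitForFileSystemAvailable(ctx context.Context, c lifecycleGetter, id string,
	pollInterval, pollTimeout time.Duration) error {
	return wait.PollImmediate(pollInterval, pollTimeout, func() (bool, error) {
		state, err := c.getLifecycle(ctx, id) // "CREATING", "AVAILABLE", ...
		if err != nil {
			return true, err // e.g. RequestCanceled once the parent ctx is canceled
		}
		switch state {
		case "AVAILABLE":
			return true, nil
		case "CREATING":
			return false, nil // keep polling every pollInterval until pollTimeout
		default:
			return true, fmt.Errorf("unexpected lifecycle state %q", state)
		}
	})
}
```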
Can you also collect the logs from the provisioner and fsx-plugin containers?
kubectl logs fsx-csi-controller-${CONTROLLER_SUFFIX} -n $NAMESPACE fsx-plugin --v=5 >> fsx-csi-controller-pod-plugin-${CONTROLLER_SUFFIX}.logs
kubectl logs fsx-csi-controller-${CONTROLLER_SUFFIX} -n $NAMESPACE csi-provisioner --v=5 >> fsx-csi-controller-pod-provisioner-${CONTROLLER_SUFFIX}.logs
It would also help to increase the log level to 5 for those containers in the values.yaml file of your chart.
@jonded94 I'll try and let you know, thanks
@jacobwolfaws I enabled more logging, but I have tons of leaderelection messages, which are not that useful or interesting, and the only other notable messages are these:
fsx-csi-controller-5656c55766-thkw6 csi-provisioner I0716 09:12:45.091618 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"9e02b8ae-8ad8-4605-acbd-aac4a9708996", APIVersion:"v1", ResourceVersion:"240869156", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dev/fsx-claim"
fsx-csi-controller-5656c55766-thkw6 csi-provisioner I0716 09:12:45.091524 1 connection.go:245] GRPC request: {"capacity_range":{"required_bytes":1288490188800},"name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","parameters":{"autoImportPolicy":"NONE","csi.storage.k8s.io/pv/name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","csi.storage.k8s.io/pvc/name":"fsx-claim","csi.storage.k8s.io/pvc/namespace":"dev","deploymentType":"SCRATCH_2","extraTags":"service=TAG1,project=TAG2","fileSystemTypeVersion":"2.15","s3ExportPath":"s3://s3-bucket/export","s3ImportPath":"s3://s3-bucket","securityGroupIds":"sg-1234","subnetId":"subnet-1234"},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["flock"]}},"access_mode":{"mode":5}}]}
fsx-csi-controller-5656c55766-thkw6 fsx-plugin I0716 09:12:45.092512 1 controller.go:109] "CreateVolume: called" args={"name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","capacity_range":{"required_bytes":1288490188800},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["flock"]}},"access_mode":{"mode":5}}],"parameters":{"autoImportPolicy":"NONE","csi.storage.k8s.io/pv/name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","csi.storage.k8s.io/pvc/name":"fsx-claim","csi.storage.k8s.io/pvc/namespace":"dev","deploymentType":"SCRATCH_2","extraTags":"service=TAG1,project=TAG2","fileSystemTypeVersion":"2.15","s3ExportPath":"s3://s3-bucket/export","s3ImportPath":"s3://s3-bucket","securityGroupIds":"sg-1234","subnetId":"subnet-1234"}}
It's worth mentioning that I tried adding an exportPath to see if it helps, but it didn't. I also tried with a smaller S3 bucket (less than 5MB total size), and it worked in around 5-10min, but the "important" bucket is around 200TB and in production it will be around twice that size. Nonetheless, if I try to create the FS manually, it works, so maybe there's a parameter missing to avoid copying the whole content of the bucket, or populating the metadata, or whatever it is doing that takes that long.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@JoseAlvarezSonos did increasing the timeout with a custom image help with your issue, as @jonded94 suggested?
@jonded94
Have you been testing our PR, or could you debug the double creation call?
It seems that while the CreateVolume call is idempotent, the CreateFilesystem call is not (after the first CreateVolume reaches its timeout). We recently had an issue with there being orphaned filesystems in the k8s testing account:
https://github.com/kubernetes-sigs/aws-fsx-csi-driver/pull/389 https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/390
and I'm inclined to believe the issues are related. I believe fixing the CreateFilesystem issue / having CreateVolume check for a filesystem already being created should also fix the problems you've been seeing
Is your feature request related to a problem? Please describe.
We are starting to use Scratch 2 FSx FileSystems and are doing so successfully for small ones. But when we try to create FSx volumes >50TB we always run into timeouts:
It seems that the timeout is always 5 minutes, but creating a 72TB volume for example takes ~12 minutes.
Describe the solution you'd like in detail
It would be great if this timeout could be increased or made configurable since it's hardcoded here.
Describe alternatives you've considered
Right now we are trying to create the filesystems manually and then create a PV + PVC. But we would really like to create the volumes dynamically for the pipelines we want to run.