jon-rei opened 4 months ago
I wrote a small script to check the creation time of the filesystems and found that as they get bigger they take longer to create. I did my tests in eu-west-1 and I'm not sure if this is something specific to that region.
- FSx for Lustre file system with 4800 GiB was available after 6 minutes 44 seconds
- FSx for Lustre file system with 9600 GiB was available after 9 minutes 6 seconds
- FSx for Lustre file system with 24000 GiB was available after 8 minutes 32 seconds
- FSx for Lustre file system with 50400 GiB was available after 9 minutes 5 seconds
- FSx for Lustre file system with 100800 GiB was available after 11 minutes 19 seconds
So I think it would be great if the timeout could either grow dynamically with storage capacity or be available for manual configuration.
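For reference, a rough sketch of how such a timing check could look (this is not the original script; it assumes the AWS SDK for Go v1, credentials and region already configured, and takes an existing filesystem ID as its argument):

```go
// Not the original script — a rough equivalent: poll DescribeFileSystems for a
// just-created filesystem and print how long it takes to reach AVAILABLE.
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
	// Filesystem ID captured right after calling CreateFileSystem, e.g. "fs-1234".
	id := os.Args[1]
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-1")))
	svc := fsx.New(sess)

	start := time.Now()
	for {
		out, err := svc.DescribeFileSystems(&fsx.DescribeFileSystemsInput{
			FileSystemIds: []*string{aws.String(id)},
		})
		if err != nil {
			log.Fatal(err)
		}
		if aws.StringValue(out.FileSystems[0].Lifecycle) == "AVAILABLE" {
			fmt.Printf("FSx for Lustre file system %s was available after %s\n",
				id, time.Since(start).Round(time.Second))
			return
		}
		time.Sleep(30 * time.Second) // still CREATING, poll again
	}
}
```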
@jonded94 and I dug a little deeper to figure out where the RequestCanceled (above) was coming from and found that it's actually happening on the DescribeFileSystemsWithContext call, probably around this line.
From our logs, which we enhanced a bit, we saw that the request that ends in a RequestCanceled takes just under 1 millisecond, while the successful requests take ~200ms. The successful query happens about 9 times in one of our tests, and the 10th time it fails.
Faulty request:
{"ts":1720456920534.6946,"caller":"cloud/cloud.go:392","msg":"DescribeFileSystemsWithContext","v":2,"filesystem":"fs-1234"}
{"ts":1720456920534.7668,"caller":"cloud/cloud.go:394","msg":"DescribeFileSystemsWithContextDone","v":2,"filesystem":"fs-1234","output":"{\n\n}","err":"RequestCanceled: request context canceled\ncaused by: context canceled"}
Working request:
{"ts":1720456860534.6147,"caller":"cloud/cloud.go:392","msg":"DescribeFileSystemsWithContext","v":2,"filesystem":"fs-1234"}
{"ts":1720456860741.3599,"caller":"cloud/cloud.go:394","msg":"DescribeFileSystemsWithContextDone","v":2,"filesystem":"fs-1234","output":"{\n FileSystems: [{\n CreationTime: 2024-07-08 16:37:00.357 +0000 UTC,\n DNSName: \"fs-051ce4079e5c77f58.fsx.eu-west-1.amazonaws.com\",\n FileSystemId: \"fs-1234\",\n FileSystemType: \"LUSTRE\",\n FileSystemTypeVersion: \"2.15\",\n Lifecycle: \"CREATING\",\n LustreConfiguration: {\n CopyTagsToBackups: false,\n DataCompressionType: \"NONE\",\n DeploymentType: \"SCRATCH_2\",\n LogConfiguration: {\n Level: \"DISABLED\"\n },\n MountName: \"wcsrlbev\",\n WeeklyMaintenanceStartTime: \"2:02:00\"\n },\n NetworkInterfaceIds: [\n \"...a lot of enis ~60..."\n ],\n OwnerId: \"1234\",\n ResourceARN: \"arn:aws:fsx:eu-west-1:1234:file-system/fs-1234\",\n StorageCapacity: 124800,\n StorageType: \"SSD\",\n SubnetIds: [\"subnet-1234\"],\n Tags: [{\n Key: \"CSIVolumeName\",\n Value: \"pvc-1234\"\n },{\n Key: \"team\",\n Value: \"abc\"\n }],\n VpcId: \"vpc-1234\"\n }]\n}","err":null}
We also found that we only get these errors for filesystems larger than 100TB. We have successfully created several <100TB FSx volumes. It seems that the large volumes always fail after 5 minutes of creation and the smaller ones do not run into this RequestCanceled.
Editing comment: ~Might be related to: https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L822~
Looking at the readme (https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file), the timeout in question seems to be:
--timeout <duration>: Timeout of all calls to CSI driver. It should be set to value that accommodates majority of ControllerCreateVolume and ControllerDeleteVolume calls. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file#csi-error-and-timeout-handling) for details. 15 seconds is used by default.
which wouldn't be related here.
It seems more likely that the describe call is catching the context cancellation rather than being the cause itself. The context is probably being canceled on the CSI side, during either the csiClient.CreateVolume or Provision calls.
Potentially this? The 5 minutes would line up with your observations:
--retry-interval-max <duration>: Maximum retry interval of failed provisioning or deletion. Default value is 5 minutes. See [CSI error and timeout handling](https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file#csi-error-and-timeout-handling) for details.
We wanted to do a deeper dive into what the problem is and test when exactly the filesystem creation fails and when it succeeds. For context, this is what our account looks like: we currently have a quota of 300TB for Lustre FSx and have manually created a 110TB filesystem that is always present during our tests.
For our first test we created a 150TB volume:
The csi-provisioner logs Retrying syncing claim, CreateVolume: called shows up again, and it calls CreateFileSystem, which seems to check our quota usage; since we are using 150+110TB out of 300TB, it fails because the quota would be exceeded with "another" 150TB filesystem.
Next we tested with a smaller volume and created a 90TB volume (the first 4 steps are exactly the same): CreateVolume: called shows up again, but instead of failing it just logs Create volume request for pvc-1234 is already in progress. We think the quota check is successful here, since 90+110TB out of 300TB still leaves room for a 90TB filesystem.
From what we observed in our tests, after the csi provisioner timed out, the CreateVolume function is called again and, instead of checking whether a filesystem is already being created, it first checks whether the quota would be reached with a "new" filesystem.
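To make that concrete, here is a hypothetical sketch (not the driver's actual implementation; all names are made up) of the ordering we would expect, where the check for an in-progress filesystem happens before a new CreateFileSystem call and therefore before any quota evaluation:

```go
// Hypothetical sketch — not the driver's actual code; Cloud,
// FindFileSystemByVolumeName etc. are made up for illustration.
package sketch

import (
	"context"
	"errors"
)

// FileSystem is a minimal stand-in for the driver's filesystem struct.
type FileSystem struct {
	ID        string
	Lifecycle string // "CREATING", "AVAILABLE", ...
}

// Cloud is a stand-in for the driver's FSx client wrapper.
type Cloud interface {
	// FindFileSystemByVolumeName would look up an existing filesystem,
	// e.g. via the CSIVolumeName tag seen in the DescribeFileSystems output.
	FindFileSystemByVolumeName(ctx context.Context, name string) (*FileSystem, error)
	CreateFileSystem(ctx context.Context, name string, capacityGiB int64) (*FileSystem, error)
	WaitForFileSystemAvailable(ctx context.Context, id string) error
}

var ErrNotFound = errors.New("filesystem not found")

// createVolume checks for an in-flight filesystem *before* calling
// CreateFileSystem, so the quota only matters for genuinely new requests.
func createVolume(ctx context.Context, c Cloud, name string, capacityGiB int64) (*FileSystem, error) {
	// 1) Re-use a filesystem created by an earlier, timed-out attempt.
	fs, err := c.FindFileSystemByVolumeName(ctx, name)
	if err == nil {
		// Already exists (possibly still CREATING): wait for it, don't re-create.
		return fs, c.WaitForFileSystemAvailable(ctx, fs.ID)
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err
	}
	// 2) Nothing in flight: only now issue CreateFileSystem, which is where
	//    the service-quota check can fail.
	fs, err = c.CreateFileSystem(ctx, name, capacityGiB)
	if err != nil {
		return nil, err
	}
	return fs, c.WaitForFileSystemAvailable(ctx, fs.ID)
}
```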
In a next test, we tried to increase all configurable timeouts, mainly for the csi-provisioner (timeout and retry-interval-max), to 15 minutes, using a local build of our PR #383.
This allowed us to create the 150TB filesystem that previously failed, in about 13 minutes. It was only successful because none of the components involved had timeouts that would have triggered another quota check. We also had a test run where the currently active csi driver changed its leader, which led to another crash loop of the fsx-plugin.
In the end, increasing these timeouts is probably not necessary, as there is logic for detecting volumes that are already in progress, but this will cause problems if the quotas are set too tight.
A temporary fix for us now would be to increase the quota for Lustre FSx, and in the long term it would be helpful to change the logic of how an ongoing volume request is detected so that it happens before a new volume is created.
Note: It seems like the quota check is being triggered by a new create API call. Needs further investigation to confirm.
Hello, we are trying to create a StorageClass with the following conf:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-123
  securityGroupIds: sg-123
  autoImportPolicy: NONE
  s3ImportPath: s3://s3-bucket
  deploymentType: SCRATCH_2
  fileSystemTypeVersion: "2.15"
mountOptions:
  - flock
And it fails. If I try to create a FS with the same parameters via the UI, it gets created in no less than 6 minutes, so the 5-minute timeout is a bit problematic. Any ideas when this would be solved? Thanks 🙏
@JoseAlvarezSonos have you tried creating it using the dynamic provisioning mechanism? The 5 minute timeout shouldn't impact being able to provision a new filesystem unless an additional call would exceed the quota.
@jacobwolfaws we are using the Dynamic Provisioning with Data Repository example from this repo. That's what you mean by "dynamic provisioning mechanism" right?
yes
@jacobwolfaws in that case yes, but we see the exact same problem:
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin I0715 14:26:14.476895 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-123" status="CREATING"
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin I0715 14:26:44.447783 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-123" status="CREATING"
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner I0715 14:27:14.122762 1 controller.go:1082] Temporary error received, adding PVC 123456asdf to claims in progress
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner W0715 14:27:14.122810 1 controller.go:934] Retrying syncing claim "123456asdf", failure 5
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner E0715 14:27:14.122837 1 controller.go:957] error syncing claim "123456asdf": failed to provision volume with StorageClass "fsx-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
fsx-csi-controller-7c7d54f5f6-th5wh csi-provisioner I0715 14:27:14.122866 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"123456asdf", APIVersion:"v1", ResourceVersion:"240051348", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "fsx-sc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin E0715 14:27:14.292413 1 driver.go:104] "GRPC error" err=<
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin rpc error: code = Internal desc = Filesystem is not ready: RequestCanceled: request context canceled
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin caused by: context canceled
fsx-csi-controller-7c7d54f5f6-th5wh fsx-plugin >
BTW I obfuscated some values.
I recommend configuring the csi-provisioner to have a retry-interval-max of 15 minutes. I'm not sure if @jon-rei tested it with the driver as is, but it's worth a shot to see if it resolves your issues.
@jacobwolfaws I read that, but the Helm chart doesn't provide a way to configure that. The args, where I assume I would need to add this, are hardcoded and limited to this. Any recommendation on that front?
In the short term, you could use kustomize or maintain your own version of the helm chart? This is a gap we'll close before the next csi driver release
@jacobwolfaws several things were involved in @jon-rei's and my tests to make dynamic volume creation of sizes >80TiB possible through this driver, among them the hardcoded timeout in the WaitForFileSystemAvailable call, which will otherwise error out.
Have you been testing our PR, or could you debug the double creation call?
@jacobwolfaws I tried your suggestion but it didn't work. To try "an extreme" to see if that was the issue, I increased the timeout and retry-interval-max to 60m, but it seems that around 10m it fails, which is somehow consistent with the original problem statement from this issue. Any other suggestions?
args:
- --csi-address=$(ADDRESS)
- --v={{ .Values.sidecars.provisioner.logLevel }}
- --timeout=60m
- --retry-interval-max=60m
- --extra-create-metadata
- --leader-election=true
Logs:
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:10:43.104217 1 controller.go:1366] provision "dev/fsx-claim" class "fsx-sc": started
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:10:43.105202 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"1234asdf", APIVersion:"v1", ResourceVersion:"240151670", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dev/fsx-claim"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:11:14.844059 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:11:44.761550 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:12:14.762342 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:12:44.797206 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:13:14.756297 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:13:44.761706 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:14:14.833685 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:14:44.756280 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:15:14.771227 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:15:44.753475 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:16:14.763102 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:16:44.780487 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:17:14.751876 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:17:44.779109 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:18:14.754597 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:18:44.750220 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:19:14.745984 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:19:44.755963 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:14.751274 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:44.722817 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin I0715 16:20:44.932801 1 cloud.go:338] "WaitForFileSystemAvailable" filesystem="fs-1234" status="CREATING"
fsx-csi-controller-56b4dbc96-679rb fsx-plugin E0715 16:20:44.932894 1 driver.go:104] "GRPC error" err="rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition"
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:20:44.933487 1 controller.go:1075] Final error received, removing PVC 1234asdf from claims in progress
fsx-csi-controller-56b4dbc96-679rb csi-provisioner W0715 16:20:44.933529 1 controller.go:934] Retrying syncing claim "1234asdf", failure 0
fsx-csi-controller-56b4dbc96-679rb csi-provisioner E0715 16:20:44.933575 1 controller.go:957] error syncing claim "1234asdf": failed to provision volume with StorageClass "fsx-sc": rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition
fsx-csi-controller-56b4dbc96-679rb csi-provisioner I0715 16:20:44.933619 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"1234asdf", APIVersion:"v1", ResourceVersion:"240151670", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "fsx-sc": rpc error: code = Internal desc = Filesystem is not ready: timed out waiting for the condition
Not an expert here, but the 10min timeout in WaitForFileSystemAvailable sounds pretty much like the issue that could be solved by trying out our forked version: https://github.com/kubernetes-sigs/aws-fsx-csi-driver/pull/383
Can you try building that and testing whether it works?
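For context, the wait is essentially a poll loop against DescribeFileSystems with a fixed deadline; roughly sketched below (not the driver's exact code, and the wrapper interface is made up), with the timeout passed in as a parameter, which is what our fork changes:

```go
// Rough sketch of what a WaitForFileSystemAvailable-style helper boils down to —
// not the driver's exact code. The point of our PR is that pollTimeout becomes
// configurable instead of being a hardcoded constant.
package sketch

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// lifecycleGetter is a made-up interface wrapping DescribeFileSystemsWithContext.
type lifecycleGetter interface {
	getLifecycle(ctx context.Context, fileSystemID string) (string, error)
}

func waitForFileSystemAvailable(ctx context.Context, c lifecycleGetter, id string,
	pollInterval, pollTimeout time.Duration) error {
	return wait.PollImmediate(pollInterval, pollTimeout, func() (bool, error) {
		state, err := c.getLifecycle(ctx, id) // "CREATING", "AVAILABLE", ...
		if err != nil {
			return true, err // e.g. RequestCanceled once the parent ctx is canceled
		}
		switch state {
		case "AVAILABLE":
			return true, nil
		case "CREATING":
			return false, nil // keep polling every pollInterval until pollTimeout
		default:
			return true, fmt.Errorf("unexpected lifecycle state %q", state)
		}
	})
}
```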
Can you also collect the logs from the provisioner and fsx-plugin containers?
kubectl logs fsx-csi-controller-${CONTROLLER_SUFFIX} -n $NAMESPACE fsx-plugin --v=5 >> fsx-csi-controller-pod-plugin-${CONTROLLER_SUFFIX}.logs
kubectl logs fsx-csi-controller-${CONTROLLER_SUFFIX} -n $NAMESPACE csi-provisioner --v=5 >> fsx-csi-controller-pod-provisioner-${CONTROLLER_SUFFIX}.logs
It would also help to increase the log level to 5 for those containers in the values.yaml file of your chart.
@jonded94 I'll try and let you know, thanks
@jacobwolfaws I enabled more logging, but I have tons of leaderelection messages, which are not that useful or interesting, and the only other notable messages are these:
fsx-csi-controller-5656c55766-thkw6 csi-provisioner I0716 09:12:45.091618 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dev", Name:"fsx-claim", UID:"9e02b8ae-8ad8-4605-acbd-aac4a9708996", APIVersion:"v1", ResourceVersion:"240869156", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dev/fsx-claim"
fsx-csi-controller-5656c55766-thkw6 csi-provisioner I0716 09:12:45.091524 1 connection.go:245] GRPC request: {"capacity_range":{"required_bytes":1288490188800},"name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","parameters":{"autoImportPolicy":"NONE","csi.storage.k8s.io/pv/name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","csi.storage.k8s.io/pvc/name":"fsx-claim","csi.storage.k8s.io/pvc/namespace":"dev","deploymentType":"SCRATCH_2","extraTags":"service=TAG1,project=TAG2","fileSystemTypeVersion":"2.15","s3ExportPath":"s3://s3-bucket/export","s3ImportPath":"s3://s3-bucket","securityGroupIds":"sg-1234","subnetId":"subnet-1234"},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["flock"]}},"access_mode":{"mode":5}}]}
fsx-csi-controller-5656c55766-thkw6 fsx-plugin I0716 09:12:45.092512 1 controller.go:109] "CreateVolume: called" args={"name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","capacity_range":{"required_bytes":1288490188800},"volume_capabilities":[{"AccessType":{"Mount":{"mount_flags":["flock"]}},"access_mode":{"mode":5}}],"parameters":{"autoImportPolicy":"NONE","csi.storage.k8s.io/pv/name":"pvc-9e02b8ae-8ad8-4605-acbd-aac4a9708996","csi.storage.k8s.io/pvc/name":"fsx-claim","csi.storage.k8s.io/pvc/namespace":"dev","deploymentType":"SCRATCH_2","extraTags":"service=TAG1,project=TAG2","fileSystemTypeVersion":"2.15","s3ExportPath":"s3://s3-bucket/export","s3ImportPath":"s3://s3-bucket","securityGroupIds":"sg-1234","subnetId":"subnet-1234"}}
It's worth mentioning that I tried adding an exportPath to see if it helps, but it didn't. I also tried with a smaller S3 bucket (less than 5MB total size), and it worked in around 5-10min, but the "important" bucket is around 200TB and in production it will be around twice that size. Nonetheless, if I try to create the FS manually, it works, so maybe there's a parameter missing to avoid copying the whole content of the bucket, or populating the metadata, or whatever it is doing that takes that long.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@JoseAlvarezSonos did increasing the timeout with a custom image help with your issue, as @jonded94 suggested?
@jonded94
Have you been testing our PR, or could you debug the double creation call?
It seems that while the CreateVolume call is idempotent, the CreateFilesystem call is not (after the first CreateVolume reaches its timeout). We recently had an issue with there being orphaned filesystems in the k8s testing account:
https://github.com/kubernetes-sigs/aws-fsx-csi-driver/pull/389 https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/390
and I'm inclined to believe the issues are related. I believe fixing the CreateFilesystem issue / having CreateVolume check for a filesystem already being created should also fix the problems you've been seeing
Is your feature request related to a problem? Please describe.
We are starting to use Scratch 2 FSx FileSystems and are doing so successfully for small ones. But when we try to create FSx volumes >50TB we always run into timeouts:
It seems that the timeout is always 5 minutes, but creating a 72TB volume for example takes ~12 minutes.
Describe the solution you'd like in detail
It would be great if this timeout could be increased or made configurable since it's hardcoded here.
Describe alternatives you've considered
Right now we are trying to create the filesystems manually and then create a PV + PVC. But we would really like to create the volumes dynamically for the pipelines we want to run.