vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

velero-plugin-for-aws v1.9.x no longer works with S3-compatible BackupStorageLocation #7828

Open losil opened 6 months ago

losil commented 6 months ago

What steps did you take and what happened: We updated our Velero deployment with the latest Helm chart, 6.4.0, which installs Velero 1.13.2. With this upgrade, the velero-plugin-for-aws plugin was also updated to v1.9.0, and later to v1.9.2 during troubleshooting. The upgrade itself went through smoothly. The BackupStorageLocation, which is an S3-compatible NetApp StorageGRID backend, was also in the Available state after Velero was initialized. We then tested some backups, all of which were unsuccessful and ended in the Failed state. We noticed that during the backup run the BackupStorageLocation went to Unavailable with the corresponding log message:

BackupStorageLocation "netapp-s3" is unavailable: rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: 1716281754869367, HostID: 12783833, api error AccessDenied: V4 authentication signed header not found: accept-encoding

The configuration of the BackupStorageLocation looks like this; as mentioned, it is an S3-compatible NetApp StorageGRID system:

configuration:
  backupStorageLocation:
    - name: netapp-s3
      provider: aws
      bucket: mycluster
      prefix: velero
      default: true
      accessMode: ReadWrite
      credential:
        name: velero-s3-credentials
        key: cloud
      config:
        region: myregion
        s3ForcePathStyle: true
        s3Url: https://objectstore.localdomain.local:10443/
        signatureVersion: "4"

After the backup run ended, Velero marked the BackupStorageLocation as Available again during its regular validation schedule.

Downgrading velero-plugin-for-aws to v1.8.2 solves the issue, and the backups are successful again.

What did you expect to happen:

We expect the same behavior when using the current version of the velero-plugin-for-aws initContainer. Velero should be able to use the S3-compatible backend provided by NetApp StorageGRID.

The following information will help us better understand what's going on:

bundle-2024-05-27-09-20-52.tar.gz

blackpiglet commented 6 months ago

I didn't find useful information in the bundle other than V4 authentication signed header not found: accept-encoding.

This should be related to the AWS plugin's SDK version being bumped to v2 in v1.9.x. Per my understanding, the AWS SDK v2 uses Signature Version 4 by default, so signatureVersion was removed from the AWS plugin configuration.

After the SDK version bump, we have already seen some errors caused by S3-compatible backends that do not fully conform to the S3 spec. I'm not sure whether this one is caused by NetApp StorageGRID being inconsistent with S3.

losil commented 6 months ago

After checking the official documentation it seems that they did not implement/support the Accept-Encoding header.

@blackpiglet do you see any problems when using velero 1.13.x in combination with velero-plugin-for-aws 1.8.x?

blackpiglet commented 6 months ago

I haven't tried that, but if your scenario doesn't require the new parameters (tagging and checksumAlgorithm) added in release 1.9, then it should work.

scaleoutsean commented 5 months ago

After checking the official documentation it seems that they did not implement/support the Accept-Encoding header.

I think that doesn't mean it must. From [3]:

As long as the identity;q=0 or *;q=0 directives do not explicitly forbid the identity value that means no encoding, the server must never return a 406 Not Acceptable error.

Also from [3], on why the preferred Accept-Encoding may not be acceptable to the server:

Two common cases lead to this: The data to be sent is already compressed. The server is overloaded and cannot allocate computing resources.

It seems the server should respond rather than silently handle it. Does it return 406 Not Acceptable [1], 415 [2], or the content itself in response? If not, then I'd create an issue with the S3 vendor. If yes, then the Velero client shouldn't fail.

IMO the Velero client may prefer whatever it likes, but it should accept any encoding (*) with a preference value > 0 (*;q=0.001) to cater to the common cases above. Is "any" (the original representation) accepted by Velero now, i.e. is the qvalue weighting for * > 0?

I don't know if that would help if the server doesn't handle that header (in which case it may be responding with the original content, which Velero perhaps does not accept). In that case Velero could still work around it by trying whichever way works (with or without a specific encoding), but arguably the S3 vendor should fix their code to handle the header better, and Velero should accept the original representation (if it currently does not), as mentioned in [3].

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/406
[2] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/415
[3] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
[4] https://datatracker.ietf.org/doc/html/rfc7231#section-5.3.4

reasonerjt commented 3 months ago

After reading the comments, it seems to me the gap is that the NetApp S3 service does not work with aws-sdk-v2.

@losil You may tweak the code and see whether it works with some parameter changes when calling the SDK, but I don't think we can make sure the plugin works with EVERY storage system that declares itself S3-compatible but may in fact differ from AWS S3 in the details.

kaovilai commented 1 month ago

duped by https://github.com/vmware-tanzu/velero/issues/8152

Long term, we should add another plugin that uses an SDK with the ability to ignore the accept-encoding header, like https://github.com/minio/minio-go/blob/99336902dd57f3760e272caf6550e6791eabe0af/pkg/signer/request-signature-v4.go#L60

kaovilai commented 1 month ago

doc'ing in https://github.com/vmware-tanzu/velero-plugin-for-aws/pull/219