Originally posted by **Wayne-H-Ha** March 19, 2024
We used to be able to create backups using velero 1.12.2 and aws plugin 1.8.2.
We tried velero 1.13.0 and plugin 1.9.0 and it failed, so we switched back to the older versions.
We tried again with velero 1.13.1 and plugin 1.9.1 and it still fails. Is there any configuration change we need to make in order to use the new version?
We tried to find the backup in S3 and it was not uploaded there.
When we describe the backup, it returns:
```
velero-v1.13.1-linux-amd64/velero describe backup cp-20240319163110 | tail
Started: 2024-03-19 16:32:01 +0000 UTC
Completed:
Expiration: 2024-04-18 16:32:01 +0000 UTC
Total items to be backed up: 2871
Items backed up: 2871
Backup Volumes:
```
We believe the problem is that an "@aws" suffix is added to the key ID. For example, aws_access_key_id = "3..0", but "3..0@aws" is passed to S3. Is there a configuration we can use to prevent this suffix from being added?
```
cat /credentials/cloud
[default]
aws_access_key_id = "3..0"
aws_secret_access_key = "a..b"
```
Discussed in https://github.com/vmware-tanzu/velero/discussions/7542
It looks like you may have the wrong CRDs installed. BackupVolumeInfos was a new value added to `spec.target.kind` for DownloadRequest in 1.13. If you're trying to run Velero 1.13 but have Velero 1.12 CRDs installed, that would explain the error.

Thanks for the quick response. I found in the doc I can run:
So I ran the above using velero 1.12.2 and 1.13.1 and, as you said, I found BackupVolumeInfos in the output produced by 1.13.1:
My next question is: how do I update the CRDs from 1.12.2 to 1.13.1?
Here is one doc that you could reference
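(For reference, the Velero upgrade docs update the CRDs by regenerating them from the new CLI and applying them to the cluster; a sketch, assuming the v1.13.x `velero` binary is on your PATH:)

```bash
# Regenerate the CRDs with the new CLI and apply them to the cluster
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
```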
Thanks for the link to the doc. I have run the following:
But the backup still fails:
Maybe it is still adding the "@aws" suffix to the key ID?
Not sure about the @aws suffix. IMO, there is no need to add that. Could you post the error information of the failed backup?
Thanks for looking into this problem. Here is the error I found for the failed backup:
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero info time="2024-03-20T02:30:53Z" level=info msg="Setting up backup store to persist the backup" backup=velero/cp-20240320023032 logSource="pkg/controller/backup_controller.go:729"
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero error time="2024-03-20T02:30:53Z" level=error msg="Error uploading log file" backup=cp-20240320023032 bucket=codeengine-cp-dev-relint error="rpc error: code = Unknown desc = error putting object dev-relint-controlplane/backups/cp-20240320023032/cp-20240320023032-logs.gz: operation error S3: PutObject, https response error StatusCode: 403, RequestID: b54ad6b1-c6a4-443f-9e99-be04b978a9bf, HostID: , api error AccessDenied: Access Denied" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:253" error.function="main.(*ObjectStore).PutObject" logSource="pkg/persistence/object_store.go:252" prefix=dev-relint-controlplane
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero error time="2024-03-20T02:30:53Z" level=error msg="backup failed" backuprequest=velero/cp-20240320023032 controller=backup error="rpc error: code = Unknown desc = error putting object dev-relint-controlplane/backups/cp-20240320023032/velero-backup.json: operation error S3: PutObject, https response error StatusCode: 403, RequestID: 74bbf24c-dc0a-43c2-8b92-3796929fd421, HostID: , api error AccessDenied: Access Denied" logSource="pkg/controller/backup_controller.go:288"
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero info time="2024-03-20T02:30:53Z" level=info msg="Updating backup's final status" backuprequest=velero/cp-20240320023032 controller=backup logSource="pkg/controller/backup_controller.go:307"
The s3 support recommended:
We usually see a suffix of @aws in the access_key_id of HMAC access when the s3 signature/presigned URL is not correct. We suggest engaging Velero support to investigate if they have a behavior change on the s3 signature/presigned in their new version.
What is the backup repository's backend? Is it AWS S3 or an on-premises object storage service?
The S3 backend is IBM Cloud Object Storage that behaves like AWS S3.
Hmm. Looks like you may have the wrong bucket permissions for your s3 bucket. See the bucket policies section at https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/README.md and compare with what you have.
Thanks for the link to the documentation. As I mentioned earlier, velero 1.12.2 and aws plugin 1.8.2 backups used to work for us, so I'm not sure why it stopped working when we upgraded to 1.13.0/1.9.0 or 1.13.1/1.9.1. Here is the velero install command we have used for many versions of velero, including 1.11 and earlier:
I can't think of any changes we've made to the way we handle uploads that would trigger new permission requirements between 1.12 and 1.13, although maybe there's something I'm not aware of. It may be worth creating a new bucket and making sure it has the recommended bucket policy in place to see whether this works, which will eliminate the possibility that something changed in the bucket itself.
We tried the following combinations:
| Velero | aws plugin | Result |
| ------ | ---------- | ------ |
| 1.12.2 | 1.8.2      | works  |
| 1.12.2 | 1.9.1      | fails  |
| 1.13.1 | 1.8.2      | works  |
| 1.13.1 | 1.9.1      | fails  |
So we suspect aws plugin 1.9.1 is adding "@aws" to the end of the key ID, causing velero to fail to upload backups to IBM Cloud Object Storage?
The issue may relate to the AWS SDK version bump in the Velero AWS plugin v1.9. Could you give more information about your suspected @aws suffix? Did you see it in the secret, the pod, or the Velero log?
I contacted IBM Cloud Object Storage and they said they found the following in their log (note the "@aws" suffix at the end of remote_user):
We have the same issue as described here and we are using official Amazon S3. Let me know if you need any logs
IMO, this "@aws" may not be an issue. The 403 error code implies permission denied. Is there any possibility that the permissions for the Velero role are insufficient?
As I mentioned previously, we have tried the newest velero 1.13.1 with the newest plugin 1.9.1 and it failed, but if we switch to the older plugin 1.8.2 it works. In both cases, we have the same permissions.
@Wayne-H-Ha
Since aws-plugin v1.9.x, we've switched to aws-sdk-go-v2, so there might be a compatibility issue; some change in sdk-v2 makes IBM Object Storage think @aws was added. Is it possible to check with IBM and let them explain how the `remote_user` was extracted?

I may look into the code, but I can't commit to a fix because the plugin currently works with AWS S3 and S3-compatible storage (MinIO) in our pipeline.
@reasonerjt Yes, I will report your findings to IBM Cloud Object Storage. But please also be informed that @Alwinius said he also has the problem with Amazon S3.
I also experienced the problem in IBM Cloud with aws plugin v1.9.1.
@reasonerjt IBM Cloud Object Storage team replied:
The expected remote user should be the HMAC access key ID without the trailing @aws.
For example, in "3f3dad27c65d41b4835b8a3be6d91cb0@aws", "3f3dad27c65d41b4835b8a3be6d91cb0" is the expected access key ID.
@Wayne-H-Ha So if @aws is not in the credentials file, you will need to check with IBM where it comes from; they will need to check their code to find out. I briefly checked the SDK and didn't find it adding the suffix.

@reasonerjt I just got the reply from IBM Cloud Object Storage (COS). I hope you understand the reply, as I don't have enough knowledge to digest the information.
COS internally managed to capture debug-logged requests for the HTTP 403 PUTs. Specifically, the AWS signature does not match what they expect, so they stop processing the request any further.
Request_id 1) 0ed2fc0b-acf8-4d05-b003-dd5a1bf1b072:
2024-04-02 03:30:32.330 DEBUG [etp466364426-20827] {s3.auth:56ac6033-f67f-4ba2-a2e0-b7b65350824d} org.cleversafe.s3.auth.AwsAuthenticator - Invalid AWS V4 Chunked Headers: Incorrect value for Content Hash on Chunked Put Request
in the other:
Request_id 2) 5982df29-85a9-4492-9573-54aaba4b484e:
2024-04-02 03:30:32.319 DEBUG [etp579017959-19571] {s3.auth:a4493674-ffdc-48c7-920c-2133c490c197} org.cleversafe.s3.auth.AwsAuthenticator - Invalid AWS V4 Chunked Headers: Incorrect value for Content Hash on Chunked Put Request
Checking COS logs, they can see all HTTP 403 for PUT were for user_agent "aws-sdk-go-v2/1.21.0 os/linux lang/go#1.21.6 md/GOOS#linux md/GOARCH#amd64 api/s3#1.40.0 ft/s3-transfer". The write requests which succeeded for the bucket were for user_agent "aws-sdk-go/1.44.253 (go1.20.10; linux; amd64) S3Manager."
@Wayne-H-Ha
This seems to be a dup of #7534: after the checksum header was added, the way the request is verified on IBM S3-compatible storage differs from AWS S3.
Please check the workaround provided in v1.9.2, where you can update the BSL so that the checksum will not be added.
The root cause seems to be a compatibility issue IBM needs to fix, as the same code works against AWS S3.
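For anyone landing here, the v1.9.2 workaround is a key under the BSL's `spec.config`. A sketch, with placeholder bucket, prefix, and s3Url values (substitute your own):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-bucket                 # placeholder
    prefix: my-prefix                 # placeholder
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: https://s3.example.com     # placeholder
    checksumAlgorithm: ""             # empty string disables the new checksum header
```

Equivalently, `kubectl -n velero edit backupstoragelocation default` and add `checksumAlgorithm: ""` under `spec.config`.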
@reasonerjt Thanks for letting me know there is a workaround in v1.9.2. We run velero install as follows:
Does that mean we need to add the workaround like the following?
Note our region could be us-east or us-south. Each region has 3 zones e.g. us-east-1, 2 or 3.
@reasonerjt Not sure if I specified the workaround correctly but I have tried it with velero 1.13.2 and plugin 1.9.2 and now I am getting the following error. Please advise.
That seems related to https://github.com/vmware-tanzu/velero/issues/7693
We have the very same issue with velero 6.6.0 / 1.9.2 and MinIO as the on-prem S3 storage.
Not sure how to apply checksumAlgorithm via the Helm chart deployment (there is no option in https://artifacthub.io/packages/helm/vmware-tanzu/velero?modal=values). Our config looks like this right now:
Similar to the people before us, it used to work with the older plugin.
Thanks for any help
For me, applying the undocumented checksumAlgorithm flag fixes the issue.

Same issue with NetApp StorageGRID; it can be fixed by setting "checksumAlgorithm" to an empty string.
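Since the Helm chart passes the BSL `config` map through as-is, the undocumented checksumAlgorithm key can be set in values even without a dedicated chart option. A sketch, assuming a recent chart layout (verify against your chart version; bucket and URL are placeholders):

```yaml
# values.yaml fragment for the vmware-tanzu/velero chart
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: my-bucket                      # placeholder
      config:
        region: us-east-1
        s3ForcePathStyle: "true"
        s3Url: https://minio.example.com     # placeholder
        checksumAlgorithm: ""
```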
@reasonerjt I have tried velero v1.14.0 and aws plugin v1.10.0 and I still have the same problem. Please let me know how to avoid it. Here is what I found in the log:
This is the checksum algorithm I set in the backup location. Please note I was told to set region to us-east-1 for all S3 regions, including the s3Url in eu-de:
Should I try other supported values "CRC32", "CRC32C", "SHA1", "SHA256"?
@reasonerjt We have checked with the IBM COS support team and checksumAlgorithm="" is not an option for us, since IBM COS expects the checksum to be SHA256.
When we set checksumAlgorithm="SHA256", the request failed with a 403 Access Denied because the authentication username ends with @aws, which indicates the signature IBM COS calculated does not match the signature provided by Velero / the AWS S3 SDK.
Can you let us know what AWS signature version Velero uses, or how we can figure it out? The signatureVersion configuration in the BSL defaults to 4, but is that the same signature version used in conjunction with checksumAlgorithm to sign?
IBM COS says they expect signatureVersion 4 and you can find more details in:
https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-hmac-signature
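For context on what's being compared here: SigV4 ("AWS4-HMAC-SHA256") derives a signing key from the secret key, date, region, and service, then uses it to HMAC the canonical request, which includes the payload's content hash. A minimal stdlib sketch of the key derivation (illustrative only; not Velero's or the SDK's actual code, and the inputs below are made-up placeholders):

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    # One step of the SigV4 key-derivation chain
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS Signature Version 4 signing key (AWS4-HMAC-SHA256)."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)  # date as YYYYMMDD
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


# Placeholder inputs, just to show the shape of the derivation:
key = sigv4_signing_key("EXAMPLE-SECRET", "20240402", "us-east-1", "s3")
print(key.hex())
```

If client and server disagree on any input to this derivation, or on the payload hash (e.g. the chunked-upload content hash IBM's log complains about), the final signatures won't match and the request is rejected with 403.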
I could be wrong, but it looks like IBM S3 is expecting AWS4-HMAC-SHA256.
@weshayutin We have been using the following backup location config up to Velero 1.12 and AWS plugin 1.8:
But it started to fail when we upgraded Velero to 1.13 and the AWS plugin to 1.9, and later with Velero 1.14 and AWS plugin 1.10. So we have to use Velero 1.13 and 1.14 with AWS plugin 1.8.
We were told we need to set checksumAlgorithm to "" when using newer version of AWS plugin but it didn't work. We also tried setting checksumAlgorithm to "SHA256" and it also didn't work.
What changes we need to make so we can use newer version of AWS plugin?
@Wayne-H-Ha I also use IBM COS and my problem was fixed with empty checksum algorithm for Velero 1.14 and aws plugin 1.10
Here is what my BSL looks like:
Reading your config, the only thing I spotted that's different was s3Url. Should it be https://s3.direct.us-east-1.cloud-object-storage.appdomain.cloud?

To grab the URL I am going to use with Velero, I do the following (adding the https:// prefix). Do you do the same steps?
Did you spot any difference in my BSL compared to yours?
@mateusoliveira43 Thanks for providing your backup location spec so I can compare with mine. I have added config.profile and credential.name and key to mimic what you have. But my backup still fails.
Here is my backup location spec:
Here is my backup error message:
So it looks like it is expecting some checksum algorithm, e.g. SHA256?
@reasonerjt @weshayutin
The IBM Cloud Object Storage (COS) team is asking if it is possible that the Velero team can provide the actual request being sent to IBM COS and got rejected by the receiving side?
If your loglevel is debug, you should be able to see the full request.
You had the RequestID earlier tho, can they not see that?
@kaovilai Not sure if we had the log level set correctly, but I found it documented in https://github.com/vmware-tanzu/velero/blob/v1.14.0/site/content/docs/v1.14/troubleshooting.md#getting-velero-debug-logs. Here is our setting:
With the above setting, I can't find what is being sent in the Velero log and IBM COS can't find it in their log either.
@Wayne-H-Ha are you seeing other debug logs? If not, it might be better to replace the last two lines with a combined --log-level=debug.

(Oh, I'm just noticing that the docs suggest it the way you had it, so the main question is whether you're seeing other "level=debug" logs. There should be many of them if the log level is debug. If there aren't, we'll need to figure out why the setting isn't working. If there are, then we may need to look into what exactly should be logged here, and which of those messages you're seeing and which you aren't.)
I found out that sdk-v2 by default does not produce logs; I'm PRing to the aws plugin in a bit.
You have to set ClientLogMode: https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/logging/#clientlogmode
I see more than 40K entries for level=debug, more than 10K entries for level=info, and only 2 entries for level=error:
@Wayne-H-Ha try this image with debug logging enabled.
ghcr.io/kaovilai/velero-plugin-for-aws:sdk-v2-logging
from https://github.com/vmware-tanzu/velero-plugin-for-aws/pull/207
Then relay that info to IBM COS
@kaovilai Thanks for providing the image with debug logging enabled. I have reproduced the problem and sent the new logs to IBM COS for them to investigate.
Let us know of any updates.
IBM COS said since our bucket has retention policy set, setting checksumAlgorithm to "" will not work for us. They need to implement sdkv2 support in IBM COS.
@Wayne-H-Ha So does this mean a new version of IBM COS will be needed? Is this on the roadmap?