Originally posted by **Wayne-H-Ha** March 19, 2024
We used to be able to create backups using velero 1.12.2 and aws plugin 1.8.2.
We tried velero 1.13.0 and plugin 1.9.0 and it failed, so we switched back to the older versions.
We tried again with velero 1.13.1 and plugin 1.9.1 and it still fails. Is there any configuration change we need to make in order to use the new version?
We tried to find the backup in S3 and it was not uploaded there.
When we describe the backup, it returns:
```
velero-v1.13.1-linux-amd64/velero describe backup cp-20240319163110 | tail
Started: 2024-03-19 16:32:01 +0000 UTC
Completed:
Expiration: 2024-04-18 16:32:01 +0000 UTC
Total items to be backed up: 2871
Items backed up: 2871
Backup Volumes:
```
We believe the problem is that an "@aws" suffix is added to the key ID. For example, aws_access_key_id = "3..0", but "3..0@aws" is passed to S3. Is there a configuration we can use to prevent this suffix from being added?
```
cat /credentials/cloud
[default]
aws_access_key_id = "3..0"
aws_secret_access_key = "a..b"
```
Discussed in https://github.com/vmware-tanzu/velero/discussions/7542
It looks like you may have the wrong CRDs installed. BackupVolumeInfos was a new value added to `spec.target.kind` for DownloadRequest in 1.13. If you're trying to run Velero 1.13 but have Velero 1.12 CRDs installed, that would explain the error.

Thanks for the quick response. I found in the doc I can run:
So I ran the above using velero 1.12.2 and 1.13.1 and, as you said, I found BackupVolumeInfos in the output produced by 1.13.1:
My next question is: how do I update the CRDs from 1.12.2 to 1.13.1?
Here is one doc that you could reference
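(For reference, the Velero upgrade docs update the CRDs by regenerating them from the new CLI and applying them to the cluster; a sketch, assuming the v1.13.x `velero` binary is on your PATH:)

```bash
# Regenerate the CRDs with the new CLI and apply them to the cluster
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
```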
Thanks for the link to the doc. I have run the following:
But the backup still fails:
Maybe it is still adding the "@aws" suffix to the key ID?
Not sure about the @aws suffix. IMO, there is no need to add that. Could you post the error information of the failed backup?
Thanks for looking into this problem. Here is the error I found for the failed backup:
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero info time="2024-03-20T02:30:53Z" level=info msg="Setting up backup store to persist the backup" backup=velero/cp-20240320023032 logSource="pkg/controller/backup_controller.go:729"
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero error time="2024-03-20T02:30:53Z" level=error msg="Error uploading log file" backup=cp-20240320023032 bucket=codeengine-cp-dev-relint error="rpc error: code = Unknown desc = error putting object dev-relint-controlplane/backups/cp-20240320023032/cp-20240320023032-logs.gz: operation error S3: PutObject, https response error StatusCode: 403, RequestID: b54ad6b1-c6a4-443f-9e99-be04b978a9bf, HostID: , api error AccessDenied: Access Denied" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:253" error.function="main.(*ObjectStore).PutObject" logSource="pkg/persistence/object_store.go:252" prefix=dev-relint-controlplane
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero error time="2024-03-20T02:30:53Z" level=error msg="backup failed" backuprequest=velero/cp-20240320023032 controller=backup error="rpc error: code = Unknown desc = error putting object dev-relint-controlplane/backups/cp-20240320023032/velero-backup.json: operation error S3: PutObject, https response error StatusCode: 403, RequestID: 74bbf24c-dc0a-43c2-8b92-3796929fd421, HostID: , api error AccessDenied: Access Denied" logSource="pkg/controller/backup_controller.go:288"
Mar 19 22:31:03 velero-58c946d54d-k5xdt velero info time="2024-03-20T02:30:53Z" level=info msg="Updating backup's final status" backuprequest=velero/cp-20240320023032 controller=backup logSource="pkg/controller/backup_controller.go:307"
The s3 support recommended:
We usually see a suffix of @aws in the access_key_id of HMAC access when the s3 signature/presigned URL is not correct. We suggest engaging Velero support to investigate if they have a behavior change on the s3 signature/presigned in their new version.
What is the backup repository's backend? Is it AWS S3 or an on-premises object storage service?
The S3 backend is IBM Cloud Object Storage that behaves like AWS S3.
Hmm. Looks like you may have the wrong bucket permissions for your s3 bucket. See the bucket policies section at https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/README.md and compare with what you have.
Thanks for the link to the documentation. As I mentioned earlier, velero 1.12.2 and aws plugin 1.8.2 backups used to work for us, so I'm not sure why it stopped working when we upgraded to 1.13.0/1.9.0 or 1.13.1/1.9.1. Here is the velero install command we have used for many versions of velero, including 1.11 and earlier:
I can't think of any changes we've made to the way we handle uploads that would trigger new permission requirements between 1.12 and 1.13, although maybe there's something I'm not aware of. It may be worth creating a new bucket and making sure it has the recommended bucket policy in place to see whether this works, which will eliminate the possibility that something changed in the bucket itself.
We tried the following combinations:
| Velero | aws plugin | Result |
| ------ | ---------- | ------ |
| 1.12.2 | 1.8.2      | works  |
| 1.12.2 | 1.9.1      | fails  |
| 1.13.1 | 1.8.2      | works  |
| 1.13.1 | 1.9.1      | fails  |
So we suspect aws plugin 1.9.1 is adding "@aws" to the end of the key ID, causing velero to fail to upload backups to IBM Cloud Object Storage?
The issue may relate to the AWS SDK version bump in the Velero AWS plugin v1.9. Could you give more information about your suspected @aws suffix? Did you see it in the secret, the pod, or the Velero log?
I contacted IBM Cloud Object Storage and they said they found the following in their log (note the "@aws" suffix at the end of remote_user):
We have the same issue as described here and we are using official Amazon S3. Let me know if you need any logs
IMO, this "@aws" may not be an issue. The 403 error code implies permission denied. Is there any possibility that the permissions for the Velero role are insufficient?
As I mentioned previously, we have tried the newest velero 1.13.1 with the newest plugin 1.9.1 and it failed, but if we switch to the older plugin 1.8.2 it works. In both cases, we have the same permissions.
@Wayne-H-Ha
Since aws-plugin v1.9.x, we've switched to aws-sdk-go-v2, so there might be a compatibility issue; some change in sdk-v2 makes IBM Object Storage think @aws was added. Is it possible to check with IBM and let them explain how the `remote_user` was extracted?

I may look into the code, but I can't commit to a fix because the plugin currently works with AWS S3 and S3-compatible storage (MinIO) in our pipeline.
@reasonerjt Yes, I will report your findings to IBM Cloud Object Storage. But please also be informed that @Alwinius said he also has the problem with Amazon S3.
I also experienced the problem in IBM Cloud with aws plugin v1.9.1.
@reasonerjt IBM Cloud Object Storage team replied:
The expected remote user should be the HMAC access key ID without the trailing @aws.
For example, in "3f3dad27c65d41b4835b8a3be6d91cb0@aws", "3f3dad27c65d41b4835b8a3be6d91cb0" is the expected access key ID.
@Wayne-H-Ha So if @aws is not in the credentials file, you will need to check with IBM where it comes from; they will need to check their code to find out. I briefly checked the SDK and didn't find it adding the suffix.

@reasonerjt I just got the reply from IBM Cloud Object Storage (COS). I hope you understand the reply, as I don't have enough knowledge to digest the information.
COS internally managed to capture debug-logged requests for the HTTP 403 PUTs. Specifically, the AWS signature does not match what they expect, so they stop processing the request any further.
Request_id 1) 0ed2fc0b-acf8-4d05-b003-dd5a1bf1b072:
2024-04-02 03:30:32.330 DEBUG [etp466364426-20827] {s3.auth:56ac6033-f67f-4ba2-a2e0-b7b65350824d} org.cleversafe.s3.auth.AwsAuthenticator - Invalid AWS V4 Chunked Headers: Incorrect value for Content Hash on Chunked Put Request
in the other:
Request_id 2) 5982df29-85a9-4492-9573-54aaba4b484e:
2024-04-02 03:30:32.319 DEBUG [etp579017959-19571] {s3.auth:a4493674-ffdc-48c7-920c-2133c490c197} org.cleversafe.s3.auth.AwsAuthenticator - Invalid AWS V4 Chunked Headers: Incorrect value for Content Hash on Chunked Put Request
Checking COS logs, they can see all HTTP 403 for PUT were for user_agent "aws-sdk-go-v2/1.21.0 os/linux lang/go#1.21.6 md/GOOS#linux md/GOARCH#amd64 api/s3#1.40.0 ft/s3-transfer". The write requests which succeeded for the bucket were for user_agent "aws-sdk-go/1.44.253 (go1.20.10; linux; amd64) S3Manager."
@Wayne-H-Ha
This seems to be a dup of #7534: after the checksum header was added, the way the request is verified on IBM S3-compatible storage differs from AWS S3.
Please check the workaround provided in v1.9.2, where you can update the BSL so that the checksum will not be added.
The root cause seems to be a compatibility issue IBM needs to fix, as the same code works against AWS S3.
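For anyone landing here, the v1.9.2 workaround is a key under the BSL's `spec.config`. A sketch, with placeholder bucket, prefix, and s3Url values (substitute your own):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-bucket                 # placeholder
    prefix: my-prefix                 # placeholder
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: https://s3.example.com     # placeholder
    checksumAlgorithm: ""             # empty string disables the new checksum header
```

Equivalently, `kubectl -n velero edit backupstoragelocation default` and add `checksumAlgorithm: ""` under `spec.config`.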
@reasonerjt Thanks for letting me know there is a workaround in v1.9.2. We run velero install as follows:
Does that mean we need to add the workaround like the following?
Note our region could be us-east or us-south. Each region has 3 zones e.g. us-east-1, 2 or 3.
@reasonerjt Not sure if I specified the workaround correctly but I have tried it with velero 1.13.2 and plugin 1.9.2 and now I am getting the following error. Please advise.
That seems related to https://github.com/vmware-tanzu/velero/issues/7693
We have the very same issue with velero 6.6.0 / 1.9.2 and MinIO as the on-prem S3 storage.
Not sure how to apply checksumAlgorithm via the Helm chart deployment (there is no option in https://artifacthub.io/packages/helm/vmware-tanzu/velero?modal=values). Our config looks like this right now:
Similar to the people before us, it used to work with the older plugin.
Thanks for any help
For me, applying the undocumented checksumAlgorithm flag fixes the issue.

Same issue with NetApp StorageGRID; it can be fixed by setting "checksumAlgorithm" to an empty string.
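Since the Helm chart passes the BSL `config` map through as-is, the undocumented checksumAlgorithm key can be set in values even without a dedicated chart option. A sketch, assuming a recent chart layout (verify against your chart version; bucket and URL are placeholders):

```yaml
# values.yaml fragment for the vmware-tanzu/velero chart
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: my-bucket                      # placeholder
      config:
        region: us-east-1
        s3ForcePathStyle: "true"
        s3Url: https://minio.example.com     # placeholder
        checksumAlgorithm: ""
```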
@reasonerjt I have tried velero v1.14.0 and aws plugin v1.10.0 and I still have the same problem. Please let me know how to avoid it. Here is what I found in the log:
This is the checksum algorithm I set in the backup location. Please note I was told to set region to us-east-1 for all S3 regions, including the s3Url in eu-de:
Should I try other supported values "CRC32", "CRC32C", "SHA1", "SHA256"?
@reasonerjt We have checked with the IBM COS support team and checksumAlgorithm="" is not an option for us, since IBM COS expects the checksum to be SHA256.
When we set checksumAlgorithm="SHA256", the request failed with a 403 Access Denied because the authentication username ends with @aws, which indicates the signature IBM COS calculated does not match the signature provided by Velero / the AWS S3 SDK.
Can you let us know what AWS signature version Velero uses, or how we can figure it out? The signatureVersion configuration in the BSL defaults to 4, but is that the same signature version used in conjunction with checksumAlgorithm to sign?
IBM COS says they expect signatureVersion 4 and you can find more details in:
https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-hmac-signature
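For context on what's being compared here: SigV4 ("AWS4-HMAC-SHA256") derives a signing key from the secret key, date, region, and service, then uses it to HMAC the canonical request, which includes the payload's content hash. A minimal stdlib sketch of the key derivation (illustrative only; not Velero's or the SDK's actual code, and the inputs below are made-up placeholders):

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    # One step of the SigV4 key-derivation chain
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS Signature Version 4 signing key (AWS4-HMAC-SHA256)."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)  # date as YYYYMMDD
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


# Placeholder inputs, just to show the shape of the derivation:
key = sigv4_signing_key("EXAMPLE-SECRET", "20240402", "us-east-1", "s3")
print(key.hex())
```

If client and server disagree on any input to this derivation, or on the payload hash (e.g. the chunked-upload content hash IBM's log complains about), the final signatures won't match and the request is rejected with 403.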
I could be wrong, but it looks like IBM S3 is expecting AWS4-HMAC-SHA256.
@weshayutin We have been using the following backup location config up to Velero 1.12 and AWS plugin 1.8:
But it started to fail when we upgraded Velero to 1.13 and the AWS plugin to 1.9, and later with Velero 1.14 and AWS plugin 1.10. So we have to use Velero 1.13 and 1.14 with AWS plugin 1.8.
We were told we need to set checksumAlgorithm to "" when using newer version of AWS plugin but it didn't work. We also tried setting checksumAlgorithm to "SHA256" and it also didn't work.
What changes we need to make so we can use newer version of AWS plugin?
@Wayne-H-Ha I also use IBM COS and my problem was fixed with empty checksum algorithm for Velero 1.14 and aws plugin 1.10
Here is what my BSL looks like:
Reading your config, the only thing I spotted that's different was s3Url. Should it be https://s3.direct.us-east-1.cloud-object-storage.appdomain.cloud?

To grab the URL I am going to use with Velero, I do the following (adding the https:// prefix). Do you do the same steps?
Did you spot any difference in my BSL compared to yours?
@mateusoliveira43 Thanks for providing your backup location spec so I can compare with mine. I have added config.profile and credential.name and key to mimic what you have. But my backup still fails.
Here is my backup location spec:
Here is my backup error message:
So it looks like it is expecting some checksum algorithm, e.g. SHA256?
@reasonerjt @weshayutin
The IBM Cloud Object Storage (COS) team is asking if it is possible that the Velero team can provide the actual request being sent to IBM COS and got rejected by the receiving side?
If your loglevel is debug, you should be able to see the full request.
You had the RequestID earlier tho, can they not see that?
@kaovilai Not sure if we had the log level set correctly, but I found it documented in https://github.com/vmware-tanzu/velero/blob/v1.14.0/site/content/docs/v1.14/troubleshooting.md#getting-velero-debug-logs. Here is our setting:
With the above setting, I can't find what is being sent in the Velero log and IBM COS can't find it in their log either.
@Wayne-H-Ha are you seeing other debug logs? If not, it might be better to replace the last two lines with a combined --log-level=debug.

(Oh, I'm just noticing that the docs suggest it the way you had it, so the main question is whether you're seeing other "level=debug" logs. There should be many of them if the log level is debug. If there aren't, we'll need to figure out why the setting isn't working. If there are, then we may need to look into what exactly should be logged here, and which of those messages you're seeing and which you aren't.)
I found out that sdk-v2 by default does not produce logs; I'm PRing to the aws plugin in a bit.
You have to set ClientLogMode: https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/logging/#clientlogmode
I see more than 40K entries for level=debug, more than 10K entries for level=info, and only 2 entries for level=error:
@Wayne-H-Ha try this image with debug logging enabled.
ghcr.io/kaovilai/velero-plugin-for-aws:sdk-v2-logging
from https://github.com/vmware-tanzu/velero-plugin-for-aws/pull/207
Then relay that info to IBM COS
@kaovilai Thanks for providing the image with debug logging enabled. I have reproduced the problem and sent the new logs to IBM COS for them to investigate.
Let us know of any updates.
IBM COS said since our bucket has retention policy set, setting checksumAlgorithm to "" will not work for us. They need to implement sdkv2 support in IBM COS.
@Wayne-H-Ha So does this mean a new version of IBM COS will be needed? Is this on the roadmap?