noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

The specified key does not exist #8463

Open javieramirez1 opened 5 days ago

javieramirez1 commented 5 days ago

### Environment info

### Actual behavior

  1. Errors from warp versioned with --concurrent=1000:
    
    warp: <ERROR> stat error:The specified key does not exist.
    warp: <ERROR> stat error:The specified key does not exist.
    warp: <ERROR> download error: The specified key does not exist.
    warp: <ERROR> stat error:The specified key does not exist.
    warp: <ERROR> download error: The specified key does not exist.
    warp: <ERROR> download error: The specified key does not exist.
    warp: Benchmark data written to "warp-versioned-2024-10-14[124514]-keAl.csv.zst"

Mixed operations.

Operation: DELETE, 10%, Concurrency: 1000, Ran 1h0m1s.

Operation: GET, 45%, Concurrency: 1000, Ran 1h0m1s. Errors:10081

Operation: PUT, 15%, Concurrency: 1000, Ran 1h0m1s.

Operation: STAT, 30%, Concurrency: 1000, Ran 1h0m1s. Errors:6922

Cluster Total: 0.43 MiB/s, 742.94 obj/s, 17003 errors over 1h0m0s. Total Errors:17003.



### Expected behavior
1. The warp run completes without errors.

### Steps to reproduce

`warp versioned  --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key"  --obj.size=1k  --duration=1h   --objects=10000 --concurrent=1000  --bucket="bucket50" --insecure --tls`

### More information - Screenshots / Logs / Other output

`Oct 14 13:00:08 c83f2-dan8 [3509675]: [nsfs/3509675] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Resource>/bucket50/Jhl7e1LJ/14.C3I15LHjsQnXi8fk.rnd?versionId=mtime-d4voqndv2cxs-ino-fjecw</Resource><RequestId>m299ff3p-b71c1l-xcx</RequestId></Error> HEAD /bucket50/Jhl7e1LJ/14.C3I15LHjsQnXi8fk.rnd?versionId=mtime-d4voqndv2cxs-ino-fjecw {"host":"172.20.100.61:6443","user-agent":"MinIO (linux; amd64) minio-go/v7.0.66 warp/0.7.7","authorization":"AWS4-HMAC-SHA256 Credential=PylTLnQTwWAPKP71Ipmf/20241014/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=ee6ae569bea2d553a4b5ae8e04757be4dc1efc8df3a890980f33d2e3fae9a0f7","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20241014T171356Z"} Error: No such file or directory - context: Stat _path=/gpfs/remote_rw_fs0/buckets_1/bucket50/Jhl7e1LJ/.versions/14.C3I15LHjsQnXi8fk.rnd_mtime-d4voqndv2cxs-ino-fjecw`
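In the log line above, the failing `Stat` path appears to be built from the object key and the `versionId` as `<dir>/.versions/<name>_<versionId>`. To check on the filesystem whether a given version file really is missing, a small helper along these lines may be handy. This is only a sketch: the function name `version_path` is hypothetical, and the path layout is inferred from this single log line rather than from the noobaa source.

```shell
# Hypothetical helper: build the on-disk path of an object version,
# following the pattern seen in the log line (inferred, not confirmed).
version_path() {
    bucket_dir=$1   # bucket directory on the filesystem
    key=$2          # object key, may contain '/'
    version_id=$3   # versionId from the S3 request
    dir=$(dirname "$key")
    base=$(basename "$key")
    printf '%s/%s/.versions/%s_%s\n' "$bucket_dir" "$dir" "$base" "$version_id"
}

# Example usage (values taken from the log line):
# stat "$(version_path /gpfs/remote_rw_fs0/buckets_1/bucket50 \
#     Jhl7e1LJ/14.C3I15LHjsQnXi8fk.rnd mtime-d4voqndv2cxs-ino-fjecw)"
```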
nadavMiz commented 4 days ago

@javieramirez1

  1. What version of the rpm are you using? There is a new fix, #8455, that deals with these exact errors. It was merged only two days ago. Do you know if it's included in your rpm?
  2. 1k concurrency is very high. We use a retry mechanism in case of competing threads, and we might simply have exhausted the number of retries. You can try either reducing the concurrency or increasing the number of retries by adding NSFS_RENAME_RETRIES=100 to the configuration file.
  3. Can you attach the noobaa logs?
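To confirm that the retry override is actually in place after editing the configuration file, a quick check like the one below may help. It is a minimal sketch: the helper name `rename_retries` is hypothetical, and it assumes the configuration file is JSON with `NSFS_RENAME_RETRIES` as a numeric key. The file path is passed as an argument because its location depends on the deployment.

```shell
# Hypothetical helper: print the configured NSFS_RENAME_RETRIES value, if any.
# Assumes a JSON config file containing an entry like "NSFS_RENAME_RETRIES": 100
rename_retries() {
    # $1: path to the configuration file
    grep -o '"NSFS_RENAME_RETRIES"[[:space:]]*:[[:space:]]*[0-9][0-9]*' "$1" \
        | grep -o '[0-9][0-9]*$'
}

# Example usage (path is a placeholder):
# rename_retries /path/to/config.json
```

If nothing is printed, the key is not set in that file, so whatever default the service uses applies.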
romayalon commented 4 days ago

@nadavMiz The RPM version is noobaa-core-5.17.0-20241012.el9.x86_64; it doesn't include your latest fix.

javieramirez1 commented 4 days ago

I updated the noobaa rpm to this one:

c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241015.el9.x86_64
c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241015.el9.x86_64

but I'm still seeing the same errors. Is the fix included in a different rpm, or should this rpm have it?

nadavMiz commented 3 days ago

I don't know if the fix is included. You can check the logs for a message such as:

`Oct 13 09:14:12 tmtscalets-protocol-1 node[4035721]: 2024-10-13 09:14:12.877917 [PID-4035721/TID-4036208] [L1] FS::FSWorker::Execute: LinkFileAt _wrap->_path=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1 _wrap->_fd=24 _filepath=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1/myobj _should_not_override=0 took: 2.10294 ms`

The key is that `_should_not_override` should appear in the message.

If the fix is included and increasing the number of retries doesn't help, or if you are not sure, please attach the noobaa logs so we can investigate.
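The log check described above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `has_override_marker` is hypothetical, and the log file path is an argument because where the noobaa logs end up depends on the setup. The sample message is tagged `[L1]`, so it may only show up when a more verbose log level is enabled.

```shell
# Hypothetical helper: succeed if the log contains a LinkFileAt message
# that carries the _should_not_override field.
has_override_marker() {
    # $1: path to a noobaa log file
    grep -q 'LinkFileAt.*_should_not_override=' "$1"
}

# Example usage (path is a placeholder):
# if has_override_marker /path/to/noobaa.log; then
#     echo "marker found"
# else
#     echo "marker not found"
# fi
```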

javieramirez1 commented 2 days ago

I updated the rpm to this one: c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64. I added the change to the config and restarted s3, and it no longer blocks my warp workloads. (When I added the change with the previous rpm, the run started with many `connection refused` errors, then ended with `connection reset by peer`, and from then on every run I tried failed immediately like this: `warp: <ERROR> Error preparing server. Get "https://172.20.100.62:6443/bucket53/?location=": Connection closed by foreign host https://172.20.100.62:6443/bucket53/?location=. Retry again.`)

When the warp run I'm currently running finishes, I'll report the results and attach the log (I already enabled loglevel=all).