noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

NSFS | Concurrency Tests | Warp fails on The specified key does not exist - concurrency = 1000 #8463

Closed javieramirez1 closed 4 weeks ago

javieramirez1 commented 1 month ago

Environment info

Actual behavior

  1. Errors from warp versioned with --concurrent=1000:
    
    warp: <ERROR> stat error:The specified key does not exist.
    warp: <ERROR> stat error:The specified key does not exist.
    warp: <ERROR> download error: The specified key does not exist.
    warp: <ERROR> stat error:The specified key does not exist.
    warp: <ERROR> download error: The specified key does not exist.
    warp: <ERROR> download error: The specified key does not exist.
    warp: Benchmark data written to "warp-versioned-2024-10-14[124514]-keAl.csv.zst"

Mixed operations.

Operation: DELETE, 10%, Concurrency: 1000, Ran 1h0m1s.

Operation: GET, 45%, Concurrency: 1000, Ran 1h0m1s. Errors:10081

Operation: PUT, 15%, Concurrency: 1000, Ran 1h0m1s.

Operation: STAT, 30%, Concurrency: 1000, Ran 1h0m1s. Errors:6922

Cluster Total: 0.43 MiB/s, 742.94 obj/s, 17003 errors over 1h0m0s. Total Errors:17003.



### Expected behavior
1. The workload completes without errors.

### Steps to reproduce

`warp versioned  --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key"  --obj.size=1k  --duration=1h   --objects=10000 --concurrent=1000  --bucket="bucket50" --insecure --tls`

### More information - Screenshots / Logs / Other output

`Oct 14 13:00:08 c83f2-dan8 [3509675]: [nsfs/3509675] ESC[31m[ERROR]ESC[39m core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Resource>/bucket50/Jhl7e1LJ/14.C3I15LHjsQnXi8fk.rnd?versionId=mtime-d4voqndv2cxs-ino-fjecw</Resource><RequestId>m299ff3p-b71c1l-xcx</RequestId></Error> HEAD /bucket50/Jhl7e1LJ/14.C3I15LHjsQnXi8fk.rnd?versionId=mtime-d4voqndv2cxs-ino-fjecw {"host":"172.20.100.61:6443","user-agent":"MinIO (linux; amd64) minio-go/v7.0.66 warp/0.7.7","authorization":"AWS4-HMAC-SHA256 Credential=PylTLnQTwWAPKP71Ipmf/20241014/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=ee6ae569bea2d553a4b5ae8e04757be4dc1efc8df3a890980f33d2e3fae9a0f7","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20241014T171356Z"} Error: No such file or directory - context: Stat _path=/gpfs/remote_rw_fs0/buckets_1/bucket50/Jhl7e1LJ/.versions/14.C3I15LHjsQnXi8fk.rnd_mtime-d4voqndv2cxs-ino-fjecw`
nadavMiz commented 1 month ago

@javieramirez1

  1. What version of the rpm are you using? There is a new fix #8455 that deals with these exact errors; it was merged only two days ago. Do you know if it's included in your rpm?
  2. 1k concurrency is very high. We use a retry mechanism in case of competing threads, and we might have just exhausted the number of retries. You could try either reducing the concurrency, or increasing the number of retries by adding NSFS_RENAME_RETRIES=100 to the configuration file.
  3. Can you attach the noobaa logs?
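
For reference, a sketch of what that override might look like, assuming an NSFS standalone deployment where overrides are placed in a `config.json` under the NooBaa config directory (the exact path varies by deployment and is an assumption here, not a confirmed location):

```json
{
    "NSFS_RENAME_RETRIES": 100
}
```

After editing the file, restart the s3 service so the endpoint picks up the new value.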
romayalon commented 1 month ago

@nadavMiz RPM version is noobaa-core-5.17.0-20241012.el9.x86_64; it doesn't include your latest fix

javieramirez1 commented 1 month ago

I updated the noobaa rpm to this one: c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241015.el9.x86_64, c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241015.el9.x86_64, but I'm still seeing the same errors. Is the fix included in another rpm, or should this rpm have it?

nadavMiz commented 1 month ago

I don't know if the fix is included. You can check the logs for a message such as:

    Oct 13 09:14:12 tmtscalets-protocol-1 node[4035721]: 2024-10-13 09:14:12.877917 [PID-4035721/TID-4036208] [L1] FS::FSWorker::Execute: LinkFileAt _wrap->_path=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1 _wrap->_fd=24 _filepath=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1/myobj _should_not_override=0 took: 2.10294 ms

The key is that `_should_not_override` should be included in the message.
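
As an illustration, that check can be scripted. The sample line below is copied from the message above and stands in for the real `/var/log/messages` on the NooBaa node:

```shell
# Write a sample trace line (from the comment above) to a scratch file;
# on a real node you would grep /var/log/messages directly.
cat > sample.log <<'EOF'
2024-10-13 09:14:12.877917 [PID-4035721/TID-4036208] [L1] FS::FSWorker::Execute: LinkFileAt _wrap->_path=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1 _wrap->_fd=24 _filepath=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1/myobj _should_not_override=0 took: 2.10294 ms
EOF
# A non-zero count of LinkFileAt lines carrying _should_not_override
# suggests the fix is present in the running build.
grep 'LinkFileAt' sample.log | grep -c '_should_not_override'
```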

In case the fix is included and increasing the number of retries doesn't help, or you are not sure, please attach noobaa logs so we can investigate

javieramirez1 commented 1 month ago

I updated the rpm to c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64, added the change to the config, and restarted s3, and it no longer blocks my warp workloads. (When I added the change with the previous rpm, runs started with many "connection refused" errors, then ended with "connection reset by peer", and from then on any run I tried failed automatically like this: `warp: <ERROR> Error preparing server. Get "https://172.20.100.62:6443/bucket53/?location=": Connection closed by foreign host. Retry again.`) When the warp run I'm currently executing finishes, I'll report the results and add the log (I already enabled loglevel=all).

nadavMiz commented 1 month ago

@javieramirez1 looking at the logs you sent on slack I see two main issues:

  1. there seem to be files where the version-id in the extended attributes doesn't match the file's name. This might indicate corruption:

    Oct 15 13:10:06 c83f2-dan10 [1639045]: [nsfs/1639045]  [WARN] core.sdk.namespace_fs:: NamespaceFS.read_object_md: retrying retries=1 file_path=/gpfs/remote_rw_fs0/buckets_1/bucket50/zTYkb3SF/.versions/4.e
    hrPrMJnZikl5HrE.rnd_mtime-d4wjicrgv3sw-ino-15s2h [Error: No such file or directory] { code: 'ENOENT', context: 'Stat _path=/gpfs/remote_rw_fs0/buckets_1/bucket50/zTYkb3SF/.versions/4.ehrPrMJnZikl5HrE.rnd_
    mtime-d4wjicrgv3sw-ino-15s2h ' }
    Oct 15 13:10:06 c83f2-dan10 [1639045]: [nsfs/1639045] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exi
    st.</Message><Resource>/bucket50/zTYkb3SF/4.ehrPrMJnZikl5HrE.rnd?versionId=mtime-d4wjicrgv3sw-ino-15s2h</Resource><RequestId>m2ap84um-65kshx-thy</RequestId></Error> GET /bucket50/zTYkb3SF/4.ehrPrMJnZikl5H
    rE.rnd?versionId=mtime-d4wjicrgv3sw-ino-15s2h {"host":"172.20.100.66:6443","user-agent":"MinIO (linux; amd64) minio-go/v7.0.66 warp/0.7.7","authorization":"AWS4-HMAC-SHA256 Credential=PylTLnQTwWAPKP71Ipmf
    /20241015/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=69e6cff9a23d8e9918046b0cf7144990655ffdf856cbcc3d64ded7df71364f47","x-amz-content-sha256":"e3b0c44298fc1c1
    49afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20241015T172025Z"} Error: No such file or directory - context: Stat _path=/gpfs/remote_rw_fs0/buckets_1/bucket50/zTYkb3SF/.versions/4.ehrPr
  2. at the end of the log file there is a panic for too many open files:

    Oct 15 23:55:49 c83f2-dan10 [2771795]: [nsfs/2771795] [ERROR] CONSOLE:: PANIC: process uncaughtException Error: EMFILE: too many open files, uv_resident_set_memory    at process.memoryUsage (node:internal
    /process/per_thread:168:5)    at Timeout.memory_monitor [as _onTimeout] (/usr/local/noobaa-core/src/util/panic.js:40:23)    at listOnTimeout (node:internal/timers:573:17)    at process.processTimers (node

This might be what's causing the "Connection closed by foreign host" error.

The issue is that the log file spans 12 hours and it's hard to understand which error relates to which issue, for which rpm. Also, when the errors for the first issue happen, the log level seems to be at its lowest, so I am missing a lot of logs.

Also, if the issue happens after 1 hour, there is no need to run it for 8 hours. Can you replicate the issue and send the logs again with full logging, specifying the start and end times of the warp run? You can set the logs to the highest level by running `mms3 config change DEBUGLEVEL="all"`. Also, if you use the /var/log/messages logs, you can run `tail -f /var/log/messages > logs.txt` before the warp command on the noobaa node, and abort it when warp finishes. This will store only the logs printed during the warp run in the logs.txt file, so the log file will be much smaller.
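
The capture procedure above can be sketched as a small script. The stand-in log file makes the sketch runnable anywhere; on the NooBaa node `LOG` would point at `/var/log/messages` (and the truncation line would be dropped):

```shell
# Capture only the log lines emitted while the benchmark runs.
LOG="${LOG:-demo-messages.log}"     # on a real node: LOG=/var/log/messages
: > "$LOG"                          # create the stand-in log for this sketch only
tail -f "$LOG" > logs.txt &
TAIL_PID=$!
# ... run the warp command on the client here ...
echo "noobaa: line emitted during the benchmark" >> "$LOG"
sleep 2                             # give tail time to flush
kill "$TAIL_PID"
# logs.txt now holds only what was printed during the run.
```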

javieramirez1 commented 4 weeks ago

The following steps were performed:

1) rpm upgrade: c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241026.el9.x86_64, c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241026.el9.x86_64

2) restart s3

3) bucket versioning enable:

    AWS_ACCESS_KEY_ID=38MmE379Nxz54ELdjjUa AWS_SECRET_ACCESS_KEY=7Cciy3WHrMGKZbgBS6Sus5zhVlYz7rHPpkVIvUqW aws --endpoint https://172.20.100.60:6443 --no-verify-ssl s3api put-bucket-versioning --bucket bucket5001 --versioning-configuration Status=Enabled

    [root@c83f0-app1 scriprts]# AWS_ACCESS_KEY_ID=38MmE379Nxz54ELdjjUa AWS_SECRET_ACCESS_KEY=7Cciy3WHrMGKZbgBS6Sus5zhVlYz7rHPpkVIvUqW aws --endpoint https://172.20.100.61:6443/ --no-verify-ssl s3api get-bucket-versioning --bucket bucket5001
    urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host '172.20.100.61'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
    { "Status": "Enabled" }

4) Run the next workload, adding the flag --stress:

    warp versioned --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --duration=1h -- stress --objects=10000 --concurrent=100 --bucket="bucket5001" --insecure –tls

After the workload completed at the end of the hour, warp strangely does not show the final results, and the results file is not in the directory where it should be (the system has enough space). The noobaa log does not show anything about it.

I attached the log via Slack. The log was cleared so that its start is the start of the test.

javieramirez1 commented 4 weeks ago

After the last comment the following steps were performed:

1) change the parameter NSFS_RENAME_RETRIES=1000 to NSFS_RENAME_RETRIES=100

2) restart s3

3) retry the workload from comment 20, point 4

The same results were observed: when the workload finished after 1 hour, it does not show warp's final results, and they do not appear in the directory they should be in.

I attached the log via Slack. The log was cleared so that its start is the start of the test.

romayalon commented 4 weeks ago

@javieramirez1 2 questions -

  1. I see you have a space between -- and the stress flag; did you run it without the space?

    4)run the next workload addign the flag --stress warp versioned --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --duration=1h -- stress --objects=10000 --concurrent=100 --bucket="bucket5001" --insecure –tls

  2. do you see in the new logs the PANIC Nadav mentioned in https://github.com/noobaa/noobaa-core/issues/8463#issuecomment-2440100023 ?
    Oct 15 23:55:49 c83f2-dan10 [2771795]: [nsfs/2771795] [ERROR] CONSOLE:: PANIC: process uncaughtException Error: EMFILE: too many open files, uv_resident_set_memory    at process.memoryUsage (node:internal
    /process/per_thread:168:5)    at Timeout.memory_monitor [as _onTimeout] (/usr/local/noobaa-core/src/util/panic.js:40:23)    at listOnTimeout (node:internal/timers:573:17)    at process.processTimers (node
romayalon commented 4 weeks ago

@javieramirez1 @nadavMiz I downloaded the logs and couldn't find PANIC:

    less -R ../Downloads/Ver-100-1h-noobaa.log | grep PANIC
    less -R ../Downloads/Ver-rename100-100-1h-noobaa.log | grep PANIC

nadavMiz commented 4 weeks ago

In the new logs the issue is different: we now get ENOENT when getting / heading versions:

Oct 28 20:05:09 c83f2-dan10 [3852289]: [nsfs/3852289]  [WARN] core.sdk.namespace_fs:: NamespaceFS.read_object_md: retrying retries=1 file_path=/gpfs/remote_rw_fs0/buckets_2/bucket5001/(T)85WK5/.versions/6
7.lXqtres9RNBuR4Xa.rnd_mtime-d57tem9337k0-ino-t2x5f [Error: No such file or directory] { code: 'ENOENT', context: 'Stat _path=/gpfs/remote_rw_fs0/buckets_2/bucket5001/(T)85WK5/.versions/67.lXqtres9RNBuR4X
a.rnd_mtime-d57tem9337k0-ino-t2x5f ' }
Oct 28 20:05:09 c83f2-dan10 [3852289]: [nsfs/3852289] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Resource>/bucket5001/%28T%2985WK5/67.lXqtres9RNBuR4Xa.rnd?versionId=mtime-d57tem9337k0-ino-t2x5f</Resource><RequestId>m2torxie-bpipd9-chr</RequestId></Error> HEAD /bucket5001/%28T%2985WK5/67.lXqtres9RNBuR4Xa.rnd?versionId=mtime-d57tem9337k0-ino-t2x5f {"host":"172.20.100.67:6443","user-agent":"MinIO (linux; amd64) minio-go/v7.0.66 warp/0.7.7","authorization":"AWS4-HMAC-SHA256 Credential=38MmE379Nxz54ELdjjUa/20241029/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=80af6395eae063609c5f37f5b141e9ed0843e4565ba856156154e6df29e9cf2b","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20241029T001449Z"} Error: No such file or directory - context: Stat _path=/gpfs/remote_rw_fs0/buckets_2/bucket5001/(T)85WK5/.versions/67.lXqtres9RNBuR4Xa.rnd_mtime-d57tem9337k0-ino-t2x5f

This indicates that the version is missing completely. The issue was reproduced on my environment. It happens when the version we try to delete is the latest version, and the latest version is replaced before we delete it; we then delete the wrong version. I added a check that the version we try to delete is the intended one.

I ran warp on my system for 30 minutes with 1000 concurrency; there were no issues:

[root@tmtscalets-master warp]# ./warp versioned --host=10.11.87.62:6443 --access-key="3MwC7LPBgfQ1maEg4ZLN" --secret-key="auCOhJrI6znaabNZv3oNr4d4QltxWg3Z/sJodY1H" --obj.size=1k --concurrent=1000 --duration=30m --bucket=warp-bu5ket1 --insecure --tls --objects=10000
warp: Benchmark data written to "warp-versioned-2024-10-29[050626]-34vw.csv.zst"                                                                                                                            
Mixed operations.
Operation: DELETE, 10%, Concurrency: 1000, Ran 28m38s.
 * Throughput: 15.33 obj/s

Operation: GET, 45%, Concurrency: 1000, Ran 28m38s.
 * Throughput: 0.07 MiB/s, 68.98 obj/s

Operation: PUT, 15%, Concurrency: 1000, Ran 28m38s.
 * Throughput: 0.02 MiB/s, 22.89 obj/s

Operation: STAT, 30%, Concurrency: 1000, Ran 28m38s.
 * Throughput: 46.03 obj/s

Cluster Total: 0.09 MiB/s, 153.09 obj/s over 28m38s.
warp: Cleanup Done.[root@tmtscalets-master warp]#                                                                                                                                                           
nadavMiz commented 4 weeks ago

@javieramirez1 in the new logs, neither the panic nor the wrong-version deletion happened, and I couldn't reproduce them. If they happen again, please open a new bug for them.

javieramirez1 commented 4 weeks ago

Hi @nadavMiz, one question: I saw that you didn't use the flag --stress. Should I try to reproduce it without that flag?

nadavMiz commented 3 weeks ago

@javieramirez1 From the documentation, --stress is used for long-running tests. I am not sure whether it matters, but you might as well add it for good measure, especially if you get failures after a long time.

javieramirez1 commented 3 weeks ago

The following steps were performed:

1) rpm upgrade: c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241030.el9.x86_64, c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241030.el9.x86_64

2) restart s3

3) bucket versioning enable

[root@c83f0-app1 scriprts]# AWS_ACCESS_KEY_ID=pMy4T58Ls36PW482bLBD AWS_SECRET_ACCESS_KEY=coJ8LsApyle9vRRkhKKIkWvEd00FC1dfJsvNqs99 aws --endpoint https://172.20.100.60:6443/ --no-verify-ssl s3api put-bucket-versioning --bucket bucket5004 --versioning-configuration Status=Enabled urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host '172.20.100.60'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings [root@c83f0-app1 scriprts]# AWS_ACCESS_KEY_ID=pMy4T58Ls36PW482bLBD AWS_SECRET_ACCESS_KEY=coJ8LsApyle9vRRkhKKIkWvEd00FC1dfJsvNqs99 aws --endpoint https://172.20.100.61:6443/ --no-verify-ssl s3api get-bucket-versioning --bucket bucket5004 urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host '172.20.100.61'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings { "Status": "Enabled" }

4)run the next workload warp versioned --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --duration=1h --objects=10000 --concurrent=1000 --bucket="bucket5004" --insecure --tls

5) Now at the end it shows the results, but the key errors are still observed:

warp: download error: The specified key does not exist.
warp: stat error:The specified key does not exist.
warp: stat error:The specified key does not exist.
warp: Benchmark data written to "warp-versioned-2024-10-31[170327]-Sw4c.csv.zst"

Mixed operations.

Operation: DELETE, 10%, Concurrency: 1000, Ran 1h0m1s.

Operation: GET, 45%, Concurrency: 1000, Ran 1h0m1s. Errors:231

Operation: PUT, 15%, Concurrency: 1000, Ran 1h0m2s.

Operation: STAT, 30%, Concurrency: 1000, Ran 1h0m1s. Errors:144

Cluster Total: 0.74 MiB/s, 1295.32 obj/s, 375 errors over 1h0m0s. Total Errors:375.

I attached the log via Slack.

romayalon commented 3 weeks ago

@javieramirez1 we haven't backported this fix to 5.17.0 yet, so it makes sense that you still see this issue.