javieramirez1 closed this issue 4 weeks ago
@javieramirez1
@nadavMiz The RPM version is noobaa-core-5.17.0-20241012.el9.x86_64; it doesn't include your latest fix
I updated the noobaa rpm to this one: c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241015.el9.x86_64, c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241015.el9.x86_64, but I'm still seeing the same errors. Is the fix included in another rpm, or should this rpm have it?
I don't know if the fix is included. You can check the logs for a message such as:
Oct 13 09:14:12 tmtscalets-protocol-1 node[4035721]: 2024-10-13 09:14:12.877917 [PID-4035721/TID-4036208] [L1] FS::FSWorker::Execute: LinkFileAt _wrap->_path=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1 _wrap->_fd=24 _filepath=/ibm/fs1/teams/ceph-mye304ifvjqsigsyqh9eo8i3-1/myobj _should_not_override=0 took: 2.10294 ms
The key is that _should_not_override should be included in the message.
If the fix is included and increasing the number of retries doesn't help, or you are not sure, please attach the noobaa logs so we can investigate.
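If it helps, a quick way to check is to grep the noobaa logs for that field. A minimal sketch, assuming the logs land in /var/log/messages as they do later in this thread:

grep "LinkFileAt" /var/log/messages | grep -c "_should_not_override"
# a non-zero count suggests the fix is in this rpm; zero could also just mean
# no LinkFileAt calls were logged yet or that the log level is too low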
I updated the rpm to this one: c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241016.el9.x86_64
I added the change to the config and restarted s3, and it no longer blocks my warp workloads. (With the previous rpm, after adding the change, runs started with many "connection refused" errors, then ended with "connection reset by peer", and from then on any run I tried failed automatically like this: warp: <ERROR> Error preparing server. Get "https://172.20.100.62:6443/bucket53/?location=": Connection closed by foreign host https://172.20.100.62:6443/bucket53/?location=. Retry again.)
When the warp run I'm currently running finishes, I'll report the results and attach the log (I already enabled loglevel=all).
@javieramirez1 Looking at the logs you sent on Slack, I see two main issues:
There seem to be files where the version-id in the extended attributes doesn't match the file's name. This might indicate a corruption:
Oct 15 13:10:06 c83f2-dan10 [1639045]: [nsfs/1639045] [WARN] core.sdk.namespace_fs:: NamespaceFS.read_object_md: retrying retries=1 file_path=/gpfs/remote_rw_fs0/buckets_1/bucket50/zTYkb3SF/.versions/4.ehrPrMJnZikl5HrE.rnd_mtime-d4wjicrgv3sw-ino-15s2h [Error: No such file or directory] { code: 'ENOENT', context: 'Stat _path=/gpfs/remote_rw_fs0/buckets_1/bucket50/zTYkb3SF/.versions/4.ehrPrMJnZikl5HrE.rnd_mtime-d4wjicrgv3sw-ino-15s2h ' }
Oct 15 13:10:06 c83f2-dan10 [1639045]: [nsfs/1639045] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Resource>/bucket50/zTYkb3SF/4.ehrPrMJnZikl5HrE.rnd?versionId=mtime-d4wjicrgv3sw-ino-15s2h</Resource><RequestId>m2ap84um-65kshx-thy</RequestId></Error> GET /bucket50/zTYkb3SF/4.ehrPrMJnZikl5HrE.rnd?versionId=mtime-d4wjicrgv3sw-ino-15s2h {"host":"172.20.100.66:6443","user-agent":"MinIO (linux; amd64) minio-go/v7.0.66 warp/0.7.7","authorization":"AWS4-HMAC-SHA256 Credential=PylTLnQTwWAPKP71Ipmf/20241015/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=69e6cff9a23d8e9918046b0cf7144990655ffdf856cbcc3d64ded7df71364f47","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20241015T172025Z"} Error: No such file or directory - context: Stat _path=/gpfs/remote_rw_fs0/buckets_1/bucket50/zTYkb3SF/.versions/4.ehrPr
At the end of the log file there is a panic for too many open files:
Oct 15 23:55:49 c83f2-dan10 [2771795]: [nsfs/2771795] [ERROR] CONSOLE:: PANIC: process uncaughtException Error: EMFILE: too many open files, uv_resident_set_memory at process.memoryUsage (node:internal/process/per_thread:168:5) at Timeout.memory_monitor [as _onTimeout] (/usr/local/noobaa-core/src/util/panic.js:40:23) at listOnTimeout (node:internal/timers:573:17) at process.processTimers (node
This might be what's causing the "Connection closed by foreign host" error.
The issue is that the log file covers 12 hours, and it's hard to understand which error relates to which issue, and for which rpm. Also, when the errors for the first issue happen, the log level seems to be at its lowest, so I am missing a lot of logs.
Also, if the issue happens after 1 hour, there is no need to run it for 8 hours. Can you replicate the issue and send the logs again with full logs, and specify the start and end time of the warp run? You can set the logs to the highest level by running mms3 config change DEBUGLEVEL="all"
If you use the /var/log/messages logs, you can also run tail -f /var/log/messages > logs.txt before the warp command on the noobaa node, and abort it when the warp run finishes. This will store only the logs that were printed during the warp command in logs.txt, so the log file will be much smaller that way.
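Putting that together, a minimal sketch of the capture flow on the noobaa node; the mms3 and tail commands are the ones mentioned above, while the backgrounding and kill steps are only illustrative:

mms3 config change DEBUGLEVEL="all"      # raise the noobaa log level before the run
tail -f /var/log/messages > logs.txt &   # start capturing just before warp starts
TAIL_PID=$!
date                                     # note the warp start time
# ... run the warp workload from the client node ...
date                                     # note the warp end time
kill "$TAIL_PID"                         # stop capturing; logs.txt now covers only the run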
The following steps were performed:
1) rpm upgrade
c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241026.el9.x86_64
c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241026.el9.x86_64
2) restart s3
3) bucket versioning enable
AWS_ACCESS_KEY_ID=38MmE379Nxz54ELdjjUa AWS_SECRET_ACCESS_KEY=7Cciy3WHrMGKZbgBS6Sus5zhVlYz7rHPpkVIvUqW aws --endpoint https://172.20.100.60:6443 --no-verify-ssl s3api put-bucket-versioning --bucket bucket5001 --versioning-configuration Status=Enabled
[root@c83f0-app1 scriprts]# AWS_ACCESS_KEY_ID=38MmE379Nxz54ELdjjUa AWS_SECRET_ACCESS_KEY=7Cciy3WHrMGKZbgBS6Sus5zhVlYz7rHPpkVIvUqW aws --endpoint https://172.20.100.61:6443/ --no-verify-ssl s3api get-bucket-versioning --bucket bucket5001
urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host '172.20.100.61'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
{ "Status": "Enabled" }
4) run the next workload, adding the flag --stress:
warp versioned --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --duration=1h --stress --objects=10000 --concurrent=100 --bucket="bucket5001" --insecure --tls
After the workload completed at the end of the hour, strangely it did not show the final warp results, and they are not in the directory where they should be (the system has enough space). The noobaa log does not show anything about it.
I attached the log via Slack. The log was cleared so that its start is the start of the test.
After the last comment, the following steps were performed:
1) change the parameter NSFS_RENAME_RETRIES=1000 to NSFS_RENAME_RETRIES=100 (see the sketch after this list)
2) restart s3
3) retry the workload from comment 20, point 4
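For reference, a sketch of how steps 1 and 2 might look; it is an assumption that NSFS_RENAME_RETRIES can be set through the same mms3 config change mechanism used above for DEBUGLEVEL, and the exact s3 restart command is not shown in this thread:

mms3 config change NSFS_RENAME_RETRIES=100   # assumption: same mechanism as DEBUGLEVEL above
# then restart s3 (step 2) and rerun the workload from comment 20, point 4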
The same results were observed: when the workload finished after 1 hour, it did not show the final warp results, and they do not appear in the directory they should be in.
I attached the log via Slack. The log was cleared so that its start is the start of the test.
@javieramirez1 2 questions -
4) run the next workload, adding the flag --stress:
warp versioned --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --duration=1h --stress --objects=10000 --concurrent=100 --bucket="bucket5001" --insecure --tls
Oct 15 23:55:49 c83f2-dan10 [2771795]: [nsfs/2771795] [ERROR] CONSOLE:: PANIC: process uncaughtException Error: EMFILE: too many open files, uv_resident_set_memory at process.memoryUsage (node:internal/process/per_thread:168:5) at Timeout.memory_monitor [as _onTimeout] (/usr/local/noobaa-core/src/util/panic.js:40:23) at listOnTimeout (node:internal/timers:573:17) at process.processTimers (node
@javieramirez1 @nadavMiz
I downloaded the logs and couldn't find PANIC
less -R ../Downloads/Ver-100-1h-noobaa.log | grep PANIC
less -R ../Downloads/Ver-rename100-100-1h-noobaa.log | grep PANIC
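For completeness, grepping for the EMFILE string from the earlier panic message is another quick check; a sketch against the same log files:

grep -c "EMFILE" ../Downloads/Ver-100-1h-noobaa.log
grep -c "EMFILE" ../Downloads/Ver-rename100-100-1h-noobaa.log
# 0 on both would confirm the "too many open files" panic did not recur in these runs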
On the new logs the issue is different: we now get ENOENT when getting / HEAD-ing versions:
Oct 28 20:05:09 c83f2-dan10 [3852289]: [nsfs/3852289] [WARN] core.sdk.namespace_fs:: NamespaceFS.read_object_md: retrying retries=1 file_path=/gpfs/remote_rw_fs0/buckets_2/bucket5001/(T)85WK5/.versions/67.lXqtres9RNBuR4Xa.rnd_mtime-d57tem9337k0-ino-t2x5f [Error: No such file or directory] { code: 'ENOENT', context: 'Stat _path=/gpfs/remote_rw_fs0/buckets_2/bucket5001/(T)85WK5/.versions/67.lXqtres9RNBuR4Xa.rnd_mtime-d57tem9337k0-ino-t2x5f ' }
Oct 28 20:05:09 c83f2-dan10 [3852289]: [nsfs/3852289] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Resource>/bucket5001/%28T%2985WK5/67.lXqtres9RNBuR4Xa.rnd?versionId=mtime-d57tem9337k0-ino-t2x5f</Resource><RequestId>m2torxie-bpipd9-chr</RequestId></Error> HEAD /bucket5001/%28T%2985WK5/67.lXqtres9RNBuR4Xa.rnd?versionId=mtime-d57tem9337k0-ino-t2x5f {"host":"172.20.100.67:6443","user-agent":"MinIO (linux; amd64) minio-go/v7.0.66 warp/0.7.7","authorization":"AWS4-HMAC-SHA256 Credential=38MmE379Nxz54ELdjjUa/20241029/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=80af6395eae063609c5f37f5b141e9ed0843e4565ba856156154e6df29e9cf2b","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20241029T001449Z"} Error: No such file or directory - context: Stat _path=/gpfs/remote_rw_fs0/buckets_2/bucket5001/(T)85WK5/.versions/67.lXqtres9RNBuR4Xa.rnd_mtime-d57tem9337k0-ino-t2x5f
This indicates that the version is missing completely. The issue was reproduced on my environment. It happens when the version we try to delete is the latest version and the latest version is replaced before we delete it; we then delete the wrong version. I added a check that the version we try to delete is the intended one.
I ran warp on my system for 30 minutes with 1000 concurrency; there were no issues:
[root@tmtscalets-master warp]# ./warp versioned --host=10.11.87.62:6443 --access-key="3MwC7LPBgfQ1maEg4ZLN" --secret-key="auCOhJrI6znaabNZv3oNr4d4QltxWg3Z/sJodY1H" --obj.size=1k --concurrent=1000 --duration=30m --bucket=warp-bu5ket1 --insecure --tls --objects=10000
warp: Benchmark data written to "warp-versioned-2024-10-29[050626]-34vw.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 1000, Ran 28m38s.
* Throughput: 15.33 obj/s
Operation: GET, 45%, Concurrency: 1000, Ran 28m38s.
* Throughput: 0.07 MiB/s, 68.98 obj/s
Operation: PUT, 15%, Concurrency: 1000, Ran 28m38s.
* Throughput: 0.02 MiB/s, 22.89 obj/s
Operation: STAT, 30%, Concurrency: 1000, Ran 28m38s.
* Throughput: 46.03 obj/s
Cluster Total: 0.09 MiB/s, 153.09 obj/s over 28m38s.
warp: Cleanup Done.[root@tmtscalets-master warp]#
@javieramirez1 In the new logs, neither the panic nor the wrong version happened, and I couldn't reproduce them. If they happen again, please open a new bug for them.
hi @nadavMiz, one question: I saw that you didn't use the flag --stress. Should I try to reproduce it without that flag?
@javieramirez1 From the documentation, --stress is used for long-running tests. I am not sure if it matters, but you might as well add it for good measure, especially if you get failures after a long time.
The following steps were performed:
1) rpm upgrade
c83f2-dan10-hs200.test.net: noobaa-core-5.17.0-20241030.el9.x86_64
c83f2-dan8-hs200.test.net: noobaa-core-5.17.0-20241030.el9.x86_64
2) restart s3
3) bucket versioning enable
[root@c83f0-app1 scriprts]# AWS_ACCESS_KEY_ID=pMy4T58Ls36PW482bLBD AWS_SECRET_ACCESS_KEY=coJ8LsApyle9vRRkhKKIkWvEd00FC1dfJsvNqs99 aws --endpoint https://172.20.100.60:6443/ --no-verify-ssl s3api put-bucket-versioning --bucket bucket5004 --versioning-configuration Status=Enabled
urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host '172.20.100.60'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
[root@c83f0-app1 scriprts]# AWS_ACCESS_KEY_ID=pMy4T58Ls36PW482bLBD AWS_SECRET_ACCESS_KEY=coJ8LsApyle9vRRkhKKIkWvEd00FC1dfJsvNqs99 aws --endpoint https://172.20.100.61:6443/ --no-verify-ssl s3api get-bucket-versioning --bucket bucket5004
urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host '172.20.100.61'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
{ "Status": "Enabled" }
4) run the next workload:
warp versioned --host=172.20.100.6{0...9}:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --duration=1h --objects=10000 --concurrent=1000 --bucket="bucket5004" --insecure --tls
5) Now at the end it shows the results, but the key errors are still observed:
warp:
Operation: GET, 45%, Concurrency: 1000, Ran 1h0m1s. Errors:231
Operation: PUT, 15%, Concurrency: 1000, Ran 1h0m2s.
Operation: STAT, 30%, Concurrency: 1000, Ran 1h0m1s. Errors:144
Cluster Total: 0.74 MiB/s, 1295.32 obj/s, 375 errors over 1h0m0s. Total Errors:375.
I attached the log via Slack.
@javieramirez1 We didn't backport this fix to 5.17.0 yet, so it makes sense that you still see this issue.
Environment info
Actual behavior
Mixed operations.
Operation: DELETE, 10%, Concurrency: 1000, Ran 1h0m1s.
Operation: GET, 45%, Concurrency: 1000, Ran 1h0m1s. Errors:10081
Operation: PUT, 15%, Concurrency: 1000, Ran 1h0m1s.
Operation: STAT, 30%, Concurrency: 1000, Ran 1h0m1s. Errors:6922
Cluster Total: 0.43 MiB/s, 742.94 obj/s, 17003 errors over 1h0m0s. Total Errors:17003.