javieramirez1 opened this issue 2 months ago
@javieramirez1 For efficient troubleshooting, we would need -
@romayalon Is more info required from my side?
Hey @javieramirez1, thanks for the info provided on Slack. Updating here as well: it looks like the underlying file system (GPFS) first throws an ENOSPC error (No space left on device) and later throws ENOENT on the shared root path, as well as ENOENT on other object file paths. The missing shared root causes NooBaa to crash, and after a while NooBaa comes up again. Please check the file system logs around the crash timestamp.
Updating regarding the Internal Error - it looks like a bug, but it doesn't seem related to the connection refused. It happened because the request was anonymous; we support anonymous requests on master but not on 5.15.4. Also, while trying to reproduce the "cancelled due to ctime change" error, I saw in the logs that it comes from PUT/DELETE/GET, and the stat() that failed was on the bucket path. CC: @guymguym
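For context, a rough shell analogue of the ctime-based cancellation described above (illustrative only - the real check is a stat() inside NooBaa's NSFS code; the temp directory here just stands in for the bucket path):

```shell
# Illustrative analogue (not NooBaa code): a request is cancelled when the
# watched path's ctime changes between two stat() calls.
p=$(mktemp -d)
before=$(stat -c %Z "$p")   # ctime, seconds since epoch (GNU stat)
sleep 1
touch "$p/obj"              # creating an entry updates the directory's ctime
after=$(stat -c %Z "$p")
if [ "$before" != "$after" ]; then
  result="cancelled due to ctime change"
else
  result="no ctime change"
fi
echo "$result"
rm -rf "$p"
```

Any write into the bucket directory between the two stat() calls flips the ctime, which is why the error shows up under concurrent PUT/DELETE load.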
@romayalon I keep seeing those connection refused errors now on Power arch. Adding noobaa.log.
These errors occur when I'm trying to use 4k connections. I'm not sure why, but on the CES node you can see the connections increasing in large quantities, far beyond what was planned, reaching up to 28k; after that the connection refused errors are seen.
@javieramirez1 Did you take these logs at the time of the connection refused errors? This log file only covers 20 minutes, from Jul 16 09:29:21 to Jul 16 09:45:49.
@romayalon Yes, the warp runs I did lasted 10 minutes each.
@javieramirez1 the ctime fix was merged to master, could you please verify the fix?
The defect was reviewed with the side build that the NooBaa team provided me on spectre16-ib (noobaa-core-5.18.0-20240818.el9.x86_64), following the steps they also provided for the s3 cleanup, which completed fully. With the cleanup there were no issues with the installation, only some dependency problems (only the dependency rpms got installed), and those were solved. The corresponding changes were made in the mms3 config (DEBUGLEVEL, ENDPOINT_FORKS, UVTHREADPOOLSIZE),
and 2 workloads were performed, one of 15 minutes and another of 1 hour, with the following specification:
warp mixed --host=9.11.137.136:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --objects=2000 --duration=60m --disable-multipart --concurrent=1000 --bucket="warp-new-bucket-aug13-put13$i" --insecure --tls
For both the 15-minute and the 1-hour duration, 4 instances of warp were run on a single protocol node to reach the 4k connections. As a first observation, there were no more problems in this cluster when reaching 4k connections; but the side build was, in theory, meant to fix the issue of anonymous requests, and those are still observed.
warp 15mins:
(Wed Aug 21 14:46:08) spectre3:~/javi # ./warpx86 5000 5000 4 172.16.15.140
warp:
warp 1h:
(Wed Aug 21 15:11:01) spectre3:~/javi # ./warpx86 5000 5000 4 172.16.15.140
warp:
Another issue observed was that the noobaa logs are not showing anything; on the node that the CES IP in use points to, the log is completely empty:
[root@spectre13 log]# mmces address list | grep spectre13
10.18.56.34 spectre13-ib none none
9.11.137.136 spectre13-ib none none
[root@spectre13 log]# cat noobaa.log
[root@spectre13 log]#
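One hedged thing to check here, using the same mmces/mmdsh commands shown in this thread (the /var/log/noobaa.log path is an assumption): CES IPs can fail over between nodes, so the node whose log you are reading may not be the one that actually served the requests. A quick sketch comparing log sizes across all CES nodes:

```shell
# Sketch: confirm which node currently hosts the CES IP, then compare
# noobaa.log sizes on every CES node. Log path is an assumption;
# mmces/mmdsh only exist on a CES cluster, hence the guard.
CES_IP=9.11.137.136
if command -v mmces >/dev/null 2>&1; then
  mmces address list | grep "$CES_IP"
  mmdsh -N cesnodes "wc -c /var/log/noobaa.log"
  result="checked CES nodes"
else
  result="mmces not available - run this on a CES node"
fi
echo "$result"
```

If a different node shows a non-empty log, the traffic was served there after an IP failover rather than on spectre13.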
[root@spectre8 ~]# mmdsh -N cesnodes rpm -qa | grep noobaa
spectre12-ib: noobaa-core-5.18.0-20240818.el8.x86_64
spectre8-ib: noobaa-core-5.18.0-20240818.el9.x86_64
spectre16-ib: noobaa-core-5.18.0-20240818.el9.x86_64
spectre10-ib: noobaa-core-5.18.0-20240818.el8.x86_64
spectre9-ib: noobaa-core-5.18.0-20240818.el9.x86_64
spectre11-ib: noobaa-core-5.18.0-20240818.el9.x86_64
spectre13-ib: noobaa-core-5.18.0-20240818.el9.x86_64
spectre6-ib: noobaa-core-5.18.0-20240818.el8.x86_64
spectre14-ib: noobaa-core-5.18.0-20240818.el8.x86_64
spectre15-ib: noobaa-core-5.18.0-20240818.el8.x86_64
[root@spectre8 ~]# mmdsh -N cesnodes rpm -qa | grep s3
spectre16-ib: gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre15-ib: gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre8-ib: gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre9-ib: gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre13-ib: gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre11-ib: gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre14-ib: gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre12-ib: gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre10-ib: gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre6-ib: gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
@javieramirez1
> cancelled due to ctime change

@romayalon
> cancelled due to ctime change
@javieramirez1 Can I get access to your machine to check the original issue? 3 - yes, for allowing anonymous requests to succeed.
@javieramirez1 Do you use the --noclear flag when running 4 instances of warp at the same time? warp cleans the objects during the preparing stage. @nadavMiz and I tried it, and when not using --noclear we do see your error, but that makes sense because the second warp run deletes the objects. Please try running again without clearing the objects by using --noclear.
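A minimal sketch of the suggested rerun, reusing the host, credentials, and bucket prefix from the earlier runs (all taken from this thread). It only prints the four commands rather than executing them, so the shape of the parallel run can be checked first:

```shell
# Sketch: build the four warp invocations (4 x --concurrent=1000 ~= 4k
# connections), each with --noclear so one instance does not delete the
# objects prepared by another. Printing only; remove the echo indirection
# (and keep the trailing '&' plus a final 'wait') to launch them for real.
cmds=""
for i in 1 2 3 4; do
  cmds="$cmds
warp mixed --host=9.11.137.136:6443 --access-key=\$access_key --secret-key=\$secret_key --obj.size=1k --objects=2000 --duration=60m --disable-multipart --concurrent=1000 --bucket=warp-new-bucket-aug13-put13$i --insecure --tls --noclear &"
done
echo "$cmds"
echo wait
```

The trailing `&` on each line plus the final `wait` is what runs all four instances concurrently from one shell, matching the 4-instance setup described earlier.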
Environment info
Actual behavior
noobaa.log:
2024-06-07T06:13:29.739738-04:00 c83f2-dan9 node[3192156]: [nsfs/3192156] [ERROR] CONSOLE:: nsfs: exit on error Error: failed to create NSFS system data due to - ENOENT: no such file or directory, unlink '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3mkp-458a72e6'
    at init_nsfs_system (/usr/local/noobaa-core/src/cmd/nsfs.js:233:23)
    at async main (/usr/local/noobaa-core/src/cmd/nsfs.js:320:27)
2024-06-07T06:13:34.236337-04:00 c83f2-dan9 node[3192178]: [nsfs/3192178] [ERROR] CONSOLE:: failed to create NSFS system data due to - ENOENT: no such file or directory, unlink '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3q1h-fbdf2057'
    [Error: ENOENT: no such file or directory, unlink '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3q1h-fbdf2057'] { errno: -2, code: 'ENOENT', syscall: 'unlink', path: '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3q1h-fbdf2057' }
Expected behavior
Steps to reproduce
More information - Screenshots / Logs / Other output