noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

Warp with small objects #8159

Open javieramirez1 opened 2 months ago

javieramirez1 commented 2 months ago

Environment info

Actual behavior

  1. When trying to run 100 warp instances (25 per app node) in order to use 100 accounts (one per instance), with 20 connections per instance (2,000 in total) against small objects, these errors were seen:

noobaa.log:

2024-06-07T06:13:29.739738-04:00 c83f2-dan9 node[3192156]: [nsfs/3192156] [ERROR] CONSOLE:: nsfs: exit on error Error: failed to create NSFS system data due to - ENOENT: no such file or directory, unlink '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3mkp-458a72e6' at init_nsfs_system (/usr/local/noobaa-core/src/cmd/nsfs.js:233:23) at async main (/usr/local/noobaa-core/src/cmd/nsfs.js:320:27)
2024-06-07T06:13:34.236337-04:00 c83f2-dan9 node[3192178]: [nsfs/3192178] [ERROR] CONSOLE:: failed to create NSFS system data due to - ENOENT: no such file or directory, unlink '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3q1h-fbdf2057' [Error: ENOENT: no such file or directory, unlink '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3q1h-fbdf2057'] { errno: -2, code: 'ENOENT', syscall: 'unlink', path: '/gpfs/remote_rw_cessharedroot/ces/s3-config/system.json.lx4j3q1h-fbdf2057' }
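
A quick sanity check one might run when this error appears, to confirm the shared-root S3 config path is still reachable from the node (the path is taken from the error message above; the check itself is not part of the original report):

# Verify the CES shared-root S3 config directory is mounted and accessible.
ls -ld /gpfs/remote_rw_cessharedroot/ces/s3-config
ls -l /gpfs/remote_rw_cessharedroot/ces/s3-config/system.json*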

Expected behavior

  1. All 100 warp instances complete successfully.

Steps to reproduce

  1. Run 100 warp instances with 20 connections each (a launcher sketch follows).
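
A minimal launcher sketch for this reproduction, reusing the warp flags that appear later in this thread; the endpoint, the ACCESS_KEYS/SECRET_KEYS arrays (one credential pair per account), and the bucket naming are illustrative assumptions:

# Hypothetical launcher: 100 warp instances, 20 connections each (~2,000 in total).
for i in $(seq 1 100); do
  warp mixed \
    --host=9.11.137.136:6443 --tls --insecure \
    --access-key="${ACCESS_KEYS[$i]}" --secret-key="${SECRET_KEYS[$i]}" \
    --obj.size=1k --objects=2000 --duration=10m \
    --disable-multipart --concurrent=20 \
    --bucket="warp-small-obj-$i" &
done
wait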

More information - Screenshots / Logs / Other output

romayalon commented 2 months ago

@javieramirez1 For efficient troubleshooting, we would need -

  1. Reproduction steps.
  2. Full NooBaa service logs with the debug level set to nsfs.
  3. config.json file content (a collection sketch follows this list).
  4. S3 CLI command (if the failure originates from S3 flow).
  5. S3 CLI response (if the failure originates from S3 flow).
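
A rough collection sketch for items 2 and 3, assuming the CES S3 configuration directory is the one that appears in the error above (/gpfs/remote_rw_cessharedroot/ces/s3-config) and that config.json sits alongside system.json there; adjust the paths to your deployment:

# Paths below are assumptions based on the error log in this issue.
CONF_DIR=/gpfs/remote_rw_cessharedroot/ces/s3-config
cat "$CONF_DIR/config.json"                                                        # item 3: config.json content
journalctl --since "1 hour ago" --no-pager | grep -i noobaa > noobaa-journal.log   # item 2: service logs
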
javieramirez1 commented 2 months ago

@romayalon Is any more info required from my side?

romayalon commented 2 months ago

Hey @javieramirez1, thanks for the info provided on Slack. Updating here as well: it looks like the underlying file system (GPFS) first throws an ENOSPC error (No space left on device), then throws ENOENT on the shared root path, and also throws ENOENT on other object file paths. The missing shared root causes NooBaa to crash, and after a while NooBaa comes up again. Please check the file system logs around the crash timestamp.
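
A short sketch of how one might check free space and the Storage Scale logs around the crash timestamp (the file system name is an assumption derived from the path in the error; verify command syntax on your cluster):

# "remote_rw_cessharedroot" below is an assumed file system name; substitute your own.
df -h /gpfs/remote_rw_cessharedroot                              # quick free-space check on the shared root
mmdf remote_rw_cessharedroot                                     # GPFS per-pool capacity and free space
grep -iE 'ENOSPC|No space|ENOENT' /var/adm/ras/mmfs.log.latest   # GPFS log entries around the crash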

romayalon commented 2 months ago

Updating regarding the Internal Error - it looks like a bug, but it doesn't seem related to the connection refused errors. It happened because the request was anonymous; we support anonymous requests on master but not on 5.15.4. Also, while trying to reproduce the "cancelled due to ctime change" error, I saw in the logs that it comes from PUT/DELETE/GET, and the stat() that failed was on the bucket path. CC: @guymguym

javieramirez1 commented 1 month ago

@romayalon I keep seeing those connection refused errors, now on the Power architecture. Attaching noobaa.log.

javieramirez1 commented 1 month ago

These errors occur when I try to use 4k connections. I'm not sure why, but on the CES node the number of connections grows far beyond what was planned, reaching up to 28k; after that, the connection refused errors are seen.
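
One way to watch the connection count on the CES node while warp runs (the S3 endpoint port 6443 is taken from the warp command later in this thread; this sketch is not part of the original report):

# Count established TCP connections to the S3 endpoint port (assumed 6443).
watch -n 5 "ss -Htan state established '( sport = :6443 )' | wc -l"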

romayalon commented 1 month ago

@javieramirez1 Did you take these logs at the time of the connection refused errors? This log file covers only about 20 minutes, from Jul 16 09:29:21 to Jul 16 09:45:49.

javieramirez1 commented 1 month ago

@romayalon Yes, the warp runs I did lasted 10 minutes each.

romayalon commented 1 month ago

@javieramirez1 the ctime fix was merged to master, could you please verify the fix?

javieramirez1 commented 2 weeks ago

The defect was retested with the side build the NooBaa team provided (spectre16-ib: noobaa-core-5.18.0-20240818.el9.x86_64), following the S3 cleanup steps they also provided. The cleanup completed fully and the installation had no issues, apart from some dependency problems that were resolved by installing the dependency RPMs. The corresponding changes were made in the mms3 config (DEBUGLEVEL, ENDPOINT_FORKS, UVTHREADPOOLSIZE).

Two workloads were performed, one of 15 minutes and another of 1 hour, with the following specification:

warp mixed --host=9.11.137.136:6443 --access-key="$access_key" --secret-key="$secret_key" --obj.size=1k --objects=2000 --duration=60m --disable-multipart --concurrent=1000 --bucket="warp-new-bucket-aug13-put13$i" --insecure --tls

For both the 15-minute and the 1-hour runs, 4 warp instances were run on a single protocol node to reach the 4k connections (4 instances × --concurrent=1000).

As a first observation, there were no more problems in this cluster when reaching 4k connections; however, the side build was supposed to fix the anonymous requests issue, which is still observed.

warp 15mins:

(Wed Aug 21 14:46:08) spectre3:~/javi # ./warpx86 5000 5000 4 172.16.15.140

warp: upload error:The specified key does not exist.
warp: upload error:The specified key does not exist.

warp 1h:

(Wed Aug 21 15:11:01) spectre3:~/javi # ./warpx86 5000 5000 4 172.16.15.140

warp: upload error:The specified key does not exist.
warp: upload error:The specified key does not exist.

Another issue observed was that the noobaa logs are not showing anything; noobaa.log on the node that the CES IP in use points to is completely empty:

[root@spectre13 log]# mmces address list | grep spectre13
10.18.56.34    spectre13-ib   none        none
9.11.137.136   spectre13-ib   none        none
[root@spectre13 log]# cat noobaa.log
[root@spectre13 log]#

[root@spectre8 ~]# mmdsh -N cesnodes rpm -qa |grep noobaa
spectre12-ib:  noobaa-core-5.18.0-20240818.el8.x86_64
spectre8-ib:  noobaa-core-5.18.0-20240818.el9.x86_64
spectre16-ib:  noobaa-core-5.18.0-20240818.el9.x86_64
spectre10-ib:  noobaa-core-5.18.0-20240818.el8.x86_64
spectre9-ib:  noobaa-core-5.18.0-20240818.el9.x86_64
spectre11-ib:  noobaa-core-5.18.0-20240818.el9.x86_64
spectre13-ib:  noobaa-core-5.18.0-20240818.el9.x86_64
spectre6-ib:  noobaa-core-5.18.0-20240818.el8.x86_64
spectre14-ib:  noobaa-core-5.18.0-20240818.el8.x86_64
spectre15-ib:  noobaa-core-5.18.0-20240818.el8.x86_64

[root@spectre8 ~]# mmdsh -N cesnodes rpm -qa |grep s3
spectre16-ib:  gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre15-ib:  gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre8-ib:  gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre9-ib:  gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre13-ib:  gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre11-ib:  gpfs.mms3-5.2.1-0.240722.155845.el9.x86_64
spectre14-ib:  gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre12-ib:  gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre10-ib:  gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64
spectre6-ib:  gpfs.mms3-5.2.1-0.240722.155845.el8.x86_64

romayalon commented 2 weeks ago

@javieramirez1

  1. So you don't see the original issue anymore?
  2. Do you see the ctime error (cancelled due to ctime change) in the logs?
  3. Anonymous requests - you first need to create an "anonymous" user in your system for anonymous requests to work - https://github.com/noobaa/noobaa-core/blob/master/docs/NooBaaNonContainerized/S3Ops.md#anonymous-requests-support - please follow the steps in the link.
  4. We no longer write directly to rsyslog (noobaa.log); you can check the NooBaa logs using the journal - https://github.com/noobaa/noobaa-core/blob/master/docs/NooBaaNonContainerized/Logging.md#journal-logs (a sketch follows this list).
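
A quick sketch for item 4, pulling NooBaa output from the systemd journal; the unit name can vary between deployments, so this discovers it first and otherwise filters on the string "noobaa" (the filter is an assumption - see the linked Logging doc for the canonical commands):

systemctl list-units | grep -i noobaa                         # discover the NooBaa unit name on this node
journalctl --since "30 min ago" --no-pager | grep -i noobaa   # recent NooBaa journal entries
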
javieramirez1 commented 2 weeks ago

@romayalon

  1. Yes, I still see the original error on warp.
  2. No, in the noobaa logs I don't see the cancelled due to ctime change error.
  3. Is it necessary to create anonymous requests?
  4. I will keep checking the noobaa log in more detail with the journal, thanks.
romayalon commented 1 week ago

@javieramirez1 Can I get access to your machine to check the original issue? Regarding 3 - yes, it is needed for anonymous requests to succeed.

romayalon commented 1 week ago

@javieramirez1 Do you use the --noclear flag when running 4 instances of warp at the same time? warp cleans the objects in the prepare stage; @nadavMiz and I tried it, and when not using --noclear we do see your error, but that makes sense because the second warp run deletes the objects. Please try running again without clearing the objects by using --noclear (a sketch follows).
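
For example, a sketch of the command from the earlier comment with --noclear added, so that concurrent runs do not delete each other's objects (endpoint, credentials, and bucket naming are taken from that command; nothing else is changed):

warp mixed --host=9.11.137.136:6443 \
  --access-key="$access_key" --secret-key="$secret_key" \
  --obj.size=1k --objects=2000 --duration=60m \
  --disable-multipart --concurrent=1000 \
  --bucket="warp-new-bucket-aug13-put13$i" \
  --insecure --tls --noclear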