noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

NC | NSFS | Number of FDs of NooBaa Main Process #8542

Open shirady opened 2 days ago

shirady commented 2 days ago

Environment info

Actual behavior

  1. When running a warp `get` test against 1 node, with NooBaa configured with 32 forks, and monitoring the number of file descriptors (FDs) of the main process, we can see that in some runs the count climbs to high numbers (for example, in one run I saw 171,910). After the run finishes (or if we kill the process), the count returns to the initial number of 56. In the past this number was limited to 65,536, and we increased the limit (see PR #8518). However, there are also runs where the number is unchanged and stays at 56 for the whole run.

Expected behavior

  1. We might want to investigate and understand the source of that behavior.

Steps to reproduce

Copied from issue #8471 (see this comment):

From a node in the cluster that runs noobaa:

  1. Edit the config.json at path /ibm/fs1/cessharedroot/ces/s3-config/config.json (e.g. `vi /ibm/fs1/cessharedroot/ces/s3-config/config.json`) and add the values: `"UV_THREADPOOL_SIZE": 16` and `"ENDPOINT_FORKS": 32`. Note: without this change, the run fails with a timeout error during warp's preparation step.
  2. Run `mms3 config change DEBUGLEVEL="all"` for the restart.
  3. Run the script run_counter_fd.sh below with `./run_counter_fd.sh <main PID>` (the `<main PID>` is the main process ID from `system status noobaa`).
    
```bash
#!/bin/bash

echo $0 $1

# monitor open file descriptor count every 10 seconds
while true; do
  ls -al /proc/$1/fd | wc -l
  date
  sleep 10
done
```

Note: you can also use `lsof -c noobaa` and try to analyze the output (we saw many cases of TCP sockets in state `CLOSE_WAIT`).

**From a client node:**
5. Create an account:  `noobaa-cli account add --name warp-shira --new_buckets_path /ibm/fs1/teams/ --uid 1001 --gid 1001 --fs_backend GPFS`
6. Create the alias for the account (based on the existing account):
`alias s3-u1='AWS_ACCESS_KEY_ID=<> AWS_SECRET_ACCESS_KEY=<> aws --no-verify-ssl --endpoint <ip-address-node>'`
Check the connection (by trying to list the buckets of the account): `s3-u1 s3 ls; echo $?`
7. Run the warp command: `cd warp;`
`./warp get --host=<ip-address-node> --access-key="<>" --secret-key="<>" --obj.size=1k --concurrent=1000 --duration=30m --bucket=bsw-01 --insecure --tls` (I ran it with 1 host)

### More information - Screenshots / Logs / Other output
Attached are 2 partial runs (they are partial due to a connection problem on my side, not related to the issue):
1. Without any change to the FD number.
2. Increasing number of FDs.
[open_fds_02.txt](https://github.com/user-attachments/files/17830019/open_fds_02.txt)
[open_fds_01.txt](https://github.com/user-attachments/files/17830021/open_fds_01.txt)
shirady commented 2 days ago

Hi @dannyzaken @guymguym, in case you want to add something to the discussion: let me know if you think it is worth continuing this investigation, and if you have suggestions for things we can test to improve our knowledge of internal Node.js behavior, etc.

@romayalon @nadavMiz, please let me know if you think I missed any detail (I can either edit the description or add a comment).