noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

NC | NSFS | Panic Printings Added + Try-Catch `memory_monitor` + Change Default Event Logs + Increase `LimitNOFILE` #8518

Closed: shirady closed this 1 week ago

shirady commented 1 week ago

Explain the changes

  1. Panic printings added - in the initial investigation of the issue mentioned below, we noticed that noobaa exited without any "ERROR" or "PANIC" printings. Therefore, based on previous experience (for example in the noobaa-cli): https://github.com/noobaa/noobaa-core/blob/fc266c3e274f37d73d35135bdcadbe4f3cab8e18/src/cmd/manage_nsfs.js#L86-L88, we changed the printing to use a write with a callback (from the function description, and see the sketch after the quote below):

"The optional callback parameter will be executed when the data is finally written out, which may not be immediately."

  2. Add a try-catch clause in the memory_monitor function - during the investigation of the issue mentioned below, we noticed cases where, with a high number of connections, noobaa went down due to an "EMFILE" error, and the error stack showed that the memory_monitor function threw it from process.memoryUsage(), which is a Node.js function (see the sketch after the stack trace below). An example error stack:

    Nov-11 8:14:10.250 [nsfs/1901450] [ERROR] CONSOLE:: memory_monitor got an error Error: EMFILE: too many open files, uv_resident_set_memory
        at process.memoryUsage (node:internal/process/per_thread:168:5)
        at Timeout.memory_monitor [as _onTimeout] (/usr/local/noobaa-core/src/util/panic.js:51:27)
        at listOnTimeout (node:internal/timers:573:17)
        at process.processTimers (node:internal/timers:514:7)
      { errno: -24, code: 'EMFILE', syscall: 'uv_resident_set_memory' }
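A minimal sketch of the guard (simplified, not the exact panic.js code):

```js
// Sketch: process.memoryUsage() calls uv_resident_set_memory(), which can
// fail with EMFILE when the process runs out of file descriptors, so the
// monitor catches the error instead of crashing on an uncaught exception.
function memory_monitor() {
    try {
        const usage = process.memoryUsage();
        console.log(`memory_monitor: rss=${usage.rss} heapUsed=${usage.heapUsed}`);
    } catch (err) {
        console.error('memory_monitor got an error', err);
    }
}
setInterval(memory_monitor, 10000).unref(); // unref() so the timer never keeps the process alive
```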

  3. Change the default event logging - we default it to false and enable it only on NC NSFS. We wanted to add an event in panic.js only for the main process of NC NSFS, and we also noticed that currently some of the "EVENTS" are printed in the endpoint pod, where they might not be accurate (for example, the "host" field contains the pod's name):

[Endpoint/12] [EVENT]{"timestamp":"2024-10-31T06:02:17.105Z","host":"noobaa-endpoint-6f9745857-6dznn","event":{"code":"noobaa_started","message":"Noobaa started","description":"Noobaa started running","entity_type":"NODE","event_type":"STATE_CHANGE","scope":"NODE","severity":"INFO","state":"HEALTHY","pid":12}}

We also removed the message "Event logging not enabled" in case the events are not enabled (to avoid printing it in a containerized environment); a sketch of the gating follows.
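A hypothetical sketch of the gating (the real config flag name may differ):

```js
const os = require('os');

// Sketch: events default to off and are enabled explicitly only on NC NSFS;
// when disabled we return silently, with no "Event logging not enabled" message.
const config = { EVENT_LOGGING_ENABLED: false };

function log_event(event) {
    if (!config.EVENT_LOGGING_ENABLED) return;
    const entry = { timestamp: new Date().toISOString(), host: os.hostname(), event };
    console.log('[EVENT]', JSON.stringify(entry));
}
```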

  4. Increase LimitNOFILE in noobaa.service from 2^16 (65536) to 600k, to allow more open file descriptors - used for reading files, websockets, etc.

Issues: Related to issue #8471

  1. Currently, with a high number of connections to a single node, noobaa might go down due to "EMFILE", probably because more FDs are open while memory_monitor runs, and we want to avoid those cases.

GAPS:

  1. We still don't know in which cases the number of FDs on the noobaa service main process (ls -al /proc/<PID number>/fd | wc -l) increases - using --concurrent=1000, there are cases with no major change and cases where it climbs to high numbers.
  2. We might want to change the events printed on s3 failures to be printed periodically, to better manage our log printings.
  3. Add the noobaa crash as an event in case it comes from the panic function.

Testing Instructions:

A. On a GPFS machine we used a case known to lead to a crash - with LimitNOFILE=15536 on the service and warp with --concurrent=3000 and --duration=10:

  1. Without the try-catch in memory_monitor - noobaa went down and we could see the "PANIC" printings and the "connect: connection refused" error; the number of FDs on the noobaa service main process (ls -al /proc/<PID number>/fd | wc -l) was stuck around the defined LimitNOFILE limit, and after we killed the warp process it went back to its initial number.
  2. With the try-catch in memory_monitor - noobaa was not down; we could see the "memory_monitor got an error Error: EMFILE: too many open files, uv_resident_set_memory" printings, but had warp issues of "connection reset by peer" (the test ended with errors because not all requests were served), and the number of FDs on the noobaa service main process (ls -al /proc/<PID number>/fd | wc -l) went back to its initial number.

B. On a GPFS cluster, change LimitNOFILE to 600k:

  1. Run systemctl edit noobaa and add the line LimitNOFILE=600000 (the resulting drop-in file is shown after this list).
  2. Run systemctl daemon-reload.
  3. Restart the service in the cluster; we used mms3 config change DEBUGLEVEL="all".
  4. You can verify with systemctl show noobaa | grep LimitNOFILE.
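For reference, step 1 creates a systemd drop-in override; assuming the default path, it should look like this:

```ini
# /etc/systemd/system/noobaa.service.d/override.conf (created by `systemctl edit noobaa`)
[Service]
LimitNOFILE=600000
```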

C. On Containerized:

  1. Build the images and install a NooBaa system on Rancher Desktop (see guide).
  2. Check in the endpoint logs that you don't see "EVENT" printings.
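One way to check, assuming the endpoint deployment is named noobaa-endpoint as in the log line above:

```sh
# Should print nothing now that event logging defaults to false in containers.
kubectl logs deployment/noobaa-endpoint -n <namespace> | grep '\[EVENT\]'
```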