"The optional callback parameter will be executed when the data is finally written out, which may not be immediately."
Explain the changes
1. Add a try-catch clause in the memory_monitor function - while investigating the issue mentioned below, we noticed cases where, under a high number of connections, noobaa went down due to an "EMFILE" error, and the error stack showed that the memory_monitor function threw it from process.memoryUsage(), which is a Node.js function (a minimal sketch of the guard follows the stack trace). An example error stack:
Nov-11 8:14:10.250 [nsfs/1901450] [ERROR] CONSOLE:: memory_monitor got an error Error: EMFILE: too many open files, uv_resident_set_memory
at process.memoryUsage (node:internal/process/per_thread:168:5)
at Timeout.memory_monitor [as _onTimeout] (/usr/local/noobaa-core/src/util/panic.js:51:27)
at listOnTimeout (node:internal/timers:573:17)
at process.processTimers (node:internal/timers:514:7) {
errno: -24,
code: 'EMFILE',
syscall: 'uv_resident_set_memory'
}
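A minimal sketch of the guard, assuming a periodic monitor similar to the one in src/util/panic.js; the interval value, function body, and log wording are illustrative, not the exact noobaa code:

'use strict';

const MEMORY_MONITOR_INTERVAL_MS = 10000; // hypothetical interval

function memory_monitor() {
    try {
        // process.memoryUsage() calls uv_resident_set_memory(), which can
        // throw EMFILE when the process has no free file descriptors.
        const usage = process.memoryUsage();
        console.log('memory_monitor: rss=' + usage.rss + ' heapUsed=' + usage.heapUsed);
    } catch (err) {
        // Log and keep the timer alive instead of crashing the process.
        console.error('memory_monitor got an error', err);
    }
}

setInterval(memory_monitor, MEMORY_MONITOR_INTERVAL_MS).unref();

With a guard like this in place, an FD spike produces an error line instead of taking the whole process down.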
2. Change the default event logging - we default it to false and enable it only on NC NSFS. We wanted to add an event in panic.js only for the main process of NC NSFS, and we also noticed that some "EVENT" lines are currently printed in the endpoint pod, where they might not be accurate (for example, the "host" field contains the pod's name):
[Endpoint/12] [EVENT]{"timestamp":"2024-10-31T06:02:17.105Z","host":"noobaa-endpoint-6f9745857-6dznn","event":{"code":"noobaa_started","message":"Noobaa started","description":"Noobaa started running","entity_type":"NODE","event_type":"STATE_CHANGE","scope":"NODE","severity":"INFO","state":"HEALTHY","pid":12}}
We also remove the "Event logging not enabled" message when events are not enabled, to avoid printing it in a containerized environment.
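A hedged sketch of the config-gated event logging; the flag name and logger shape are assumptions for illustration, not the exact noobaa config keys:

'use strict';

const os = require('os');

const config = {
    EVENT_LOGGING_ENABLED: false, // default false; only NC NSFS flips this to true
};

function log_event(event) {
    // Skip silently - no "Event logging not enabled" message in containers.
    if (!config.EVENT_LOGGING_ENABLED) return;
    console.log('[EVENT]' + JSON.stringify({
        timestamp: new Date().toISOString(),
        host: os.hostname(),
        event,
    }));
}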
3. Increase LimitNOFILE in noobaa.service from 2^16 (65,536) to 600k, raising the limit on open file descriptors (used for reading files, websockets, etc.).
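For reference, a hedged excerpt of the changed unit; the [Service] header is standard systemd syntax and the other directives in noobaa.service are omitted:

[Service]
LimitNOFILE=600000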
Issues: Related to issue #8471
Currently, with a high number of connections to a single node, noobaa might go down due to "EMFILE", probably because more FDs are open while memory_monitor runs, and we want to avoid those cases.
GAPS:
We still don't know in which cases the number of FDs on the noobaa service main process (ls -al /proc/<PID number>/fd | wc -l) increases - using --concurrent=1000 there are cases with no major change, and cases where it climbs to high numbers.
We might want to print the events for S3 failures periodically, to manage our log output better.
Add the noobaa crash as an event when it comes from the panic function.
Testing Instructions:
A. On a GPFS machine we used a case known to lead to a crash - LimitNOFILE=15536 on the service and warp with --concurrent=3000 and --duration=10 (a hypothetical warp invocation is sketched after this subsection):
Without the try-catch in memory_monitor - noobaa went down and we could see the "PANIC" printings and the "connect: connection refused" error; the number of FDs on the noobaa service main process (ls -al /proc/<PID number>/fd | wc -l) was stuck around the configured LimitNOFILE, and after we killed the warp process it returned to its initial number.
With the try-catch in memory_monitor - noobaa stayed up and we could see the "memory_monitor got an error Error: EMFILE: too many open files, uv_resident_set_memory" printings, but warp hit "connection reset by peer" issues (the test ended with errors because not all requests were served), and the number of FDs on the noobaa service main process (ls -al /proc/<PID number>/fd | wc -l) returned to its initial number.
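For context, a hypothetical shape of the warp run above; the subcommand, endpoint, and credentials are placeholders, and the duration unit (10m) is an assumption since the run above only says --duration=10:

warp mixed --host=<s3-endpoint>:<port> \
    --access-key=<access-key> --secret-key=<secret-key> \
    --concurrent=3000 --duration=10m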
B. On a GPFS cluster, change LimitNOFILE to 600k (a sketch of the resulting drop-in follows these steps):
Run systemctl edit noobaa and add the line LimitNOFILE=600000.
Run systemctl daemon-reload.
Restart the service in the cluster; we use mms3 config change DEBUGLEVEL="all".
You can check with systemctl show noobaa | grep LimitNOFILE.
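systemctl edit noobaa opens a drop-in file (conventionally /etc/systemd/system/noobaa.service.d/override.conf); after saving, it should contain:

[Service]
LimitNOFILE=600000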
C. On Containerized:
Build the images and install a NooBaa system on Rancher Desktop (see guide).
Check in the endpoint logs that you don't see "EVENT" printings.
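A hypothetical way to run that check; the namespace and deployment name are placeholders:

# Expect no matching lines after this change.
kubectl logs -n <namespace> deployment/noobaa-endpoint | grep -F '[EVENT]'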