rkomandu opened 1 week ago
@shirady can you please take a look?
Hi @rkomandu, could you please reproduce and provide logs with a higher debug level? I had issues with the GPFS machine and couldn't reproduce a whole test run.
At the moment I'm looking for a level-1 print from here: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/src/util/buffer_utils.js#L205
As you can see from the error stack:
[ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
It comes on a GET request; in NSFS it is `read_object_stream`, after we tried to `get_buffer` (a buffer from memory to execute the read operation).
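For context, the flow looks roughly like the sketch below. This is a minimal illustration of the borrow/read/release pattern, not the actual namespace_fs.js code; the `{ buffer, callback }` shape returned by `get_buffer` and the helper name `read_object_range` are assumptions here.

```js
// Minimal sketch (illustrative, not the actual noobaa-core code):
// borrow a pooled buffer, read into it, release it back when done.
const fs = require('fs/promises');

async function read_object_range(buffers_pool, file_path, start, end) {
    // get_buffer() may block on the pool's semaphore if all buffers are in use
    const { buffer, callback } = await buffers_pool.get_buffer();
    const fd = await fs.open(file_path, 'r');
    try {
        const count = Math.min(buffer.length, end - start);
        const { bytesRead } = await fd.read(buffer, 0, count, start);
        // copy before releasing - the pooled buffer will be reused
        return Buffer.from(buffer.subarray(0, bytesRead));
    } finally {
        await fd.close();
        callback(); // release the buffer back to the pool; skipping this
                    // is exactly what would make a buffer look "stuck"
    }
}
```

The important property is the `finally`: if the release callback is skipped on some path, the semaphore capacity is never returned and later callers start waiting.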
We want to see the values of `sem._waiting_value` and the trend: does it keep rising? Does it increase and decrease during the test? etc.
We know that it starts at 0:
[3532906]: [nsfs/3532906] [L0] core.sdk.namespace_fs:: NamespaceFS: buffers_pool [ BufferPool.get_buffer: sem value: 2097152 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 33554432 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 425721856 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 3825205248 waiting_value: 0 buffers length: 0 ]
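To make the counters in that print concrete, here is a sketch of a counting semaphore where `value` is the capacity (bytes) currently available and `waiting_value` is the total requested by blocked waiters. This is my reading of the semantics, with illustrative names, not the actual noobaa semaphore code:

```js
// Sketch of what the logged counters plausibly track (illustrative):
// `value` = units currently available,
// `waiting_value` = units requested by callers blocked in wait().
class CountingSemaphore {
    constructor(initial) {
        this._value = initial;
        this._waiting_value = 0;
        this._waiters = [];
    }

    async wait(count) {
        if (this._waiters.length === 0 && this._value >= count) {
            this._value -= count;
            return;
        }
        // not enough capacity - record how much we are waiting for
        this._waiting_value += count;
        await new Promise(resolve => this._waiters.push({ count, resolve }));
    }

    release(count) {
        this._value += count;
        // serve waiters in FIFO order while capacity allows
        while (this._waiters.length > 0 && this._value >= this._waiters[0].count) {
            const waiter = this._waiters.shift();
            this._value -= waiter.count;
            this._waiting_value -= waiter.count;
            waiter.resolve();
        }
    }
}
```

Under this reading, a `waiting_value` that keeps rising across the test would mean buffers are borrowed faster than they are released (or never released), which is exactly what the stuck-buffer warning hints at.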
The stuck-buffer warning print comes from here: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/src/util/buffer_utils.js#L217-L223. It is printed after 2 minutes of waiting, as configured here: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/config.js#L789
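The usual pattern behind such a warning is: create the `Error` up front so its stack points at the borrower, arm a timer for the configured timeout, and cancel the timer on a normal release. A minimal sketch, assuming the 2-minute timeout from the linked config (the constant and function names below are made up):

```js
// Minimal sketch of a stuck-buffer warning timer (illustrative, not a copy
// of buffer_utils.js). The timeout mirrors the linked config entry.
const STUCK_BUFFER_WARNING_TIMEOUT_MS = 2 * 60 * 1000; // hypothetical name

function warn_if_stuck(op_name) {
    // create the Error now so the captured stack shows who borrowed the buffer
    const warning = new Error(`Warning stuck buffer_pool buffer (${op_name})`);
    const timer = setInterval(
        () => console.error(warning), // see the console.warn note below
        STUCK_BUFFER_WARNING_TIMEOUT_MS,
    );
    timer.unref(); // don't keep the process alive just for the warning
    return () => clearInterval(timer); // call on release to cancel the warning
}

// usage around a borrow/release pair:
// const cancel = warn_if_stuck('read_object_stream');
// try { /* ... use the buffer ... */ } finally { cancel(); }
```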
BTW, although it is a warning, I'm not sure why it is printed with `console.error` and not `console.warn`; I can try and suggest a code change.
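Something along these lines, a hypothetical one-line change at the print site in buffer_utils.js:

```js
// current: the warning is printed with ERROR severity
// console.error(warning_err);
// proposed: print with WARN severity, since it is only a warning
console.warn(warning_err);
```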
@shirady
> I had issues with the GPFS machine and couldn't reproduce a whole test run.
Let's see if we can get your cluster working for the run. For now I've kickstarted a 90-minute run with the debug level set to all:
DEBUGLEVEL: all
ENABLEMD5: true
Will update once the run completes.
Note: a high-level thought: this bug might be recreated by load on the system. I am saying this because issue #8524, with the versioning PUT method, didn't run into this error when run for 3 hours (ran on Wednesday).
@shirady
Ran the 90-minute warp GET op run and didn't run into the buffer_pool error:
./warp get --insecure --duration 90m --host
[root@gui0 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui0 log]# grep "stuck" noobaa.log
[root@gui1 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui1 log]# grep "stuck" noobaa.log
Please try to check from the code-flow perspective; as mentioned, it could also be related to load on the system.
Hi,
I will share that I ran warp twice and didn't reproduce it.
I didn't find the printing "Error: Warning stuck buffer_pool buffer" (or any "stuck" in the logs).
I ran it with a high debug level and didn't find a place where the `waiting_value` is not 0 in this output: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/src/util/buffer_utils.js#L205
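For future runs, a small Node helper like the one below could surface any nonzero `waiting_value` quickly. The regex matches the sample log line quoted earlier; the script itself is just a suggestion, not part of the repo:

```js
// scan-waiting-value.js - print log lines where waiting_value is nonzero.
// Usage: node scan-waiting-value.js /var/log/noobaa.log
'use strict';
const fs = require('fs');
const readline = require('readline');

async function main(path) {
    const rl = readline.createInterface({ input: fs.createReadStream(path) });
    let line_num = 0;
    for await (const line of rl) {
        line_num += 1;
        for (const match of line.matchAll(/waiting_value: (\d+)/g)) {
            if (match[1] !== '0') {
                console.log(`${line_num}: waiting_value=${match[1]}`);
            }
        }
    }
}

main(process.argv[2]).catch(err => {
    console.error(err);
    process.exit(1);
});
```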
@rkomandu, I'm planning to check other things; I will update you here about it.
Additional information:
I ran:
./warp get --host=<IP-address-of-1-node> --access-key=<> --secret-key=<> --obj.size=256M --duration=60m --bucket=<bucket-name> --objects 1500 --insecure --tls
(after creating the bucket).
I had to set `--objects` due to space limits on the machine.
The outputs:
With `NSFS_CALCULATE_MD5` set to false:
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 367.05 MiB/s, 1.50 obj/s
Throughput, split into 192 x 5s:
* Fastest: 389.2MiB/s, 1.59 obj/s
* 50% Median: 368.5MiB/s, 1.51 obj/s
* Slowest: 325.8MiB/s, 1.33 obj/s
----------------------------------------
Operation: GET. Concurrency: 20
* Average: 849.39 MiB/s, 3.48 obj/s
Throughput, split into 239 x 15s:
* Fastest: 896.0MiB/s, 3.67 obj/s
* 50% Median: 851.6MiB/s, 3.49 obj/s
* Slowest: 774.0MiB/s, 3.17 obj/s
With `NSFS_CALCULATE_MD5` set to true:
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 161.74 MiB/s, 0.66 obj/s
Throughput, split into 141 x 15s:
* Fastest: 171.4MiB/s, 0.70 obj/s
* 50% Median: 162.1MiB/s, 0.66 obj/s
* Slowest: 148.0MiB/s, 0.61 obj/s
----------------------------------------
Operation: GET. Concurrency: 20
* Average: 821.38 MiB/s, 3.36 obj/s
Throughput, split into 238 x 15s:
* Fastest: 853.9MiB/s, 3.50 obj/s
* 50% Median: 823.2MiB/s, 3.37 obj/s
* Slowest: 741.7MiB/s, 3.04 obj/s
Environment info
noobaa-20241104 (5.17.1) - standalone noobaa
Actual behavior
./warp get --insecure --duration 60m --host.com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket warp-get-bucket-reg 2>&1| tee /tmp/warp-get-11nov2024.log
Observed the following in the log (the system was concurrently running a long versioning test in another directory).
No errors were observed on the client node. I am saying around 03:54 because the GPFS daemon had started back on one node (out of the 2 protocol nodes), where RR-DNS is configured; the IO continued to run when the HA event happened previously. So the above message is not related to HA (will attach the logs).
Default endpoint forks in the system, with 2 CES S3 nodes, each assigned 1 CES IP.
Expected behavior
1. Are we expected to get these ERRORS, as posted above?
CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer
Steps to reproduce
1. Run warp as shown below; it occurred on a system under what I would call a medium workload.
More information - Screenshots / Logs / Other output
Will update once the logs are uploaded: https://ibm.ent.box.com/folder/293508364523