noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

Error: Warning stuck buffer_pool buffer with IO workload #8521

Open rkomandu opened 1 week ago

rkomandu commented 1 week ago

Environment info

noobaa-20241104 (5.17.1) - standalone noobaa

Actual behavior

  1. Ran the Warp IO workload of "get"

./warp get --insecure --duration 60m --host .com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket warp-get-bucket-reg 2>&1| tee /tmp/warp-get-11nov2024.log

Observed the following in the log (the system was concurrently running a long versioning test in another directory):

Nov 11 03:54:17 node-gui0 [3532906]: [nsfs/3532906] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
grep "Error: Warning stuck buffer_pool buffer" noobaa.log | grep "03:54" |wc -l
30

No errors were observed on the client node. I mention around 03:54 because the GPFS daemon had started back on one node (out of the 2 protocol nodes) where RR-DNS is configured; the IO continued to run when the HA happened earlier. So the message above is not related to HA (will attach the logs).


Warp analyze --> ignore the 10 "read tcp" errors, as these are expected since the HA is happening (i.e. GPFS is starting).

./warp analyze --analyze.op GET --analyze.v  warp-get-2024-11-11[030028]-5qrv.csv.zst --debug
7550 operations loaded... Done!

----------------------------------------
Operation: GET (5050). Ran 1h0m4s. Size: 256000000 bytes. Concurrency: 20.
Errors: 15
First Errors:
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59180->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59164->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59270->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59244->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59256->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59258->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59178->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59166->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59222->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59234->9.42.93.99:6443: read: connection reset by peer

Requests considered: 5014:
 * Avg: 14.302s, 50%: 17.271s, 90%: 21.121s, 99%: 23.583s, Fastest: 1.486s, Slowest: 1m54.959s, StdDev: 9.26s
 * TTFB: Avg: 141ms, Best: 11ms, 25th: 55ms, Median: 73ms, 75th: 93ms, 90th: 118ms, 99th: 216ms, Worst: 1m35.183s StdDev: 2.096s
 * First Access: Avg: 17.059s, 50%: 19.136s, 90%: 21.355s, 99%: 23.466s, Fastest: 1.486s, Slowest: 1m52.859s, StdDev: 8.407s
 * First Access TTFB: Avg: 152ms, Best: 21ms, 25th: 59ms, Median: 78ms, 75th: 101ms, 90th: 130ms, 99th: 335ms, Worst: 1m33.717s StdDev: 2.18s
 * Last Access: Avg: 10.997s, 50%: 7.742s, 90%: 20.381s, 99%: 22.633s, Fastest: 1.486s, Slowest: 1m52.228s, StdDev: 6.788s
 * Last Access TTFB: Avg: 91ms, Best: 16ms, 25th: 52ms, Median: 69ms, 75th: 86ms, 90th: 108ms, 99th: 192ms, Worst: 36.283s StdDev: 782ms

Throughput:
* Average: 340.77 MiB/s, 1.40 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 678.2MiB/s, 2.78 obj/s (15s, starting 03:57:01 EST)
 * 50% Median: 248.9MiB/s, 1.02 obj/s (15s, starting 03:26:16 EST)
 * Slowest: 43.6MiB/s, 0.18 obj/s (15s, starting 03:14:16 EST)

Default endpoint forks in the system, with 2 CES S3 nodes having 1 CES IP assigned each.

Expected behavior

  1. Are we expected to get these errors, as posted above?

CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer

Steps to reproduce

  1. Run warp as shown above; it occurred on a system that, I would say, had a medium workload.

More information - Screenshots / Logs / Other output

Will update once the logs are uploaded: https://ibm.ent.box.com/folder/293508364523

romayalon commented 1 week ago

@shirady can you please take a look?

shirady commented 1 week ago

Hi @rkomandu, could you please reproduce and provide logs with a higher debug level? I had issues with the GPFS machine and couldn't reproduce a whole test run.

Additional Details

At this moment I'm looking for a level-1 printing from here: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/src/util/buffer_utils.js#L205

As you can see from the error stack:

[ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)

It comes on a GET request; in NSFS it is read_object_stream, after we tried get_buffer (getting a buffer from memory to execute the read operation).

We want to see the values of sem._waiting_value and the trend: does it keep rising? Does it increase and decrease during the test? etc.
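
To make that trend easy to follow, here is a minimal Node.js sketch (a hypothetical helper, not part of noobaa-core; the log path and the exact "waiting_value: N" wording are assumptions based on the prints quoted in this issue) that scans a noobaa.log and prints the waiting values it finds:

// extract_waiting_value.js - hypothetical helper, illustration only
'use strict';
const fs = require('fs');
const readline = require('readline');

async function main() {
    const log_path = process.argv[2] || 'noobaa.log'; // assumed log location
    const rl = readline.createInterface({ input: fs.createReadStream(log_path) });
    const re = /waiting_value: (\d+)/g;
    for await (const line of rl) {
        const values = [...line.matchAll(re)].map(m => Number(m[1]));
        if (values.length) {
            // the leading slice is just for context (timestamp/prefix of the log line)
            console.log(line.slice(0, 24), '... waiting_values =', values.join(','));
        }
    }
}

main().catch(err => { console.error(err); process.exit(1); });

Usage would be something like: node extract_waiting_value.js /path/to/noobaa.log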

We know that it started with 0:

[3532906]: [nsfs/3532906] [L0] core.sdk.namespace_fs:: NamespaceFS: buffers_pool [ BufferPool.get_buffer: sem value: 2097152 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 33554432 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 425721856 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 3825205248 waiting_value: 0 buffers length: 0 ]

This print comes from here: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/src/util/buffer_utils.js#L217-L223
The warning is printed after 2 minutes of waiting, as configured here: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/config.js#L789

BTW, although it is a warning, I'm not sure why it is printed with console.error and not console.warn; I can try to suggest a code change.
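
For illustration, here is a minimal sketch of that pattern (not the actual buffer_utils.js code; the class, the function names, and the timeout wiring are assumptions for this example). It arms a warning timer while waiting on a semaphore for pool capacity, cancels it once the wait ends, and reports the condition with console.warn as suggested above:

'use strict';

const BUFFER_WARNING_TIMEOUT_MS = 2 * 60 * 1000; // the 2-minute warning timeout mentioned above

// A toy counting semaphore: wait(size) resolves once `size` units are available.
class ToySemaphore {
    constructor(value) {
        this._value = value;
        this._waiting = []; // queued waiters: { size, resolve }
    }
    get waiting_value() {
        // sum of the sizes still waiting - the "waiting_value" seen in the log prints
        return this._waiting.reduce((sum, w) => sum + w.size, 0);
    }
    async wait(size) {
        if (this._value >= size) {
            this._value -= size;
            return;
        }
        await new Promise(resolve => this._waiting.push({ size, resolve }));
    }
    release(size) {
        this._value += size;
        while (this._waiting.length && this._value >= this._waiting[0].size) {
            const w = this._waiting.shift();
            this._value -= w.size;
            w.resolve();
        }
    }
}

async function get_buffer_with_warning(sem, buf_size) {
    // Arm a timer while waiting for pool capacity; if the wait exceeds the timeout,
    // report it as a warning (console.warn) rather than console.error.
    const warn_timer = setTimeout(
        () => console.warn(new Error('Warning stuck buffer_pool buffer')),
        BUFFER_WARNING_TIMEOUT_MS);
    warn_timer.unref(); // do not keep the process alive just for this timer
    try {
        await sem.wait(buf_size);
        return Buffer.allocUnsafe(buf_size);
    } finally {
        clearTimeout(warn_timer); // the wait ended - cancel the pending warning
    }
}

In the real code the buffer comes from the pool's free list and the sizes/timeout come from config.js; this sketch only shows where the warning and the waiting_value in the logs come from.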

rkomandu commented 1 week ago

@shirady

I had issues with the GPFS machine and couldn't reproduce a whole test run.

Let's see if we can get your cluster working for the run. For now I have kickstarted a 90-minute run, with the debug level set to all:

DEBUGLEVEL : all
ENABLEMD5 : true

Will update once the run completes.

Note (high-level thought): this bug might be recreated under load on the system. I am saying this because issue 8524, with the versioning PUT method, did not run into this error when run for 3 hours (ran on Wednesday).

rkomandu commented 1 week ago

@shirady

Ran the Warp GET op for 90 minutes and did not run into the buffer_pool error.

./warp get --insecure --duration 90m --host :6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-get-8521-15nov 2>&1| tee /tmp/newbucket-warp-get-255M-8521-15nov.log

[root@gui0 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui0 log]# grep "stuck" noobaa.log

[root@gui1 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui1 log]# grep "stuck" noobaa.log

rkomandu commented 1 week ago

Please try to check from the code-flow perspective; as mentioned, it could also be related to the load on the system.

shirady commented 5 days ago

Hi, I will share that I ran warp twice and didn't reproduce it. I didn't find the print "Error: Warning stuck buffer_pool buffer" (or any "stuck") in the logs. I ran it with a high debug level and didn't find any place where the waiting_value is not 0 in this output: https://github.com/noobaa/noobaa-core/blob/e1bf29e82128a9e0d359b262978d208ccff228e5/src/util/buffer_utils.js#L205

@rkomandu, I'm planning to check other things; I will update you here about it.

Additional information:

I ran: ./warp get --host=<IP-address-of-1-node> --access-key=<> --secret-key=<> --obj.size=256M --duration=60m --bucket=<bucket-name> --objects 1500 --insecure --tls (after creating the bucket). I had to set --objects due to space limits on the machine.

The outputs:

  1. Without changing the NSFS_CALCULATE_MD5
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 367.05 MiB/s, 1.50 obj/s

Throughput, split into 192 x 5s:
 * Fastest: 389.2MiB/s, 1.59 obj/s
 * 50% Median: 368.5MiB/s, 1.51 obj/s
 * Slowest: 325.8MiB/s, 1.33 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 849.39 MiB/s, 3.48 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 896.0MiB/s, 3.67 obj/s
 * 50% Median: 851.6MiB/s, 3.49 obj/s
 * Slowest: 774.0MiB/s, 3.17 obj/s
  2. Changing the NSFS_CALCULATE_MD5 to true
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 161.74 MiB/s, 0.66 obj/s

Throughput, split into 141 x 15s:
 * Fastest: 171.4MiB/s, 0.70 obj/s
 * 50% Median: 162.1MiB/s, 0.66 obj/s
 * Slowest: 148.0MiB/s, 0.61 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 821.38 MiB/s, 3.36 obj/s

Throughput, split into 238 x 15s:
 * Fastest: 853.9MiB/s, 3.50 obj/s
 * 50% Median: 823.2MiB/s, 3.37 obj/s
 * Slowest: 741.7MiB/s, 3.04 obj/s