noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

NSFS Noobaa hang when reading from AFM cache fileset #7644

Open khanhn1638 opened 7 months ago

khanhn1638 commented 7 months ago

Environment info

Actual behavior

The setup is watsonx.data talking to Noobaa, with Noobaa sitting on top of an AFM-S3 fileset whose remote cloud is IBM Cloud. Under read-intensive workloads from watsonx.data, NSFS hangs and no longer responds over S3 to watsonx.data (the client gets a connection timeout).

Expected behavior

No hang

Steps to reproduce

Set up watsonx.data, non-containerized Noobaa, and AFM-S3.
The query that triggers the failure (recreated twice) is: analyze accelerated.sf1000.store_sales

More information - Screenshots / Logs / Other output

In the first recreate, the server was completely hung and the only thing I could do was collect /var/log/messages. I'm not sure this helps much, but here's a snippet:

Dec 5 13:07:43 techx-wxd-storage node[432167]: Dec-5 13:07:43.378 [nsfs/432167] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[432167]: Dec-5 13:07:43.378 [nsfs/432167] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[430557]: [nsfs/430557] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:215:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:957:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:113:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:65:9)
Dec 5 13:07:43 techx-wxd-storage node[430557]: Dec-5 13:07:43.430 [nsfs/430557] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[430557]: Dec-5 13:07:43.430 [nsfs/430557] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[431639]: [nsfs/431639] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:215:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:957:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:113:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:65:9)
Dec 5 13:07:43 techx-wxd-storage node[431639]: Dec-5 13:07:43.532 [nsfs/431639] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[431639]: Dec-5 13:07:43.532 [nsfs/431639] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:45 techx-wxd-storage node[431137]: [nsfs/431137] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:215:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:957:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:113:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:65:9)
Dec 5 13:07:45 techx-wxd-storage node[431137]: Dec-5 13:07:45.042 [nsfs/431137] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:45 techx-wxd-storage node[431137]: Dec-5 13:07:45.042 [nsfs/431137] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer

from a gpfs pov, it had long waiters:

[root@techx-wxd-storage bin]# mmhealth node show

Node name:      techx-wxd-storage.fyre.ibm.com
Node status:    DEGRADED
Status Change:  10 min. ago

Component     Status      Status Change   Reasons & Notices
GPFS          DEGRADED    10 min. ago     longwaiters_found
NETWORK       HEALTHY     3 days ago      -
FILESYSTEM    HEALTHY     3 days ago      -
AFM           HEALTHY     1 day ago       -
FILESYSMGR    HEALTHY     3 days ago      -

On the second recreate, I was able to get a gcore dump of the Noobaa process.

Of course, these are fairly large, so where is the best place to upload them? If you are an IBMer, please contact me and I can give you access to these logs.
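For readers unfamiliar with the "Warning stuck buffer_pool buffer" message in the log above, here is a minimal sketch of how a bounded buffer pool can report buffers that were handed out and not released within a threshold. It is not the code in noobaa-core's buffer_utils.js. The class name `TrackedBufferPool`, its options (`buf_size`, `max_buffers`, `stuck_ms`), and the methods `try_get_buffer` and `find_stuck` are all hypothetical names chosen for illustration. It assumes the warning relates to buffers held too long by a slow read.

```javascript
'use strict';

// Hypothetical illustration only; not the noobaa-core BuffersPool implementation.
class TrackedBufferPool {
    constructor({ buf_size, max_buffers, stuck_ms }) {
        this.buf_size = buf_size;       // size of each pooled buffer in bytes
        this.max_buffers = max_buffers; // upper bound on buffers handed out at once
        this.stuck_ms = stuck_ms;       // hold time after which a buffer counts as stuck
        this.free = [];                 // released buffers available for reuse
        this.held = new Map();          // id -> { since, where } for outstanding buffers
        this.next_id = 1;
    }

    // Returns { buffer, release }, or null when every buffer is already held.
    try_get_buffer(now = Date.now()) {
        if (this.held.size >= this.max_buffers) return null;
        const buffer = this.free.pop() || Buffer.alloc(this.buf_size);
        const id = this.next_id++;
        // Capture the acquisition stack so a later report can point at the caller.
        const where = new Error('buffer acquired here').stack;
        this.held.set(id, { since: now, where });
        const release = () => {
            if (!this.held.delete(id)) return; // ignore a double release
            this.free.push(buffer);
        };
        return { buffer, release };
    }

    // Lists outstanding buffers that have been held longer than stuck_ms.
    find_stuck(now = Date.now()) {
        const stuck = [];
        for (const [id, { since, where }] of this.held) {
            const held_ms = now - since;
            if (held_ms > this.stuck_ms) stuck.push({ id, held_ms, where });
        }
        return stuck;
    }
}
```

If every read that holds a buffer is blocked on the filesystem (for example on a GPFS long waiter), a pool like this runs dry and new GET requests cannot obtain a buffer, which would look like a hang from the S3 client side.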

khanhn1638 commented 7 months ago

@romayalon @guymguym

github-actions[bot] commented 2 months ago

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.

guymguym commented 2 months ago

@khanhn1638 we have some mechanisms to handle long-waiting AFM operations. First, there is the warmup read, which was added a while ago (perhaps even before your issue) in #7307. Second, we added separation of buffers for small objects in #7609. Would you be able to try reproducing this on the latest 5.15 version? Thanks
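As a rough illustration of the "separation of buffers for small objects" idea, the sketch below picks a buffer pool by read size so that small reads do not have to wait for large buffers held by slow reads. It is not the change made in #7609. The pool names, the sizes in `POOLS`, and the function `pick_pool` are hypothetical and chosen only for this example.

```javascript
'use strict';

// Hypothetical pool tiers, ordered from smallest to largest buffer size.
// These names and sizes are illustrative, not NooBaa configuration values.
const POOLS = [
    { name: 'small', buf_size: 64 * 1024 },
    { name: 'medium', buf_size: 1024 * 1024 },
    { name: 'large', buf_size: 8 * 1024 * 1024 },
];

// Returns the smallest pool whose buffer can hold the whole read.
// Reads larger than every tier fall back to the largest pool and are
// expected to be served in several buffer-sized chunks.
function pick_pool(read_size, pools = POOLS) {
    if (!Number.isInteger(read_size) || read_size < 0) {
        throw new RangeError('read_size must be a non-negative integer');
    }
    for (const pool of pools) {
        if (read_size <= pool.buf_size) return pool;
    }
    return pools[pools.length - 1];
}
```

With separate tiers, a burst of large reads stalled on the remote cloud can exhaust only the large pool, while small-object reads keep drawing from their own pool.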