Open khanhn1638 opened 7 months ago
@romayalon @guymguym
This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.
@khanhn1638 we have some mechanisms to handle long-waiting AFM operations. First, we added a warmup read a while ago (perhaps even before your issue) in #7307. Second, we added a separation of buffers for small objects in #7609. Would you be able to reproduce this again on the latest 5.15 version? Thanks
Environment info
Actual behavior
The setup is watsonx.data talking to NooBaa, with NooBaa sitting on top of an AFM-S3 fileset whose remote cloud is IBM Cloud. Under read-intensive workloads from watsonx.data, NSFS hangs and no longer responds to watsonx.data over S3 (watsonx.data gets a connection timeout).
Expected behavior
No hang; NSFS should keep serving S3 requests.
Steps to reproduce
Set up watsonx.data, non-containerized NooBaa, and AFM-S3.
The query that triggers the failure (recreated 2x) is: `analyze accelerated.sf1000.store_sales`
More information - Screenshots / Logs / Other output
In the 1st recreation, the server was completely hung and the only thing I could do was grab /var/log/messages. I'm not sure this helps much, but here's a snippet:

```
Dec 5 13:07:43 techx-wxd-storage node[432167]: Dec-5 13:07:43.378 [nsfs/432167] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[432167]: Dec-5 13:07:43.378 [nsfs/432167] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[430557]: [nsfs/430557] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:215:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:957:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:113:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:65:9)
Dec 5 13:07:43 techx-wxd-storage node[430557]: Dec-5 13:07:43.430 [nsfs/430557] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[430557]: Dec-5 13:07:43.430 [nsfs/430557] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[431639]: [nsfs/431639] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:215:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:957:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:113:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:65:9)
Dec 5 13:07:43 techx-wxd-storage node[431639]: Dec-5 13:07:43.532 [nsfs/431639] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:43 techx-wxd-storage node[431639]: Dec-5 13:07:43.532 [nsfs/431639] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:45 techx-wxd-storage node[431137]: [nsfs/431137] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:215:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:957:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:113:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:65:9)
Dec 5 13:07:45 techx-wxd-storage node[431137]: Dec-5 13:07:45.042 [nsfs/431137] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec 5 13:07:45 techx-wxd-storage node[431137]: Dec-5 13:07:45.042 [nsfs/431137] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
```
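For context on the warning itself, here is a minimal, illustrative sketch (not noobaa's actual `buffer_utils.js` code; the class shape, timeout, and callback names are assumptions) of how a fixed-size buffer pool can emit a "stuck buffer" warning: `get_buffer` arms a watchdog timer when a buffer is handed out, and the warning fires if the borrower never calls `release`. That matches the symptom in the stack traces above, where `get_buffer` is reached from `read_object_stream`, and once all buffers are stuck, further readers block and the endpoint appears hung.

```javascript
// Illustrative sketch of a buffer pool with a stuck-buffer watchdog.
// NOT noobaa's implementation - names and timeout are assumptions.
class BuffersPool {
    constructor({ buf_size, count, warn_timeout_ms, on_warn }) {
        this.free = Array.from({ length: count }, () => Buffer.alloc(buf_size));
        this.waiters = [];               // resolvers for callers waiting on a free buffer
        this.warn_timeout_ms = warn_timeout_ms;
        this.on_warn = on_warn;          // invoked when a buffer is held too long
    }

    async get_buffer() {
        if (!this.free.length) {
            // No free buffer: block until someone releases one.
            // If every borrower is stuck, this promise never resolves - a hang.
            await new Promise(resolve => this.waiters.push(resolve));
        }
        const buf = this.free.pop();
        // Watchdog: warn if the borrower holds the buffer past the timeout.
        const timer = setTimeout(
            () => this.on_warn(new Error('Warning stuck buffer_pool buffer')),
            this.warn_timeout_ms);
        const release = () => {
            clearTimeout(timer);
            this.free.push(buf);
            const waiter = this.waiters.shift();
            if (waiter) waiter();        // wake one blocked get_buffer() caller
        };
        return { buffer: buf, release };
    }
}
```

The point of the sketch is that the warning is a symptom, not the fault: the actual problem is whatever keeps the borrower (here, the AFM-backed read) from completing and releasing the buffer.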
From a GPFS point of view, it had long waiters:

```
[root@techx-wxd-storage bin]# mmhealth node show

Node name:      techx-wxd-storage.fyre.ibm.com
Node status:    DEGRADED
Status Change:  10 min. ago

Component    Status     Status Change  Reasons & Notices
GPFS         DEGRADED   10 min. ago    longwaiters_found
NETWORK      HEALTHY    3 days ago     -
FILESYSTEM   HEALTHY    3 days ago     -
AFM          HEALTHY    1 day ago      -
FILESYSMGR   HEALTHY    3 days ago     -
```
On the 2nd recreate, I was able to get a gcore dump of the NooBaa process.
Of course, these files are fairly large, so where's the best place to upload them? If you're an IBMer, please contact me and I can give you access to these logs somehow.