scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

`node_exporter` may hang on a DB node with the `error encoding and sending metric family: write tcp %IP%:9100` error #8692

Open vponomaryov opened 3 weeks ago

vponomaryov commented 3 weeks ago

Issue description

While setting up Scylla version 2023.1.11, one of the nodes hung with the following errors:

2024-09-13T11:33:57.764+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | scylla[15426]:  \
    [shard  0] stream_session - [Stream #10657cf0-71c4-11ef-830a-21e3b321ba22] Streaming plan for Bootstrap-system_distributed-index-10 succeeded, peers={10.142.0.14}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2024-09-13T11:34:01.006+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | node_exporter[14047]: \
    ts=2024-09-13T11:34:00.709Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:60390: write: broken pipe"
2024-09-13T11:34:01.017+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | node_exporter[14047]: \
    ts=2024-09-13T11:34:00.728Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:60390: write: broken pipe"
...
2024-09-13T12:31:19.282+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1     !INFO | node_exporter[14047]: \
    ts=2024-09-13T12:31:19.031Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:33436: write: broken pipe"
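To gauge how often the error repeats and which scraper address is on the other end, the log excerpt above can be tallied with a small script. This is only a sketch: the regular expression is an assumption derived from the sample lines in this report, and `count_broken_pipes` is a hypothetical helper, not part of SCT.

```python
import re
from collections import Counter

# Pattern inferred from the node_exporter lines quoted in this issue;
# other node_exporter versions may format the message differently.
BROKEN_PIPE_RE = re.compile(
    r'write tcp (?P<local>[\d.]+:\d+)" '
    r'msg="->(?P<peer>[\d.]+):(?P<peer_port>\d+): write: broken pipe"'
)


def count_broken_pipes(lines):
    """Count broken-pipe errors keyed by (local address:port, peer IP)."""
    counts = Counter()
    for line in lines:
        match = BROKEN_PIPE_RE.search(line)
        if match:
            # The peer port changes on every connection, so it is ignored.
            counts[(match["local"], match["peer"])] += 1
    return counts
```

Feeding it the lines of a node's system log (for example, `count_broken_pipes(open(path))` with a path of your choosing) gives one counter entry per exporter/scraper pair.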

The CI job was later aborted.

Steps to Reproduce

  1. Set up a 3-node DB cluster using custom_d1 (with special disk config)
  2. Observe the error

Expected behavior: node_exporter should always keep working correctly.

Actual behavior: node_exporter may randomly hang.
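One way to tell a hung exporter from a healthy one is to request its metrics endpoint with a timeout. The sketch below assumes the exporter listens on port 9100 (as in the log above) and serves the standard `/metrics` path. `node_exporter_responds` is a hypothetical helper, and the 5-second timeout is an arbitrary choice.

```python
import http.client
import urllib.error
import urllib.request


def node_exporter_responds(host, port=9100, timeout=5.0):
    """Return True if GET /metrics answers with HTTP 200 within `timeout` seconds."""
    url = f"http://{host}:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connection, timeout, reset, or malformed reply:
        # all are treated as "not responding".
        return False
```

A `False` result does not by itself prove a hang (the node may simply be unreachable), so it is best combined with the node's own logs.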

Impact

Setup of a DB node hangs, which spoils the test run.

How frequently does it reproduce?

~3/11 test runs. It is too frequent.

Installation details

SCT Version: master

Scylla version (or git commit hash): 2023.1.11-0.20240729.5a79e79a0320 with build-id 4daf2e1487b1ab784ff564a6c8fd75f9ddd8a9ac

Logs

fruch commented 2 weeks ago

@vponomaryov

if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...

vponomaryov commented 2 weeks ago

> @vponomaryov
>
> if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...

We have a lot of configuration code for it in SCT.