scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

SCT job failed due to Jenkins disconnecting from the runner #5082

Closed ShlomiBalalis closed 1 month ago

ShlomiBalalis commented 2 years ago


Description

The job in question, https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/longevity-multi-keyspaces-60h-test/5/, failed after 34 hours for no reason related to the SCT run itself: there was no critical or error event that would have caused the run to fail, and the run was perfectly healthy. As far as I can tell, it failed because Jenkins lost its connection to the runner:

09:23:48  time="2022-07-28T05:58:27Z" level=error msg="error waiting for container: command [ssh -l ubuntu -- 44.192.100.73 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe\r\n"
09:23:48  Cleaning SSH agent
09:28:58  wrapper script does not seem to be touching the log file in /home/jenkins/slave/workspace/scylla-staging/Shlomo/longevity-multi-keyspaces-60h-test/scylla-cluster-tests@tmp/durable-ec98bbe7
09:28:58  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
09:28:59  [Pipeline] }
09:28:59  [Pipeline] // timeout
09:28:59  [Pipeline] }
09:28:59  [Pipeline] // dir
09:28:59  [Pipeline] }
09:28:59  [Pipeline] // wrap
09:28:59  [Pipeline] }
09:28:59  [Pipeline] // script
09:28:59  [Pipeline] }
09:28:59  ERROR: script returned exit code -1

(44.192.100.73 is the runner's IP)
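For context, Jenkins reaches the Docker daemon on the SCT runner over SSH: for an `ssh://` host, the docker CLI runs `docker system dial-stdio` on the remote side and tunnels the daemon socket through that SSH session, which is exactly the command shown in the error above. Below is a minimal Python sketch of probing that same path by hand; the user and IP are taken from the log above, and the 60-second timeout is an arbitrary illustrative choice.

```python
#!/usr/bin/env python3
"""Probe the docker-over-SSH path that Jenkins uses to reach the SCT runner."""
import subprocess

RUNNER_IP = "44.192.100.73"   # runner IP from the failed job above
RUNNER_USER = "ubuntu"

# `docker -H ssh://...` takes the same route as the Jenkins plugin: the CLI runs
# `ssh ... docker system dial-stdio` on the remote host and pipes the daemon
# socket over that SSH session.
cmd = ["docker", "-H", f"ssh://{RUNNER_USER}@{RUNNER_IP}", "info"]
try:
    subprocess.run(cmd, check=True, timeout=60)
    print("runner Docker daemon is reachable over SSH")
except subprocess.TimeoutExpired:
    print("probe timed out: the SSH tunnel to the runner is likely dead")
except subprocess.CalledProcessError as exc:
    # ssh itself exits with status 255 on connection failures, matching the
    # "exited with exit status 255" lines in the console log.
    print(f"probe failed with exit status {exc.returncode}")
```

If the disconnects come from idle SSH sessions being dropped, setting `ServerAliveInterval` and `ServerAliveCountMax` in the Jenkins worker's `~/.ssh/config` would be one possible mitigation; that is an assumption, not something the pipeline is known to configure today.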

fruch commented 2 years ago

On the same worker (aws-us-east-1-qa-builder-1), at the same time, on a different job:

07:52:06  < t:2022-07-28 04:52:01,800 f:docker_utils.py l:466  c:RemoteLibSSH2CmdRunner p:INFO  > Login to Docker Hub as `scyllaqatest'
07:55:01  Cannot contact aws-us-east-1-qa-builder-1: java.lang.InterruptedException
09:23:48  Cleaning SSH agent
09:23:48  time="2022-07-28T06:18:57Z" level=error msg="error waiting for container: unexpected EOF"
09:23:48  Agent pid 804151 killed
09:28:58  wrapper script does not seem to be touching the log file in /home/jenkins/slave/workspace/enterprise-2022.1/longevity/longevity-lwt-3h-test/scylla-cluster-tests@tmp/durable-10ea62ec
09:28:58  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

https://jenkins.scylladb.com/job/enterprise-2022.1/job/longevity/job/longevity-lwt-3h-test/11/

ShlomiBalalis commented 2 years ago

Happened again in a staging job yesterday evening

21:37:56  time="2022-08-07T18:37:55Z" level=error msg="error waiting for container: command [ssh -l ubuntu -- 34.148.93.142 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Connection to 34.148.93.142 closed by remote host.\r\n"
21:37:58  command [ssh -l ubuntu -- 34.148.93.142 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Connection to 34.148.93.142 closed by remote host.

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/sct-feature-test-backup-gce-multidc/4/

ShlomiBalalis commented 2 years ago

Happened again at https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/sct-feature-test-backup-azure/17/

ShlomiBalalis commented 2 years ago

The issue has happened in two 2022.1.1 jobs:

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 2022.1.1-20220807.e1e2c3d21 with build-id c3ec7353aee613ea535370e90c7c1d7ddcdf3209
Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-01670e0a2b0d4d19b (aws: us-east-1)

Test: longevity-lwt-3h-test
Test id: 43697a7f-e4f7-4f0b-ad62-57a7d0fe7659
Test name: enterprise-2022.1/longevity/longevity-lwt-3h-test
Test config file(s):

Logs:

No logs captured during this run.

Jenkins job URL

And

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 2022.1.1-20220807.e1e2c3d21 with build-id c3ec7353aee613ea535370e90c7c1d7ddcdf3209
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-01670e0a2b0d4d19b (aws: us-east-1)

Test: longevity-100gb-4h-test
Test id: 0c5c8eda-c8fb-4e57-9f3c-45c87aa32860
Test name: enterprise-2022.1/longevity/longevity-100gb-4h-test
Test config file(s):

Logs:

No logs captured during this run.

Jenkins job URL

Do note that some of the logs were captured from those runs, just not in their entirety.

ShlomiBalalis commented 2 years ago

Reproduced again:

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc1-20220902.a9bc6d191031 with build-id 074a0cb9e6a5ab36ba5e7f81385e68079ab6eeda

Cluster size: 5 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

OS / Image: ami-03d21402486bbce67 (aws: eu-west-1)

Test: longevity-2TB-48h-authorization-and-tls-ssl-1dis-2nondis-nemesis-test
Test id: 310c42b4-c7be-4d91-af5d-74981aac2905
Test name: enterprise-2022.2/longevity/longevity-2TB-48h-authorization-and-tls-ssl-1dis-2nondis-nemesis-test
Test config file(s):

Issue description

18:54:59  time="2022-09-13T06:39:12Z" level=error msg="error waiting for container: command [ssh -l ubuntu -- 3.251.78.87 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe\r\n"
18:54:59  command [ssh -l ubuntu -- 3.251.78.87 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe
18:54:59  
18:54:59  Cleaning SSH agent
18:54:59  Agent pid 3094111 killed

Logs:

No logs captured during this run.

Jenkins job URL

ilya-rarov commented 1 year ago

Looks like I got it again here

Issue description

Found this in the Jenkins console log.

[2023-02-25T14:21:23.359Z] time="2023-02-25T14:21:12Z" level=warning msg="commandConn.CloseWrite: commandconn: failed to wait: signal: killed"
[2023-02-25T14:21:23.359Z] time="2023-02-25T14:21:12Z" level=error msg="error waiting for container: context canceled"
[2023-02-25T14:21:23.359Z] command [ssh -l ubuntu -- 54.242.191.35 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe

How frequently does it reproduce?

Frequency is high: 5 of the last 7 runs of this job failed with similar errors.

Installation details

Kernel Version: 5.15.0-1030-aws
Scylla version (or git commit hash): 5.3.0~dev-20230221.d7b6cf045fbf with build-id 19fbad4a238b5ab9f8e00ca3aa25c940b4bce75d

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-065d0cb40cc02d82c (aws: us-east-1)

Test: longevity-50gb-3days-test
Test id: aa3c6baf-5ee1-4aab-b552-e3975635ecb6
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):

Logs and commands:

  • Restore Monitor Stack command: `$ hydra investigate show-monitor aa3c6baf-5ee1-4aab-b552-e3975635ecb6`
  • Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=aa3c6baf-5ee1-4aab-b552-e3975635ecb6)
  • Show all stored logs command: `$ hydra investigate show-logs aa3c6baf-5ee1-4aab-b552-e3975635ecb6`

Logs:

No logs captured during this run.

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-50gb-3days-test/183/)

fruch commented 1 year ago

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-7 (52.87.174.127 | 10.12.10.85) (shards: -1)
  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-6 (3.93.31.229 | 10.12.8.138) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-5 (54.167.20.129 | 10.12.8.181) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-4 (18.206.199.93 | 10.12.10.207) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-3 (54.144.101.131 | 10.12.10.13) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-2 (54.160.63.59 | 10.12.11.157) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-aa3c6baf-1 (34.234.77.209 | 10.12.9.163) (shards: 14)


In your case of the 50gb test, it's caused by the SCT runner running out of disk space and memory, hence the connection to it is lost. There are multiple open issues about that.
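If resource exhaustion on the runner is the suspect, a quick triage over SSH right after such a disconnect can confirm it. A hedged sketch, assuming the runner is still reachable; the address is the one from the console log quoted above, and the commands are generic Linux checks, not SCT tooling.

```python
#!/usr/bin/env python3
"""Check disk, memory, and recent kernel messages on an SCT runner after a disconnect."""
import subprocess

RUNNER = "ubuntu@54.242.191.35"  # runner address from the console log above

# Generic health checks: root filesystem usage, free memory, and the tail of the
# kernel log (the OOM killer reports there when it kills a process).
for probe in ("df -h /", "free -h", "sudo dmesg -T | tail -n 50"):
    print(f"$ {probe}")
    subprocess.run(["ssh", RUNNER, probe], check=False)
```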

ShlomiBalalis commented 1 year ago

It happened again:

Issue description

18:31:29 command [ssh -l ubuntu -- 3.94.31.185 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90 with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-035cd5dfc7ca3cceb (aws: undefined_region)

Test: longevity-5gb-1h-CorruptThenRepairMonkey-aws-test_14676-recreation
Test id: 0e492691-a0a3-478b-89b8-75af0d5eea1b
Test name: scylla-staging/Shlomo/longevity-5gb-1h-CorruptThenRepairMonkey-aws-test_14676-recreation
Test config file(s):

Logs and commands:

  • Restore Monitor Stack command: `$ hydra investigate show-monitor 0e492691-a0a3-478b-89b8-75af0d5eea1b`
  • Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=0e492691-a0a3-478b-89b8-75af0d5eea1b)
  • Show all stored logs command: `$ hydra investigate show-logs 0e492691-a0a3-478b-89b8-75af0d5eea1b`

Logs:

  • db-cluster-0e492691.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0e492691-a0a3-478b-89b8-75af0d5eea1b/20230907_153150/db-cluster-0e492691.tar.gz
  • sct-runner-events-0e492691.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0e492691-a0a3-478b-89b8-75af0d5eea1b/20230907_153150/sct-runner-events-0e492691.tar.gz
  • sct-0e492691.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0e492691-a0a3-478b-89b8-75af0d5eea1b/20230907_153150/sct-0e492691.log.tar.gz
  • loader-set-0e492691.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0e492691-a0a3-478b-89b8-75af0d5eea1b/20230907_153150/loader-set-0e492691.tar.gz
  • monitor-set-0e492691.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0e492691-a0a3-478b-89b8-75af0d5eea1b/20230907_153150/monitor-set-0e492691.tar.gz
  • parallel-timelines-report-0e492691.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/0e492691-a0a3-478b-89b8-75af0d5eea1b/20230907_153150/parallel-timelines-report-0e492691.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/Shlomo/job/longevity-5gb-1h-CorruptThenRepairMonkey-aws-test_14676-recreation/7/)
[Argus](https://argus.scylladb.com/test/cf519435-d81b-44a8-a3be-f5ffc827d744/runs?additionalRuns[]=0e492691-a0a3-478b-89b8-75af0d5eea1b)

fruch commented 1 month ago

This one is from a year ago.

I suspect it was caused by SCT running regexes on very long log lines; we fixed that issue, and such crashes of SCT runners haven't been seen since.
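For illustration, the class of fix described above could amount to bounding how much of each line a log-scanning regex is allowed to see. This is only a sketch with hypothetical names and a placeholder pattern, not the actual change that went into SCT.

```python
import re

# Hypothetical cap on how much of a line the regex may scan, so a single
# multi-megabyte log line costs a bounded amount of CPU and memory.
MAX_SCAN_LEN = 8192

# Placeholder pattern standing in for the event-detection regexes SCT runs
# over database log lines.
EVENT_RE = re.compile(r"error|exception|traceback", re.IGNORECASE)

def scan_log_line(line: str) -> bool:
    """Return True if the (truncated) line matches the event pattern."""
    return EVENT_RE.search(line[:MAX_SCAN_LEN]) is not None
```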