on the same worker at the same time: aws-us-east-1-qa-builder-1
on a different job:
07:52:06 < t:2022-07-28 04:52:01,800 f:docker_utils.py l:466 c:RemoteLibSSH2CmdRunner p:INFO > Login to Docker Hub as `scyllaqatest'
07:55:01 Cannot contact aws-us-east-1-qa-builder-1: java.lang.InterruptedException
09:23:48 Cleaning SSH agent
09:23:48 time="2022-07-28T06:18:57Z" level=error msg="error waiting for container: unexpected EOF"
09:23:48 Agent pid 804151 killed
09:28:58 wrapper script does not seem to be touching the log file in /home/jenkins/slave/workspace/enterprise-2022.1/longevity/longevity-lwt-3h-test/scylla-cluster-tests@tmp/durable-10ea62ec
09:28:58 (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
https://jenkins.scylladb.com/job/enterprise-2022.1/job/longevity/job/longevity-lwt-3h-test/11/
Happened again in a staging job yesterday evening
21:37:56 time="2022-08-07T18:37:55Z" level=error msg="error waiting for container: command [ssh -l ubuntu -- 34.148.93.142 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Connection to 34.148.93.142 closed by remote host.\r\n"
21:37:58 command [ssh -l ubuntu -- 34.148.93.142 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Connection to 34.148.93.142 closed by remote host.
The issue has happened in two 2022.1.1 jobs:
Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 2022.1.1-20220807.e1e2c3d21
with build-id c3ec7353aee613ea535370e90c7c1d7ddcdf3209
Cluster size: 4 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-01670e0a2b0d4d19b
(aws: us-east-1)
Test: longevity-lwt-3h-test
Test id: 43697a7f-e4f7-4f0b-ad62-57a7d0fe7659
Test name: enterprise-2022.1/longevity/longevity-lwt-3h-test
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 43697a7f-e4f7-4f0b-ad62-57a7d0fe7659
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 43697a7f-e4f7-4f0b-ad62-57a7d0fe7659
No logs captured during this run.
And
Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 2022.1.1-20220807.e1e2c3d21
with build-id c3ec7353aee613ea535370e90c7c1d7ddcdf3209
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-01670e0a2b0d4d19b
(aws: us-east-1)
Test: longevity-100gb-4h-test
Test id: 0c5c8eda-c8fb-4e57-9f3c-45c87aa32860
Test name: enterprise-2022.1/longevity/longevity-100gb-4h-test
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 0c5c8eda-c8fb-4e57-9f3c-45c87aa32860
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 0c5c8eda-c8fb-4e57-9f3c-45c87aa32860
No logs captured during this run.
Do note that SOME of the logs were captured from those runs, just not in their entirety.
Reproduced again:
Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc1-20220902.a9bc6d191031
with build-id 074a0cb9e6a5ab36ba5e7f81385e68079ab6eeda
Cluster size: 5 nodes (i3en.3xlarge)
Scylla Nodes used in this run:
OS / Image: ami-03d21402486bbce67
(aws: eu-west-1)
Test: longevity-2TB-48h-authorization-and-tls-ssl-1dis-2nondis-nemesis-test
Test id: 310c42b4-c7be-4d91-af5d-74981aac2905
Test name: enterprise-2022.2/longevity/longevity-2TB-48h-authorization-and-tls-ssl-1dis-2nondis-nemesis-test
Test config file(s):
18:54:59 time="2022-09-13T06:39:12Z" level=error msg="error waiting for container: command [ssh -l ubuntu -- 3.251.78.87 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe\r\n"
18:54:59 command [ssh -l ubuntu -- 3.251.78.87 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe
18:54:59
18:54:59 Cleaning SSH agent
18:54:59 Agent pid 3094111 killed
$ hydra investigate show-monitor 310c42b4-c7be-4d91-af5d-74981aac2905
$ hydra investigate show-logs 310c42b4-c7be-4d91-af5d-74981aac2905
No logs captured during this run.
Looks like I got it again here
Found this in Jenkins console log.
[2023-02-25T14:21:23.359Z] time="2023-02-25T14:21:12Z" level=warning msg="commandConn.CloseWrite: commandconn: failed to wait: signal: killed"
[2023-02-25T14:21:23.359Z] time="2023-02-25T14:21:12Z" level=error msg="error waiting for container: context canceled"
[2023-02-25T14:21:23.359Z] command [ssh -l ubuntu -- 54.242.191.35 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe
Frequency is high: 5 of 7 last runs of this Job failed with similar errors
Kernel Version: 5.15.0-1030-aws
Scylla version (or git commit hash): 5.3.0~dev-20230221.d7b6cf045fbf
with build-id 19fbad4a238b5ab9f8e00ca3aa25c940b4bce75d
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-7 (52.87.174.127 | 10.12.10.85) (shards: -1)
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-6 (3.93.31.229 | 10.12.8.138) (shards: 14)
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-5 (54.167.20.129 | 10.12.8.181) (shards: 14)
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-4 (18.206.199.93 | 10.12.10.207) (shards: 14)
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-3 (54.144.101.131 | 10.12.10.13) (shards: 14)
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-2 (54.160.63.59 | 10.12.11.157) (shards: 14)
- longevity-tls-50gb-3d-master-db-node-aa3c6baf-1 (34.234.77.209 | 10.12.9.163) (shards: 14)
OS / Image: ami-065d0cb40cc02d82c
(aws: us-east-1)
Test: longevity-50gb-3days-test
Test id: aa3c6baf-5ee1-4aab-b552-e3975635ecb6
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor aa3c6baf-5ee1-4aab-b552-e3975635ecb6
Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=aa3c6baf-5ee1-4aab-b552-e3975635ecb6
Show all stored logs command: $ hydra investigate show-logs aa3c6baf-5ee1-4aab-b552-e3975635ecb6
No logs captured during this run.
https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-50gb-3days-test/183/
In your 50gb case, it is caused by SCT running out of disk space and out of memory, so the connection to it is lost. There are multiple issues open about that.
It happened again:
18:31:29 command [ssh -l ubuntu -- 3.94.31.185 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=client_loop: send disconnect: Broken pipe
Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 2022.2.11-20230705.27d29485de90
with build-id f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-035cd5dfc7ca3cceb
(aws: undefined_region)
Test: longevity-5gb-1h-CorruptThenRepairMonkey-aws-test_14676-recreation
Test id: 0e492691-a0a3-478b-89b8-75af0d5eea1b
Test name: scylla-staging/Shlomo/longevity-5gb-1h-CorruptThenRepairMonkey-aws-test_14676-recreation
Test config file(s):
This one is from a year ago.
I suspect it was caused by sct running regexes on very long log lines. We fixed that issue, and such crashes of sct-runners haven't been seen since then.
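For readers unfamiliar with that failure mode, here is a minimal sketch of one way to bound regex work on very long log lines. It is not the actual sct fix: `MAX_LINE_LEN`, `ERROR_RE`, and `match_capped` are hypothetical names, and the cap value is an arbitrary assumption.

```python
import re

# Hypothetical cap; not a value taken from sct.
MAX_LINE_LEN = 4096

# Illustrative pattern modelled on the logrus-style lines quoted in this thread.
ERROR_RE = re.compile(r'level=error msg="(?P<msg>[^"]*)"')


def match_capped(line, pattern=ERROR_RE, max_len=MAX_LINE_LEN):
    """Search only the first max_len characters of a log line.

    Slicing bounds the amount of text the regex engine has to scan,
    at the cost of missing matches that start beyond the cap.
    """
    return pattern.search(line[:max_len])
```

The trade-off is deliberate: anything past the cap is ignored, so the cap has to be larger than any line that is expected to carry a meaningful event.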
Description
The job in question, https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/Shlomo/job/longevity-multi-keyspaces-60h-test/5/ failed after 34 hours for a reason unrelated to the sct run itself: there was no critical or error event that would cause the run to fail, and it was perfectly healthy. Instead, the run appears to have failed because Jenkins lost its connection to the runner.
(44.192.100.73 is the runner's IP)
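When triaging console logs for this failure mode, the `docker system dial-stdio` line is the most specific signature quoted in this thread. The following is a rough sketch of pulling the runner IP and the ssh stderr reason out of such lines. `find_runner_disconnects` is a hypothetical helper, not part of sct or hydra, and the pattern is derived only from the log lines pasted here.

```python
import re

# Signature of the Docker CLI losing its SSH transport to the runner,
# based on the console log lines quoted in this thread.
DIAL_STDIO_RE = re.compile(
    r"command \[ssh -l (?P<user>\S+) -- (?P<host>\S+) docker system dial-stdio\] "
    r"has exited with exit status (?P<status>\d+).*?stderr=(?P<stderr>.*)$"
)

# Literal backslash-r backslash-n plus closing quote, as it appears at the
# end of the logrus-formatted copy of the message.
LOGRUS_TAIL = '\\r\\n"'


def find_runner_disconnects(console_lines):
    """Return (host, exit_status, stderr) for each dial-stdio failure line."""
    hits = []
    for line in console_lines:
        m = DIAL_STDIO_RE.search(line.rstrip("\n"))
        if not m:
            continue
        stderr = m.group("stderr")
        if stderr.endswith(LOGRUS_TAIL):
            stderr = stderr[: -len(LOGRUS_TAIL)]
        hits.append((m.group("host"), int(m.group("status")), stderr.strip()))
    return hits
```

Note that the same disconnect usually shows up twice in the console (once inside the `level=error msg="..."` line and once as a bare line), so callers may want to de-duplicate the result.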