scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

current root disk (35G) isn't enough in extreme situations (dense errors) #3220

Closed amoskong closed 1 year ago

amoskong commented 3 years ago

Description

The two most recent longevity 1TB jobs both failed while collecting logs with ENOSPC (no space left on device). The subsequent clean-resources and send-email steps also failed with ENOSPC.

https://jenkins.scylladb.com/view/scylla-4.4/job/scylla-4.4/job/longevity/job/longevity-1tb-7days-test/2
https://jenkins.scylladb.com/view/scylla-4.4/job/scylla-4.4/job/longevity/job/longevity-1tb-7days-test/3

09:26:29  Created directory to storing collected logs: /home/ubuntu/sct-results/20210202-114434-564842/collected_logs
09:26:29  Start collect logs for cluster db-cluster
09:27:34  Nodes list ['longevity-tls-1tb-7d-4-4-db-node-cde48343-3', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-6', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-7', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-9', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-12', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-10', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-11', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-14', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-15', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-13', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-17', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-16', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-18', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-19', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-21', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-20', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-22', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-23', 'longevity-tls-1tb-7d-4-4-db-node-cde48343-24']
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-3
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-6
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-7
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-9
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-12
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-10
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-11
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-14
09:27:34  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-15
09:27:38  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-13
09:27:38  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-17
09:27:38  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-16
09:27:38  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-18
09:27:38  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-19
09:27:38  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-21
09:27:41  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-20
09:27:41  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-22
09:27:41  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-23
09:27:42  Collecting logs on host: longevity-tls-1tb-7d-4-4-db-node-cde48343-24
09:32:48  Uploading '/home/ubuntu/sct-results/20210202-114434-564842/collected_logs/20210210_012624/db-cluster-cde48343.zip' to https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/db-cluster-cde48343.zip
09:33:28  Uploaded to https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/db-cluster-cde48343.zip
09:33:28  Set public read access
09:33:29  collected data for db-cluster
09:33:29  https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/db-cluster-cde48343.zip
09:33:29  
09:33:29  Start collect logs for cluster monitor-set
09:33:29  Nodes list ['longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1']
09:33:29  Collecting logs on host: longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1
09:33:58  /home/centos/sct-monitoring/scylla-monitoring-data/snapshots/20210210T013354Z-78629a0f5f3f164f
09:39:05  /home/centos/sct-monitoring/scylla-monitoring-data/snapshots/prometheus_data_20210210_012624.tar.gz
09:40:35  scylla-monitoring-src, branch-3.6, 4.4
09:40:35  Get screenshot for url http://13.48.70.60:3000/d/overview-master/scylla-overview?from=1612898784000&to=now, save to /home/ubuntu/sct-results/20210202-114434-564842/collected_logs/20210210_012624/monitor-set-cde48343/longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1/grafana-screenshot-overview-20210210_014035-longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1.png
09:40:35  Get a screenshot of http://13.48.70.60:3000/d/overview-master/scylla-overview?from=1612898784000&to=now
09:40:35  ssh-agent started successfully:
09:40:35        SSH_AUTH_SOCK=/tmp/ssh-zvJhgiBwUqLB/agent.23957
09:40:35        SSH_AGENT_PID=23960
09:40:35  /usr/local/lib/python3.9/site-packages/paramiko/client.py:835: UserWarning: Unknown ssh-ed25519 host key for 13.48.70.60: b'ba4ebb0813806ed6499002d02858dc01'
09:40:35    warnings.warn(
09:40:38  Container <abbec8bd47 longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1-webdriver> started.
09:40:38  Container <ae0143bae4 longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1-13.48.70.60-autossh-web_driver> started.
09:44:12  Get screenshot for url http://13.48.70.60:3000/dashboard/db/longevity-1tb-7days-test-scylla-per-server-metrics-nemesis-master?from=1612898784000&to=now, save to /home/ubuntu/sct-results/20210202-114434-564842/collected_logs/20210210_012624/monitor-set-cde48343/longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1/grafana-screenshot-longevity-1tb-7days-test-scylla-per-server-metrics-nemesis-20210210_014400-longevity-tls-1tb-7d-4-4-monitor-node-cde48343-1.png
09:44:12  Get a screenshot of http://13.48.70.60:3000/dashboard/db/longevity-1tb-7days-test-scylla-per-server-metrics-nemesis-master?from=1612898784000&to=now
09:47:00  Grafana - browser quit
09:47:01  Get snapshot link for url http://13.48.70.60:3000/d/overview-master/scylla-overview?from=1612898784000&to=now
09:48:48  Error taking monitor snapshot: Message: 
09:48:48  , traceback: Traceback (most recent call last):
09:48:48    File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 590, in get_grafana_snapshot
09:48:48      snapshots.append(self._get_shared_snapshot_link(self.remote_browser.browser, grafana_url))
09:48:48    File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 546, in _get_shared_snapshot_link
09:48:48      self.scrolldown_dashboards_view(remote_browser)
09:48:48    File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 521, in scrolldown_dashboards_view
09:48:48      WebDriverWait(remote_browser, 60).until(EC.visibility_of_element_located(self.snapshot_scroll_ready_locator))
09:48:48    File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/support/wait.py", line 80, in until
09:48:48      raise TimeoutException(message, screen, stacktrace)
09:48:48  selenium.common.exceptions.TimeoutException: Message: 
09:48:48  
09:48:48  
09:48:48  Grafana - browser quit
09:48:48  Uploading '/home/ubuntu/sct-results/20210202-114434-564842/collected_logs/20210210_012624/monitor-set-cde48343.zip' to https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/monitor-set-cde48343.zip
09:49:00  Uploaded to https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/monitor-set-cde48343.zip
09:49:00  Set public read access
09:49:01  collected data for monitor-set
09:49:01  https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/monitor-set-cde48343.zip
09:49:01  
09:49:01  Start collect logs for cluster loader-set
09:49:49  Nodes list ['longevity-tls-1tb-7d-4-4-loader-node-cde48343-2', 'longevity-tls-1tb-7d-4-4-loader-node-cde48343-1']
09:49:49  Collecting logs on host: longevity-tls-1tb-7d-4-4-loader-node-cde48343-2
09:49:49  Collecting logs on host: longevity-tls-1tb-7d-4-4-loader-node-cde48343-1
09:50:17  Uploading '/home/ubuntu/sct-results/20210202-114434-564842/collected_logs/20210210_012624/loader-set-cde48343.zip' to https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/loader-set-cde48343.zip
09:50:44  Uploaded to https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/loader-set-cde48343.zip
09:50:44  Set public read access
09:50:44  collected data for loader-set
09:50:44  https://cloudius-jenkins-test.s3.amazonaws.com/cde48343-6692-44ff-83af-24cd11487623/20210210_012624/loader-set-cde48343.zip
09:50:44  
09:50:44  Start collect logs for cluster kubernetes
09:50:44  Nodes list []
09:50:44  No nodes found for kubernetes cluster. Logs will not be collected
09:50:44  There are no logs collected for kubernetes
09:50:44  Start collect logs for cluster sct-runner
09:53:11  Traceback (most recent call last):
09:53:11    File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 872, in <module>
09:53:11      cli()
09:53:11    File "/usr/local/lib/python3.9/site-packages/click/core.py", line 764, in __call__
09:53:11      return self.main(*args, **kwargs)
09:53:11    File "/usr/local/lib/python3.9/site-packages/click/core.py", line 717, in main
09:53:11      rv = self.invoke(ctx)
09:53:11    File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1137, in invoke
09:53:11      return _process_result(sub_ctx.command.invoke(sub_ctx))
09:53:11    File "/usr/local/lib/python3.9/site-packages/click/core.py", line 956, in invoke
09:53:11      return ctx.invoke(self.callback, **ctx.params)
09:53:11    File "/usr/local/lib/python3.9/site-packages/click/core.py", line 555, in invoke
09:53:11      return callback(*args, **kwargs)
09:53:11    File "/home/ubuntu/scylla-cluster-tests/./sct.py", line 662, in collect_logs
09:53:11      collected_logs = collector.run()
09:53:11    File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 1217, in run
09:53:11      if result := log_collector.collect_logs(local_search_path=local_dir_with_logs):
09:53:11    File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 962, in collect_logs
09:53:11      ent.collect(None, self.local_dir, None, local_search_path=local_search_path)
09:53:11    File "/home/ubuntu/scylla-cluster-tests/sdcm/logcollector.py", line 213, in collect
09:53:11      shutil.copy(src=logfile, dst=local_dst)
09:53:11    File "/usr/local/lib/python3.9/shutil.py", line 418, in copy
09:53:11      copyfile(src, dst, follow_symlinks=follow_symlinks)
09:53:11    File "/usr/local/lib/python3.9/shutil.py", line 275, in copyfile
09:53:11      _fastcopy_sendfile(fsrc, fdst)
09:53:11    File "/usr/local/lib/python3.9/shutil.py", line 166, in _fastcopy_sendfile
09:53:11      raise err from None
09:53:11    File "/usr/local/lib/python3.9/shutil.py", line 152, in _fastcopy_sendfile
09:53:11      sent = os.sendfile(outfd, infd, offset, blocksize)
09:53:11  OSError: [Errno 28] No space left on device: '/home/ubuntu/sct-results/20210202-114434-564842/sct.log' -> '/home/ubuntu/sct-results/20210202-114434-564842/collected_logs/20210210_012624/sct-runner-cde48343/sct.log'
09:53:11  Cleaning SSH agent
09:53:11  Agent pid 3382269 killed
09:53:41  + ./docker/env/hydra.sh --execute-on-runner 13.53.36.158 clean-resources --post-behavior --test-id cde48343-6692-44ff-83af-24cd11487623
....
09:53:46  Going to run './sct.py clean-resources --post-behavior --test-id cde48343-6692-44ff-83af-24cd11487623'...
09:53:47  docker: Error response from daemon: mkdir /var/lib/docker/overlay2/1fcee28826943015b4307f6d3f19020f01470382b6d6eba09501dc6633c5bef5-init: no space left on device.
09:53:51  + ./docker/env/hydra.sh --execute-on-runner 13.53.36.158 send-email --test-status ABORTED --start-time 1612266101 --email-recipients qa@scylladb.com
...
09:53:55  Going to run './sct.py send-email --test-status ABORTED --start-time 1612266101 --email-recipients qa@scylladb.com'...
09:53:56  docker: Error response from daemon: mkdir /var/lib/docker/overlay2/47ccf63af4f120fddd65444a173d6a13fe5fa496d571e7586e3627061f0ab9a6-init: no space left on device.
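
The traceback above fails inside `shutil.copy` when `sct.log` is duplicated onto an already full root disk. As an illustration only (this is not the code in `sdcm/logcollector.py`, and not necessarily what the eventual fix does), a guard like the following could skip the copy instead of aborting the whole collection. The helper name `copy_log_if_space` and the `reserve_bytes` parameter are hypothetical.

```python
import errno
import os
import shutil


def copy_log_if_space(src, dst_dir, reserve_bytes=0):
    """Copy src into dst_dir only if the destination filesystem has room.

    Hypothetical helper: returns the destination path on success, or None
    when the copy is skipped or fails with ENOSPC. reserve_bytes is extra
    headroom that should stay free after the copy.
    """
    needed = os.path.getsize(src) + reserve_bytes
    free = shutil.disk_usage(dst_dir).free
    if free < needed:
        # Not enough room: skip rather than raise, so later steps can run.
        return None
    try:
        return shutil.copy(src, dst_dir)
    except OSError as exc:
        # The disk can still fill up between the check and the copy.
        if exc.errno == errno.ENOSPC:
            return None
        raise
```

A caller would treat `None` as "log not collected" and continue, which would at least leave a chance for later steps such as the email report. It does not help the `docker` failures in the log, which need free space on the disk itself.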

amoskong commented 3 years ago

The temporary solution:

amoskong commented 3 years ago

There is a fix: https://github.com/scylladb/scylla-cluster-tests/pull/3193

/Cc @roydahan @bentsi

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 2 days with no activity.