scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

SCT raised a CoreDumpEvent, but the coredump itself was not found anywhere #3715

Open fgelcer opened 3 years ago

fgelcer commented 3 years ago

Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.dev-0.20210613.846f0bd16e4 with build-id 77ebbc518e4fd9560d3993067706780031d4ee26
Cluster size: 4 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-200gb-48h-verify-limited--db-node-eadde21f-1 (13.51.156.141 | 10.0.1.212): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-2 (13.51.159.161 | 10.0.3.56): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-5 (13.49.65.153 | 10.0.1.23): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-7 (13.51.48.88 | 10.0.1.0): 14 shards
Scylla running with shards number (terminated nodes):
longevity-200gb-48h-verify-limited--db-node-eadde21f-4 (13.48.24.246 | 10.0.1.125): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-3 (13.51.55.116 | 10.0.3.249): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-6 (13.53.182.66 | 10.0.1.68): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0efd9637b9940c9b5 (aws: eu-north-1)

Test: longevity-200gb-48h
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

there is an event:

2021-06-19 08:05:29.654: (CoreDumpEvent Severity.ERROR) period_type=not-set event_id=3a31784f-c58c-4340-85b4-73e75293ae8b node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False)

there is something in the node's coredumps.info:

           PID: 96645 (scylla)
           UID: 113 (scylla)
           GID: 119 (scylla)
        Signal: 11 (SEGV)
     Timestamp: Sat 2021-06-19 08:05:03 UTC (20h ago)
  Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 100 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
    Executable: /opt/scylladb/libexec/scylla
 Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
          Unit: scylla-server.service
         Slice: scylla-server.slice
       Boot ID: 1836c9b98e11461094c09b1fe93491d2
    Machine ID: 0d278baa2bee456599166e7a3d1d8f38
      Hostname: longevity-200gb-48h-verify-limited--db-node-eadde21f-2
       Storage: none
       Message: Process 96645 (scylla) of user 113 dumped core.

But where is the core, and why did it happen? I see that the node's log jumps back in time:

2021-06-19T08:03:55+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO    | sshd[286855]: pam_unix(sshd:session): session opened for user scyllaadm by (uid=0)
2021-06-19T07:55:32+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO    | systemd[1]: session-3795.scope: Succeeded.

then back:

2021-06-19T07:57:33+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO    | scylla:  [shard 8] compaction - [Compact keyspace1.standard1 0128fe40-d0d4-11eb-9874-457e9b14cbb1] Compacting [/var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-349770-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348216-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348118-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348230-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348160-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348202-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348132-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348104-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348188-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348174-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348146-big-Data.db:level=1:origin=compaction, ]
2021-06-19T08:05:41+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO    | systemd-logind[674]: Removed session 3889.

and again forward:

2021-06-19T08:21:56+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO    | systemd[1]: Started Session 4047 of user scyllaadm.
2021-06-19T08:03:55+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO    | systemd-logind[674]: New session 3872 of user scyllaadm.

Restore Monitor Stack command: $ hydra investigate show-monitor eadde21f-ad93-476f-a546-842a4fea2708
Show all stored logs command: $ hydra investigate show-logs eadde21f-ad93-476f-a546-842a4fea2708

Test id: eadde21f-ad93-476f-a546-842a4fea2708

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_042523/grafana-screenshot-longevity-200gb-48h-scylla-per-server-metrics-nemesis-20210620_042932-longevity-200gb-48h-verify-limited--monitor-node-eadde21f-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/db-cluster-eadde21f.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/loader-set-eadde21f.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/monitor-set-eadde21f.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/sct-runner-eadde21f.zip

Jenkins job URL

fgelcer commented 3 years ago

it happened during Enospc nemesis:

2021-06-19 08:04:45.492: (DisruptionEvent Severity.NORMAL) period_type=not-set event_id=461b9da3-238b-4fe2-ac36-8d1d9a2ecc68: type=Enospc subtype=start target_node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False) duration=None
2021-06-19 08:05:29.654: (CoreDumpEvent Severity.ERROR) period_type=not-set event_id=3a31784f-c58c-4340-85b4-73e75293ae8b node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False)
2021-06-19 08:07:17.676: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=50170968-a93c-4f54-880a-a5c36244b0ff: alert_name=InstanceDown type=start start=2021-06-19T08:07:09.591Z end=2021-06-19T08:11:09.591Z description=10.0.3.56 has been down for more than 30 seconds. updated=2021-06-19T08:07:09.647Z state=active fingerprint=6aa989b420871186 labels={'alertname': 'InstanceDown', 'instance': '10.0.3.56', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2021-06-19 08:07:17.677: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=24103eb8-cea3-41cc-aeb2-392142cdd329: alert_name=DiskFull type=start start=2021-06-19T08:07:09.591Z end=2021-06-19T08:11:09.591Z description=10.0.3.56 has less than 1% free disk space. updated=2021-06-19T08:07:09.651Z state=active fingerprint=d1f83e9ba67c51f7 labels={'alertname': 'DiskFull', 'device': '/dev/md0', 'fstype': 'xfs', 'instance': '10.0.3.56', 'job': 'node_exporter', 'monitor': 'scylla-monitor', 'mountpoint': '/var/lib/scylla', 'severity': '4'}
2021-06-19 08:08:30.866: (FullScanEvent Severity.WARNING) period_type=not-set event_id=aab6773a-9c85-4ad2-9e00-f6b26041a268: type=finish select_from=keyspace1.standard1 on db_node=10.0.1.212 message=Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for keyspace1.standard1 - received only 0 responses from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
2021-06-18 22:22:19.000: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=3d8b19f1-02b3-42e8-828d-86bc7c5dc737: type=CLIENT_DISCONNECT regex=\!INFO.*cql_server - exception while processing connection:.* line_number=293999 node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-1 [13.51.156.141 | 10.0.1.212] (seed: True)
2021-06-18T22:22:19+00:00  longevity-200gb-48h-verify-limited--db-node-eadde21f-1 !INFO    | scylla:  [shard 4] cql_server - exception while processing connection: std::system_error (error GnuTLS:-10, The specified session has been invalidated for some reason.)
2021-06-19 08:11:13.645: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=95910ea2-d816-482e-ae23-383c2cdc4332: alert_name=InstanceDown type=end start=2021-06-19T08:07:09.591Z end=2021-06-19T08:13:09.591Z description=10.0.3.56 has been down for more than 30 seconds. updated=2021-06-19T08:09:09.644Z state=active fingerprint=6aa989b420871186 labels={'alertname': 'InstanceDown', 'instance': '10.0.3.56', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2021-06-19 08:11:13.646: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=5383e5bb-57b3-4244-9372-7b7e82c314cf: alert_name=DiskFull type=end start=2021-06-19T08:07:09.591Z end=2021-06-19T08:13:09.591Z description=10.0.3.56 has less than 1% free disk space. updated=2021-06-19T08:09:09.648Z state=active fingerprint=d1f83e9ba67c51f7 labels={'alertname': 'DiskFull', 'device': '/dev/md0', 'fstype': 'xfs', 'instance': '10.0.3.56', 'job': 'node_exporter', 'monitor': 'scylla-monitor', 'mountpoint': '/var/lib/scylla', 'severity': '4'}
2021-06-19 08:12:16.738: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=0ab516d2-0032-4f67-9274-8a6ea79a6ad6: alert_name=restart type=start start=2021-06-19T08:12:09.591Z end=2021-06-19T08:16:09.591Z description=Node restarted updated=2021-06-19T08:12:09.753Z state=active fingerprint=0002ec29f7b65adf labels={'alertname': 'restart', 'instance': '10.0.3.56', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '1', 'shard': '0'}
2021-06-19 08:13:31.718: (FullScanEvent Severity.NORMAL) period_type=not-set event_id=f24b7900-400e-4b3e-a355-6449e79f8778: type=start select_from=keyspace1.standard1 on db_node=10.0.1.212
2021-06-19 08:13:41.986: (DisruptionEvent Severity.NORMAL) period_type=not-set event_id=c87a0548-8000-4095-8eaa-f5c9873a9cbb: type=Enospc subtype=end target_node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False) duration=535

So this is probably that old issue where we have never been able to report the core: there is not enough free disk space to dump it while we are testing ENOSPC.
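
For context, a minimal sketch (not actual SCT code) of the check behind that statement: systemd-coredump needs free space at its storage location to write the core, so while the Enospc nemesis keeps the disk full the dump can simply be lost. The path and threshold below are illustrative assumptions.

```python
# Illustrative sketch: verify there is room for a core where systemd-coredump
# writes it (Storage=external defaults to /var/lib/systemd/coredump). If that
# filesystem is full when Scylla crashes, the core cannot be written and only
# the CoreDumpEvent metadata survives.
import shutil

COREDUMP_DIR = "/var/lib/systemd/coredump"   # systemd-coredump default location
MIN_FREE_BYTES = 10 * 1024 ** 3              # hypothetical threshold (10 GiB)

def core_can_be_written() -> bool:
    """Return True if the coredump filesystem still has enough free space."""
    return shutil.disk_usage(COREDUMP_DIR).free >= MIN_FREE_BYTES

if __name__ == "__main__":
    if not core_can_be_written():
        print("coredump storage is (nearly) full - a crash now would lose the core")
```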

fgelcer commented 3 years ago

same, or very similar happened here too:

Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.dev-0.20210613.846f0bd16e4 with build-id 77ebbc518e4fd9560d3993067706780031d4ee26
Cluster size: 6 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-1 (13.51.156.177 | 10.0.3.204): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-2 (13.51.241.70 | 10.0.0.131): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-3 (13.51.233.160 | 10.0.0.167): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-4 (13.51.64.95 | 10.0.0.69): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-5 (13.48.71.164 | 10.0.0.207): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-6 (13.53.38.115 | 10.0.1.111): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0efd9637b9940c9b5 (aws: eu-north-1)

Test: longevity-50gb-3days
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

Restore Monitor Stack command: $ hydra investigate show-monitor b2ffd3dd-f590-43f8-8656-c2b87081b576
Show all stored logs command: $ hydra investigate show-logs b2ffd3dd-f590-43f8-8656-c2b87081b576

Test id: b2ffd3dd-f590-43f8-8656-c2b87081b576

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_042314/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210618_042724-longevity-tls-50gb-3d-master-monitor-node-b2ffd3dd-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/db-cluster-b2ffd3dd.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/loader-set-b2ffd3dd.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/monitor-set-b2ffd3dd.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/sct-runner-b2ffd3dd.zip

Jenkins job URL

dkropachev commented 2 years ago

It is expected behavior not to see a core during ENOSPC.

fruch commented 2 years ago

I don't know if I'd call it expected behavior; it's a known issue in SCT that never got fixed.

fgelcer commented 2 years ago

The only way to get it dumped is to have an external disk used only for this purpose, so that when a core dump happens there is free disk space to write it... but that has other implications and challenges... so I agree with @fruch, it is a known issue more than expected behavior.
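
For illustration only, a rough sketch of what such a dedicated coredump volume could look like on a node; the device name, size cap and use of the default systemd-coredump path are all assumptions, not something SCT does today.

```python
# Hypothetical node-setup step: give systemd-coredump its own volume so that
# filling /var/lib/scylla during the Enospc nemesis cannot starve the core dump.
import subprocess

CORE_DEVICE = "/dev/nvme1n1"                # hypothetical spare block device
CORE_MOUNT = "/var/lib/systemd/coredump"    # systemd-coredump default location

def provision_core_volume() -> None:
    commands = [
        f"mkfs.xfs -f {CORE_DEVICE}",
        f"mkdir -p {CORE_MOUNT}",
        f"mount {CORE_DEVICE} {CORE_MOUNT}",
        # keep file-based storage but cap how much of the volume cores may use;
        # systemd-coredump reads its config per core, so no service restart is needed
        "mkdir -p /etc/systemd/coredump.conf.d",
        "printf '[Coredump]\\nStorage=external\\nMaxUse=200G\\n' "
        "> /etc/systemd/coredump.conf.d/dedicated-volume.conf",
    ]
    for command in commands:
        subprocess.run(["sudo", "sh", "-c", command], check=True)
```

The trade-off mentioned above still applies: it costs an extra volume per node and complicates the image setup.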

dkropachev commented 2 years ago

What we can do is stream cores to S3 via curl and pick them up from there.

fgelcer commented 2 years ago

What we can do is stream cores to S3 via curl and pick them up from there.

is the OS able to stream to S3 instead of dumping it to disk?

fruch commented 2 years ago

What we can do is stream cores to S3 via curl and pick them up from there.

something like this: https://gist.github.com/hashbrowncipher/57dd3a52103cae02290ac65fae9f3422
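
If I read that approach correctly, the idea is to register a pipe handler in kernel.core_pattern so the core is streamed off the box instead of being written locally. A minimal Python sketch of that shape follows; the handler path, endpoint and missing authentication are all assumptions.

```python
#!/usr/bin/env python3
# Hypothetical pipe handler, registered with something like:
#   sysctl -w kernel.core_pattern='|/usr/local/bin/stream_core.py %e %p %t'
# The kernel pipes the core image to stdin, so no local disk space is needed.
import sys

import requests  # assumes requests is installed on the node

UPLOAD_BASE_URL = "https://example-core-uploads.example.com"  # hypothetical endpoint

def main() -> int:
    executable, pid, timestamp = sys.argv[1:4]
    key = f"core.{executable}.{pid}.{timestamp}"
    # requests streams a file-like body with chunked transfer encoding,
    # so the core is uploaded as it is read from the kernel pipe
    response = requests.put(f"{UPLOAD_BASE_URL}/{key}", data=sys.stdin.buffer)
    return 0 if response.ok else 1

if __name__ == "__main__":
    sys.exit(main())
```

Note that taking over kernel.core_pattern this way bypasses systemd-coredump, so coredumpctl would no longer know about these cores.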

juliayakovlev commented 2 years ago

Issue description

I also faced this problem. The core dump was reported as uploaded and download instructions were shared:

< t:2022-05-26 13:36:43,019 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo curl --request PUT --upload-file '/var/lib/systemd/coredump/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz' 'upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz'" finished with status 0
< t:2022-05-26 13:36:43,019 f:coredump.py     l:212  c:sdcm.cluster_aws     p:INFO  > Node longevity-mv-si-4d-2022-1-db-node-81fb644c-8 [16.171.47.121 | 10.0.3.115] (seed: False): CoredumpExportSystemdThread: You can download it by https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz (available for ScyllaDB employee)
< t:2022-05-26 13:36:43,022 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz
< t:2022-05-26 13:36:43,022 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >        Storage: /var/lib/systemd/coredump/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000
< t:2022-05-26 13:36:43,022 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz .
< t:2022-05-26 13:36:43,022 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > gunzip /var/lib/systemd/coredump/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz

But developers complained that they couldn't find it. Indeed, it cannot be found:

juliayakovlev@juliayakovlev-Latitude-5421 ~/Downloads $ gsutil cp gs://upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz .
CommandException: No URLs matched: gs://upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz

And when I search for it manually in GCS, it's not found. With this URL it's not found either: https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz

No such object: upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz

Installation details

Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 5 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0838dc54c055ad05a (aws: eu-north-1)

Test: longevity-mv-si-4days-test
Test id: 81fb644c-b1ac-42de-bd54-0ae2c4889180
Test name: enterprise-2022.1/longevity/longevity-mv-si-4days-test
Test config file(s):

Logs:

Jenkins job URL

fruch commented 2 years ago

@juliayakovlev

Seems like that node ran out of disk space, and the files were cleared before we had a chance to upload them:

9440150:< t:2022-05-26 13:34:28,424 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 022-05-26T13:34:28+00:00 longevity-mv-si-4d-2022-1-db-node-81fb644c-8 !    INFO | Removed old coredump core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.
9456170:< t:2022-05-26 13:39:15,844 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-05-26T13:39:15+00:00 longevity-mv-si-4d-2022-1-db-node-81fb644c-8 !    INFO | Removed old coredump core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz.

juliayakovlev commented 2 years ago

@juliayakovlev

Seems like that node ran out of disk space, and the files were cleared before we had a chance to upload them:

9440150:< t:2022-05-26 13:34:28,424 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 022-05-26T13:34:28+00:00 longevity-mv-si-4d-2022-1-db-node-81fb644c-8 !    INFO | Removed old coredump core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.
9456170:< t:2022-05-26 13:39:15,844 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-05-26T13:39:15+00:00 longevity-mv-si-4d-2022-1-db-node-81fb644c-8 !    INFO | Removed old coredump core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz.

@fruch but the upload finished. Maybe we need to copy the coredump to the runner immediately and then upload it. I understand that it's not a great solution, but it happened again and I can't give the developers all the data they need.

fruch commented 2 years ago

@juliayakovlev upload failed:

9446413:< t:2022-05-26 13:36:42,519 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <h2>The requested URL <code>/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.16535
71792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz</code> was not found on this server.</h2>
9446414-< t:2022-05-26 13:36:42,519 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <h2></h2>
9446415-< t:2022-05-26 13:36:42,519 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > </body></html>

The fact that the return status code was 0 doesn't mean the upload succeeded.

juliayakovlev commented 2 years ago

@juliayakovlev upload failed:

9446413:< t:2022-05-26 13:36:42,519 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <h2>The requested URL <code>/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.16535
71792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz</code> was not found on this server.</h2>
9446414-< t:2022-05-26 13:36:42,519 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <h2></h2>
9446415-< t:2022-05-26 13:36:42,519 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > </body></html>

The fact that the return status code was 0 doesn't mean the upload succeeded.

@fruch we need to handle this and raise an exception.

fruch commented 2 years ago

Julia, I've found the issue; it's not related to what's described in this issue.

Seems like we need to change the URL we are using.

And yes, we should fail on curl failures (we had a retry a long time ago).
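
A minimal sketch of what failing on curl errors could look like around the upload command; the wrapper below is illustrative, not the actual coredump.py code.

```python
# Illustrative upload wrapper: --fail turns HTTP errors (like the 404 above)
# into a non-zero curl exit code, --retry covers transient failures, and
# anything that still fails raises instead of being treated as success.
import subprocess

def upload_core(local_path: str, upload_url: str) -> None:
    command = [
        "sudo", "curl", "--fail", "--retry", "3", "--retry-delay", "5",
        "--request", "PUT", "--upload-file", local_path, upload_url,
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"core upload failed (rc={result.returncode}): {result.stderr.strip()}"
        )
```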

roydahan commented 2 years ago

@fruch do you know who changed that and why?

roydahan commented 2 years ago

@fruch as I commented on the PR, I don't see any reference to such a change in scylla-docs. Moreover, we had several coredumps in the last few weeks and we didn't hit that issue.

soyacz commented 2 years ago

There's also the possibility to stream directly to S3 without using curl or any other specific binary on the DB nodes. See the solution in (still in PR): https://github.com/scylladb/scylla-cluster-tests/pull/5122/files#diff-9039db3d9ae0506cbdbaa9ac8bac7bd2626fc58658769fe24c110af73732d40dR43
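
For context, the rough shape of that approach as I understand it: open the core file on the node over SFTP and hand the stream straight to boto3 on the runner, so the DB node needs no upload tooling at all. The bucket, user and paths below are hypothetical; the real code is in the linked PR.

```python
# Hedged sketch: stream a remote core file to S3 from the SCT runner side.
import boto3
import paramiko

def stream_core_to_s3(node_ip: str, remote_path: str,
                      bucket: str = "example-sct-coredumps",
                      key: str = "example/core.gz") -> None:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(node_ip, username="scyllaadm")          # hypothetical login
    sftp = ssh.open_sftp()
    try:
        with sftp.open(remote_path, "rb") as remote_file:
            # upload_fileobj reads the file-like object in chunks (multipart),
            # so the core never has to be fully staged on the runner's disk
            boto3.client("s3").upload_fileobj(remote_file, bucket, key)
    finally:
        sftp.close()
        ssh.close()
```

(Reading the coredump directory normally needs elevated permissions; that detail is glossed over here.)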

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.