scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

SCT should not generate an error event for coredumpctl metadata but rather only for the scylla decoded backtrace #7590

Closed by yarongilor 3 days ago

yarongilor commented 1 month ago

As can be seen in this run:
https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns%5B%5D=7872957f-b3ab-492a-b193-a6b3c4284aa5

an error event is reported like this:

CoreDumpEvent
ERROR
no nemesis
2024-06-01 05:37:05.728
Received: 2024-06-01 05:12:43.000
one-time
Node longevity-tls-50gb-3d-master-db-node-7872957f-3 [52.211.141.233 | 10.4.8.49]
2024-06-01 05:37:05.728 <2024-06-01 05:12:43.000>: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=773c9f7a-3e27-4c42-a127-db3f79ef3107 node=Node longevity-tls-50gb-3d-master-db-node-7872957f-3 [52.211.141.233 | 10.4.8.49]
corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.112.d0159003556044638dcc3b4d53571938.7473.1717218763000000/core.scylla.112.d0159003556044638dcc3b4d53571938.7473.1717218763000000.gz
backtrace=           PID: 7473 (scylla)
UID: 112 (scylla)
GID: 119 (scylla)
Signal: 6 (ABRT)
Timestamp: Sat 2024-06-01 05:12:43 UTC (2min 14s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 25 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
Unit: scylla-server.service
Slice: scylla-server.slice
Boot ID: d0159003556044638dcc3b4d53571938
Machine ID: 4d6728f7a7bf40e4a36b386a211c4f98
Hostname: longevity-tls-50gb-3d-master-db-node-7872957f-3
Storage: /var/lib/systemd/coredump/core.scylla.112.d0159003556044638dcc3b4d53571938.7473.1717218763000000 (present)
Disk Size: 114.5G
Message: Process 7473 (scylla) of user 112 dumped core.
Stack trace of thread 7481:
#0  0x00007f054c45a884 __pthread_kill_implementation (libc.so.6 + 0x8e884)
#1  0x00007f054c409afe raise (libc.so.6 + 0x3dafe)
#2  0x00007f054c3f287f abort (libc.so.6 + 0x2687f)
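
To make the distinction in the title concrete, here is a minimal sketch of splitting the text above into the coredumpctl metadata fields and the stack frames. It is not SCT code: `split_coredump_info`, `FRAME_RE` and `FIELD_RE` are hypothetical names, and the sketch assumes the input is the text that follows `backtrace=` in the event, in the layout shown above.

```python
import re

# Matches frame lines such as:
#   #0  0x00007f054c45a884 __pthread_kill_implementation (libc.so.6 + 0x8e884)
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+(\S+)\s+\((.+)\)$")
# Matches "Key: value" metadata lines such as "Signal: 6 (ABRT)".
FIELD_RE = re.compile(r"^([A-Z][A-Za-z ]*?):\s+(.*)$")


def split_coredump_info(text):
    """Split coredumpctl-style text into (metadata dict, list of frames).

    Hypothetical helper for illustration only; assumes the layout shown
    in the event above.
    """
    metadata, frames = {}, []
    in_trace = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Stack trace of thread"):
            # Everything after this marker is treated as stack frames.
            in_trace = True
            continue
        frame = FRAME_RE.match(line)
        if in_trace and frame:
            index, address, symbol, module = frame.groups()
            frames.append({
                "index": int(index),
                "address": address,
                "symbol": symbol,
                "module": module,
            })
            continue
        field = FIELD_RE.match(line)
        if not in_trace and field:
            metadata[field.group(1)] = field.group(2)
    return metadata, frames
```

With the two parts separated, the metadata and the frames could be reported, stored, or filtered independently, whichever way the discussion below settles.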


soyacz commented 1 month ago

In my opinion, and based on recent feedback from developers, we should just put these details in a text file and upload it to S3 as an easy-to-download link. The SCT event should contain that link, so we can copy-paste the event into issues. People still paste these backtraces into issues, which makes the issues harder to scroll. The same applies to Argus.
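
A rough sketch of that idea, under stated assumptions: `build_event_summary` and the `upload` callable are hypothetical, not existing SCT or S3 APIs. `upload(name, text)` stands in for whatever helper would store the file and return its URL; the event text itself keeps only links.

```python
import hashlib


def build_event_summary(node, corefile_url, details, upload):
    """Return a short event message; the full details go through `upload`.

    `upload` is a hypothetical callable (name, text) -> URL, standing in
    for whatever storage helper a real implementation would use.
    """
    # Derive a stable file name from the content, so re-uploading the same
    # details produces the same name.
    digest = hashlib.sha256(details.encode("utf-8")).hexdigest()[:12]
    name = f"coredump-details-{digest}.txt"
    details_url = upload(name, details)
    return "\n".join([
        f"CoreDumpEvent node={node}",
        f"corefile_url={corefile_url}",
        f"details_url={details_url}",
    ])
```

The resulting message stays three lines long regardless of how large the coredump details are, which is what makes it easy to paste into an issue.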

roydahan commented 3 days ago

I don't understand why you think that the coredump information shouldn't be an error event. This is important information and should always be included in the reported issue.

Regarding the RFE from @soyacz to make copy-pasting easier, it can be added on top of the existing information. If it is wanted, it should be opened as a new issue.

yarongilor commented 3 days ago

> I don't understand why you think that the coredump information shouldn't be an error event. This is important information and should always be included in the reported issue.
>
> Regarding the RFE from @soyacz to make copy-pasting easier, it can be added on top of the existing information. If it is wanted, it should be opened as a new issue.

@roydahan, @soyacz, were you ever asked for coredump metadata by a core developer? I was never asked for it; I was only asked for the decoded backtrace. When I added this coredump metadata to an issue, they usually complained that it was useless.

roydahan commented 3 days ago

It's not useless, but of course it's not enough.