scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
57 stars 94 forks source link

Timestamps in Scylla server logs don't make sense, and messages are dropped. #6682

Open kbr-scylla opened 1 year ago

kbr-scylla commented 1 year ago

Debugging Scylla from SCT run logs is broken because:

TL;DR can I have logs gathered locally from Scylla process locally on the Scylla machine with timestamps assigned by Scylla using Scylla machine clock?

So it looks like messages are gathered by SCT runner machine remotely through syslog-ng. The runner connects to Scylla machine and then Scylla sends it logs directly to a socket opened by syslog-ng (?) and it looks like timestamps are assigned on the runner machine, by syslog-ng, not by Scylla.

This causes logging to be broken and useless. (Ok, it's not completely useless, but much less useful than it could be.)

For example consider these logs: https://cloudius-jenkins-test.s3.amazonaws.com/9f0e827b-8a9a-4767-b4b4-0582d3f9415f/20231007_113814/db-cluster-9f0e827b.tar.gz

Here's one event:

2023-10-07T10:46:17.572+00:00 longevity-ParallelAddDropColumnMonk-db-node-9f0e827b-2    !DEBUG | scylla[5329]:  [shard 0:stre] raft - [44d4489b-a2d3-45c8-8164-fa5377cffec1] applier fiber waiting for batch, queue size 0
2023-10-07T10:46:17.572+00:00 longevity-ParallelAddDropColumnMonk-db-node-9f0e827b-2      !ERR | syslog-ng[4612]: EOF occurred while idle; fd='37'
2023-10-07T10:46:17.572+00:00 longevity-ParallelAddDropColumnMonk-db-node-9f0e827b-2   !NOTICE | syslog-ng[4612]: Syslog connection closed; fd='37', server='AF_INET(10.12.3.178:32768)', time_reopen='60'
2023-10-07T10:46:17.572+00:00 longevity-ParallelAddDropColumnMonk-db-node-9f0e827b-2    !DEBUG | scylla[5329]:  [shard 0:goss] migration_manager - Not pulling schema because schema pulls were disabled due to Raft.

syslog-ng temporarily lost connection to Scylla machine. Then it reconnected, but we lost all messages from the period when the connection was lost.

Also, the timestamps don't make sense. Last timestamp before connection drop: 10:46:17.572, first timestamp after connection was restored: 10:46:17.572. Those are not real Scylla timestamps. These are timestamps from the SCT runner machine, not from Scylla. I determined by looking at other Scylla events that about 30 seconds have passed between the two events on Scylla machine.

I need timestamps assigned by Scylla clock, on the Scylla machine from the moment the Scylla log message is logged, not by syslog-ng from the moment the log message is gathered. The latter doesn't make any sense at all.

another event causing log message loss:

2023-10-07T10:46:17.578+00:00 longevity-ParallelAddDropColumnMonk-db-node-9f0e827b-2      !ERR | syslog-ng[4612]: Destination reliable queue full, dropping message; filename='/tmp/syslog-ng-00000.rqf', queue_len='909', mem_buf_size='100000', disk_buf_size='2000000', persist_name='afsocket_dd_qfile(stream,10.12.3.178:32768)'

I need all log messages. I cannot afford losing log messages, because it makes debugging borderline impossible. Log messages need to be saved by Scylla process locally on the Scylla machine disk. Then this log file needs to be collected as an artifact. Or it can be collected by syslog-ng, but syslog-ng has to start from the point when it last lost connection, not from current moment in time. And the timestamps need to be assigned by Scylla.

In fact, I think recall that in the past, SCT artifacts contained original journald Scylla logs, no? But nowadays we no longer have them. Maybe I recall wrong. But this is what I need -- original Scylla logs, with timestamps assigned by Scylla using Scylla machine clock at the moment of logging the event, not some arbitrary time later when the log message is sent to a remote machine.

cc @scylladb/qa-maintainers @temichus @soyacz @kostja @avikivity

kbr-scylla commented 1 year ago

In fact, I think recall that in the past, SCT artifacts contained original journald Scylla logs, no? But nowadays we no longer have them. Maybe I recall wrong. But this is what I need -- original Scylla logs, with timestamps assigned by Scylla using Scylla machine clock at the moment of logging the event, not some arbitrary time later when the log message is sent to a remote machine.

It looks like at some point we switched from rsyslog to syslog-ng, maybe I remember how logging worked from rsyslog times?

kbr-scylla commented 1 year ago

Is it possible to configure the test I'm running to use rsyslog like in the past?

kbr-scylla commented 1 year ago

I found this in configure_syslogng_target_script:

        if grep -P "keep-timestamp\\([^)]+\\)" /etc/syslog-ng/syslog-ng.conf; then
            sed -i -r "s/keep-timestamp([ ]*yes[ ]*)/keep-timestamp(no)/g" /etc/syslog-ng/syslog-ng.conf
        else
            sed -i -r "s/([ \t]*options[ \t]*\\\\{{)/\\\\1\\n  keep-timestamp(no);\\n/g" /etc/syslog-ng/syslog-ng.conf
        fi

Why are we setting keep-timestamp=no? https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/keep-timestamp

If disabled, the time of reception will be used instead.

soyacz commented 1 year ago

I need all log messages. I cannot afford losing log messages, because it makes debugging borderline impossible. Log messages need to be saved by Scylla process locally on the Scylla machine disk. Then this log file needs to be collected as an artifact

See system.log inside each node logs bundle. Sometimes it is not provided (in case node was evicted before test end).

2023-10-07T10:46:17.572+00:00 longevity-ParallelAddDropColumnMonk-db-node-9f0e827b-2      !ERR | syslog-ng[4612]: EOF occurred while idle; fd='37'

Is new, might be related to recent change where we added more precision to timestamps

soyacz commented 1 year ago

I found this in configure_syslogng_target_script:

        if grep -P "keep-timestamp\\([^)]+\\)" /etc/syslog-ng/syslog-ng.conf; then
            sed -i -r "s/keep-timestamp([ ]*yes[ ]*)/keep-timestamp(no)/g" /etc/syslog-ng/syslog-ng.conf
        else
            sed -i -r "s/([ \t]*options[ \t]*\\\\{{)/\\\\1\\n  keep-timestamp(no);\\n/g" /etc/syslog-ng/syslog-ng.conf
        fi

Why are we setting keep-timestamp=no? https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/keep-timestamp

If disabled, the time of reception will be used instead.

@dkropachev do you remember why we set keep-timestamp to 'no'?

kbr-scylla commented 1 year ago

See system.log inside each node logs bundle.

Hm, indeed the run I'm looking at now has this, and it looks like what I need more or less, the timestamps are different than the ones in messages.log so I'm guessing they are Scylla timestamps. The only downside being that system.log doesn't have millisecond resolution, can we change it to provide milliseconds like messages.log did?

fgelcer commented 1 year ago

See system.log inside each node logs bundle.

Hm, indeed the run I'm looking at now has this, and it looks like what I need more or less, the timestamps are different than the ones in messages.log so I'm guessing they are Scylla timestamps. The only downside being that system.log doesn't have millisecond resolution, can we change it to provide milliseconds like messages.log did?

it was recently added #6657

soyacz commented 1 year ago

Possibly https://github.com/scylladb/scylla-cluster-tests/pull/6657 introduced regression or this test is somehow specific that led to this error: syslog-ng[4612]: EOF occurred while idle; fd='37' @kbr-scylla was there problem with too many open file descriptors, network interruptions?

fruch commented 1 year ago

Those usually mean we were sending out too much logs, I.e. Scylla was generating too much logs to be sent via network.

SCT needs the log at runtime where the test is running

We are not gonna change that, we are collecting at the end of the test, but nodes that destroyed we don't (i.e. decommission, removed nodes and such)

We could add log collection to those, and that should be enough

kbr-scylla commented 1 year ago

@soyacz

Possibly https://github.com/scylladb/scylla-cluster-tests/pull/6657 introduced regression or this test is somehow specific that led to this error: syslog-ng[4612]: EOF occurred while idle; fd='37'

I have a run from before #6657 which also has this problem: https://jenkins.scylladb.com/job/scylla-staging/job/KamilBraun/job/longevity-schema-changes-3h-test-artsiom_mishuta-clone/6/consoleText https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/db-cluster-13f60000.tar.gz

so #6657 is not related.

The second message just after the EOF one is:

2023-09-29T16:48:54+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1   !NOTICE | syslog-ng[4613]: Syslog connection closed; fd='37', server='AF_INET(10.12.3.77:32768)', time_reopen='60'

I think it's just transient network failures, which is expected (networks are unreliable), we should probably accept that they happen and deal with them reliably.

@fruch

SCT needs the log at runtime where the test is running

But shouldn't logs be buffered, and the buffer sent once syslog-ng reconnects?

fruch commented 1 year ago

@soyacz

Possibly https://github.com/scylladb/scylla-cluster-tests/pull/6657 introduced regression or this test is somehow specific that led to this error: syslog-ng[4612]: EOF occurred while idle; fd='37'

I have a run from before #6657 which also has this problem: https://jenkins.scylladb.com/job/scylla-staging/job/KamilBraun/job/longevity-schema-changes-3h-test-artsiom_mishuta-clone/6/consoleText https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/db-cluster-13f60000.tar.gz

so #6657 is not related.

The second message just after the EOF one is:

2023-09-29T16:48:54+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1   !NOTICE | syslog-ng[4613]: Syslog connection closed; fd='37', server='AF_INET(10.12.3.77:32768)', time_reopen='60'

I think it's just transient network failures, which is expected (networks are unreliable), we should probably accept that they happen and deal with them reliably.

@fruch

SCT needs the log at runtime where the test is running

But shouldn't logs be buffered, and the buffer sent once syslog-ng reconnects?

We don't set the buffer to a very big size, We know that in cases of scylla spamming with thousands or more per second those buffers aren't enough.

Maybe we should play a bit and enlarge them, but we should be addressing the scylla issues causing such bursts

soyacz commented 1 year ago

TL;DR can I have logs gathered locally from Scylla process locally on the Scylla machine with timestamps assigned by Scylla using Scylla machine clock?

https://github.com/scylladb/scylla-cluster-tests/commit/6558f89b02598dcaeeb639f52ea0e234bd2321c6 is fixing this request.

As to setting original time in 'messages.log'

Why are we setting keep-timestamp=no?

I was thinking about it, and possible reason is to have time synchronized for multi-dc scenarios

Because of that and there's not much we can do about syslog-ng issues in very high traffic circumstances, which usually don't show up, can we consider this issue as closed?

kbr-scylla commented 1 year ago

I was thinking about it, and possible reason is to have time synchronized for multi-dc scenarios

Multi-DC makes things even worse for client-side timestamps, the timestamps will be even more desynchronized due to latencies of transmitting the log message.

It may still be better to use server-side (Scylla) timestamps, clocks should be synchronized by NTP anyway which should be much better than adding transmit latency to each message timestamp.

Because of that and there's not much we can do about syslog-ng issues in very high traffic circumstances, which usually don't show up

This issue happens on every run of the SCT job I'm using to debug one issue, and it affected my ability to debug it. (Until I found about system.log, which seems to solve my problem, although I'm worried about system.log being lost in some cases.)

How have you chosen the buffer size? Perhaps it's too small. The nodes are not spamming logs that much that we should start losing them after a 30 second connectivity pause.

fruch commented 1 year ago

I was thinking about it, and possible reason is to have time synchronized for multi-dc scenarios

Multi-DC makes things even worse for client-side timestamps, the timestamps will be even more desynchronized due to latencies of transmitting the log message.

It may still be better to use server-side (Scylla) timestamps, clocks should be synchronized by NTP anyway which should be much better than adding transmit latency to each message timestamp.

Because of that and there's not much we can do about syslog-ng issues in very high traffic circumstances, which usually don't show up

This issue happens on every run of the SCT job I'm using to debug one issue, and it affected my ability to debug it. (Until I found about system.log, which seems to solve my problem, although I'm worried about system.log being lost in some cases.)

I agree that if we need the system.log (for whatever reason, timestamp is only one of the reason we want, we also filter some logs out, i.e. logs that SCT doesn't need, non scylla logs for example)

we do need to make sure that we have it collected also on cases that the test is taking down a node.

How have you chosen the buffer size? Perhaps it's too small. The nodes are not spamming logs that much that we should start losing them after a 30 second connectivity pause.

I don't which calculation was done to select the size, so from my POV it wasn't, and we can re-open it and pick a different probably higher number

given that, I think we should open an issue of each of those, and discuss how to make them happen each on it's own

fruch commented 1 year ago

@soyacz can you raise the issues from above, and put them on our board ?

soyacz commented 1 year ago
dimakr commented 2 months ago

Another occurrence of the issue, which introduced false positive in the test result during longevity-schema-topology-changes-12h-test. The test issues nodetool rebuild command during AddRemoveDc Nemesis and then waits for rebuild.*started kind of log entries to appear, which usually appear within seconds (in this particular test there were 2 other successful rebuilds, where the log records appeared within 5 seconds). Around the same time syslong losts connection and waits for reconnect for 60s:

│2024-08-30T04:45:19.474+00:00 longevity-parallel-topology-schema--db-node-a507ebff-8      !ERR | syslog-ng[4720]: EOF occurred while idle; fd='16'                                                                 │
│2024-08-30T04:45:19.474+00:00 longevity-parallel-topology-schema--db-node-a507ebff-8   !NOTICE | syslog-ng[4720]: Syslog connection closed; fd='16', server='AF_INET(10.12.8.168:32768)', time_reopen='60' 

The Nemesis waited for the logs record to appear and timed out.

fruch commented 2 months ago

Another occurrence of the issue, which introduced false positive in the test result during longevity-schema-topology-changes-12h-test. The test issues nodetool rebuild command during AddRemoveDc Nemesis and then waits for rebuild.*started kind of log entries to appear, which usually appear within seconds (in this particular test there were 2 other successful rebuilds, where the log records appeared within 5 seconds). Around the same time syslong losts connection and waits for reconnect for 60s:

│2024-08-30T04:45:19.474+00:00 longevity-parallel-topology-schema--db-node-a507ebff-8      !ERR | syslog-ng[4720]: EOF occurred while idle; fd='16'                                                                 │
│2024-08-30T04:45:19.474+00:00 longevity-parallel-topology-schema--db-node-a507ebff-8   !NOTICE | syslog-ng[4720]: Syslog connection closed; fd='16', server='AF_INET(10.12.8.168:32768)', time_reopen='60' 

The Nemesis waited for the logs record to appear and timed out.

After the 60s, logs did appear ? I.e. if the test waited 2 minutes more it would find it ? or it wasn't there at all ?

We nowadays have metric to see if we had logs dropped or not (I think it's should be in 2024.1)

kbr-scylla commented 3 weeks ago

With https://github.com/scylladb/scylla-cluster-tests/issues/6688 I'm not sure if it even makes sense to collect messages.log.