scylladb / scylla-cluster-tests

Tests for Scylla Clusters

`node.follow_system_log` may not find a really existing log line #8720

Open vponomaryov opened 2 weeks ago

vponomaryov commented 2 weeks ago

Issue description

When running the disrupt_load_and_stream nemesis we expect start and end log messages to appear for the load-and-stream operation.

In the current case the expected log lines were correctly found in the logs of the first two nodes used:

node-5:

2024-09-14 03:34:08,395 f:load_utils.py   l:116  c:utils                p:INFO  > Running load and stream on the node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-5 for keyspace1.standard1'
...
2024-09-14T03:34:19.865+00:00 longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-5     !INFO | scylla[5477]:  [shard  2:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=false, status=succeeded
...
2024-09-14 03:34:19,942 f:wait.py         l:161  c:sdcm.wait            p:DEBUG > Start line patterns ['(?:storage_service|sstables_loader) - load_and_stream:'] were found.
...
2024-09-14 03:34:19,954 f:wait.py         l:167  c:sdcm.wait            p:DEBUG > End line patterns ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] were found.
...

node-4:

2024-09-14 03:34:25,024 f:load_utils.py   l:116  c:utils                p:INFO  > Running load and stream on the node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-4 for keyspace1.standard1'
...
Sep 14 03:34:34.436147 longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-4 scylla[5486]:  [shard  2:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=false, status=succeeded
...
2024-09-14 03:34:34,570 f:wait.py         l:161  c:sdcm.wait            p:DEBUG > Start line patterns ['(?:storage_service|sstables_loader) - load_and_stream:'] were found.
...
2024-09-14 03:34:34,741 f:wait.py         l:167  c:sdcm.wait            p:DEBUG > End line patterns ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] were found.
...

But on the next node, node-3, the end log line was not detected and the 5-minute timeout error was raised:

node-3:

2024-09-14 03:34:39,579 f:load_utils.py   l:116  c:utils                p:INFO  > Running load and stream on the node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3 for keyspace1.standard1'
...
Sep 14 03:34:48.813688 longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3 scylla[5499]:  [shard  2:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=false, status=succeeded
...
2024-09-14 03:34:49,126 f:wait.py         l:161  c:sdcm.wait            p:DEBUG > Start line patterns ['(?:storage_service|sstables_loader) - load_and_stream:'] were found.
...
2024-09-14 03:39:40.777: (DisruptionEvent Severity.ERROR) period_type=end event_id=9716dc9f-3bb2-4d74-9713-f9616c04899b duration=6m47s: nemesis_name=LoadAndStream target_node=Node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-1 [34.73.216.73 | 10.142.0.157] errors=Timeout occurred while waiting for end log line ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] on node: longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5220, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1679, in disrupt_load_and_stream
    SstableLoadUtils.run_load_and_stream(load_on_node, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/sstable/load_utils.py", line 113, in run_load_and_stream
    with wait_for_log_lines(node, start_line_patterns=[cls.LOAD_AND_STREAM_RUN_EXPR],
  File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 166, in wait_for_log_lines
    raise TimeoutError(end_ctx)
TimeoutError: Timeout occurred while waiting for end log line ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] on node: longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3
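
For reference, the traceback shows that `run_load_and_stream` wraps the load-and-stream action in the `wait_for_log_lines` context manager from `sdcm/wait.py`: the log followers are set up before the action, the action runs inside the `with` block, and the start/end patterns are awaited on exit. Below is a minimal self-contained sketch of that shape, reading a local log file instead of going through `node.follow_system_log`; apart from `start_line_patterns` (visible in the traceback) and the 5-minute end timeout (visible in the error), the names and values are assumptions, not the actual SCT code.

```python
import re
import time
from contextlib import contextmanager


def _log_size(path):
    """Current size of the log file, so only lines written later are scanned."""
    try:
        with open(path, 'rb') as log:
            log.seek(0, 2)  # jump to end of file
            return log.tell()
    except FileNotFoundError:
        return 0


def _pattern_found(path, offset, patterns):
    """Scan the log from `offset`; return True if any pattern matches a line."""
    with open(path, 'rb') as log:
        log.seek(offset)
        for raw_line in log:
            line = raw_line.decode(errors='replace')
            if any(re.search(pattern, line) for pattern in patterns):
                return True
    return False


@contextmanager
def wait_for_log_lines_sketch(log_path, start_line_patterns, end_line_patterns,
                              start_timeout=60, end_timeout=300):
    """Illustrative sketch only, not the real sdcm.wait.wait_for_log_lines."""
    # Remember where the log ends *before* the action, so both pattern checks
    # cover everything written from this point on.
    offset = _log_size(log_path)

    yield  # the load-and-stream action runs here

    for patterns, timeout, kind in ((start_line_patterns, start_timeout, "start"),
                                    (end_line_patterns, end_timeout, "end")):
        deadline = time.time() + timeout
        while not _pattern_found(log_path, offset, patterns):
            if time.time() > deadline:
                raise TimeoutError(
                    f"Timeout occurred while waiting for {kind} log line {patterns}")
            time.sleep(0.5)  # avoid busy-spinning while waiting for the log to catch up
```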

Steps to Reproduce

  1. Run longevity with the disrupt_load_and_stream nemesis
  2. See error

Expected behavior: If a log line exists, it must be detected.

Actual behavior: An existing log line was not detected.

Impact

False negative.

How frequently does it reproduce?

Unknown

Installation details

SCT Version: master
Scylla version (or git commit hash): master

Logs

fruch commented 1 week ago

@soyacz

What do you think: how can we get into a situation where we start waiting only after the action here?

soyacz commented 1 week ago

The first issue is that it discovered the start line late (it appeared at 03:34:39 and was detected at 03:34:49); maybe SCT was too busy, or the message was not flushed to disk. I don't know why. I tried to analyze the code, but found only a CPU-saturation issue when waiting for logs: https://github.com/scylladb/scylla-cluster-tests/pull/8743. I don't know if it helps, but I think we should add a bit of sleeping before recreating the log reader iterator.
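
As described in that comment, the waiting code recreates the log reader iterator when nothing has matched yet; without any pause between passes the loop can saturate a CPU core. A rough sketch of the kind of change being suggested; the helper names (`make_log_reader`, `matches`) and the poll interval are illustrative, not the actual SCT code or the contents of the PR.

```python
import time

POLL_INTERVAL = 0.5  # illustrative value


def wait_for_pattern(make_log_reader, matches, deadline):
    """Poll the log until `matches` hits a line or `deadline` (a timestamp) passes.

    `make_log_reader` recreates the reader iterator on every pass, mirroring the
    existing behaviour; the sleep keeps the loop from pinning a CPU core.
    """
    while time.time() < deadline:
        if any(matches(line) for line in make_log_reader()):
            return True
        # Without this sleep the loop rebuilds the reader and rescans immediately,
        # which is the CPU-saturation issue mentioned above.
        time.sleep(POLL_INTERVAL)
    return False
```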

fruch commented 1 week ago

> The first issue is that it discovered the start line late (it appeared at 03:34:39 and was detected at 03:34:49); maybe SCT was too busy, or the message was not flushed to disk. I don't know why. I tried to analyze the code, but found only a CPU-saturation issue when waiting for logs: #8743. I don't know if it helps, but I think we should add a bit of sleeping before recreating the log reader iterator.

Maybe we can combine the patterns and wait for both at the same time?

So at least start using both generators at the same time?
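
One way to read that suggestion: instead of waiting for the start pattern first and only then looking for the end pattern, scan a single stream of log lines for both pattern sets on the same pass and track which has been seen. A rough sketch of the idea; `log_lines` stands in for whatever iterable the follower yields, and the function name is illustrative, not the actual SCT code.

```python
import re


def scan_for_start_and_end(log_lines, start_line_patterns, end_line_patterns):
    """Single pass over `log_lines` (any iterable of strings), checking both pattern sets."""
    found = {"start": False, "end": False}
    for line in log_lines:
        if not found["start"] and any(re.search(p, line) for p in start_line_patterns):
            found["start"] = True
        if not found["end"] and any(re.search(p, line) for p in end_line_patterns):
            found["end"] = True
        if all(found.values()):
            break
    return found
```

The caller would still poll with a timeout, but both patterns are checked against the same lines, so the end line cannot be missed just because it scrolled past while the code was still looking for the start line.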

soyacz commented 1 week ago

> The first issue is that it discovered the start line late (it appeared at 03:34:39 and was detected at 03:34:49); maybe SCT was too busy, or the message was not flushed to disk. I don't know why. I tried to analyze the code, but found only a CPU-saturation issue when waiting for logs: #8743. I don't know if it helps, but I think we should add a bit of sleeping before recreating the log reader iterator.
>
> Maybe we can combine the patterns and wait for both at the same time?
>
> So at least start using both generators at the same time?

We are starting the generators at the same time, before triggering the action. That's why this issue is a mystery to me.