scylladb / argus

Apache License 2.0
4 stars 10 forks source link

Nemesis `Start` and `end` dates have significantly incorrect values #494

Open vponomaryov opened 1 day ago

vponomaryov commented 1 day ago

What shows Argus: Image

The same as text:

+--------------------------------------+-------------------------------------------------------+
| 1. disrupt_truncate_large_partition  | long-custom-d2-wrkld1-5dc-dev-db-node-454bea32-11     |
|    Status: Succeeded                 | Start: 2024-10-28 23:16:37 | End: 2024-10-28 23:22:24 |
+--------------------------------------+-------------------------------------------------------+
| 2. disrupt_stop_start_scylla_server  | long-custom-d2-wrkld1-5dc-dev-db-node-454bea32-49     |
|    Status: Succeeded                 | Start: 2024-10-28 21:35:58 | End: 2024-10-28 21:36:48 |
+--------------------------------------+------------------------------------------------------+

Real time values according to the jenkins logs are following:

2024-10-28 17:29:01,408 f:sct_config.py   l:2057 c:sdcm.sct_config      p:INFO  > n_db_nodes: 12 12 12 12 12
2024-10-28 17:29:01,408 f:sct_config.py   l:2057 c:sdcm.sct_config      p:INFO  > n_loaders: 5 1 1 1 1
...
2024-10-28 21:02:28,434 f:tester.py       l:750  c:LongevityTest        p:INFO  > Test start time 2024-10-28 21:02:28, duration is 539 and timeout set to 2024-10-29 06:01:28
...
2024-10-28 21:43:31,370 f:loader_utils.py l:290  c:LongevityTest        p:INFO  > Prepare finished
...
...
2024-10-28 21:10:22,686 f:nemesis.py      l:1897 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method stop_start_scylla_server
...
2024-10-28 22:23:50,978 f:nemesis.py      l:1901 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: <<<<<<<<<<<<<Finished random_disrupt_method stop_start_scylla_server
...
...
2024-10-28 22:28:50,979 f:nemesis.py      l:1897 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method truncate_large_partition
...
2024-10-29 00:10:19,140 f:nemesis.py      l:1901 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: <<<<<<<<<<<<<Finished random_disrupt_method truncate_large_partition
...
...
2024-10-29 00:15:19,142 f:nemesis.py      l:1897 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method bootstrap_streaming_error
...
...
2024-10-29 00:47:22,216 f:tester.py       l:2959 c:LongevityTest        p:INFO  > TearDown is starting...
...
2024-10-29 00:51:07,179 f:nemesis.py      l:452  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Report
2024-10-29 00:51:07,179 f:nemesis.py      l:453  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: DB Version: 6.3.0~dev
2024-10-29 00:51:07,179 f:nemesis.py      l:454  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Interval: 300 s
2024-10-29 00:51:07,179 f:nemesis.py      l:455  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Average duration: 186.5 s
2024-10-29 00:51:07,179 f:nemesis.py      l:456  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Total execution time: 13668 s
2024-10-29 00:51:07,179 f:nemesis.py      l:457  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Times executed: 2
2024-10-29 00:51:07,179 f:nemesis.py      l:458  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Unexpected errors: 0
2024-10-29 00:51:07,179 f:nemesis.py      l:459  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Operation log:
2024-10-29 00:51:07,179 f:nemesis.py      l:461  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: {\
    'operation': 'StopStartScyllaServer',
    'start': 1730151358,
    'end': 1730151386,
    'duration': 27,
    'node': 'Node long-custom-d2-wrkld1-5dc-dev-db-node-454bea32-49 [44.208.35.41 | 10.12.7.49] (dc name: us-east-1)',
    'subtype': 'end'}
2024-10-29 00:51:07,179 f:nemesis.py      l:461  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: {\
    'operation': 'TruncateLargePartition', 
    'start': 1730157397, 
    'end': 1730157743, 
    'duration': 346, 
    'node': 'Node long-custom-d2-wrkld1-5dc-dev-db-node-454bea32-11 [3.76.36.146 | 10.11.6.209] (dc name: eu-central-1)', 
    'subtype': 'end'}
2024-10-29 00:51:07,179 f:cluster.py      l:4521 c:sdcm.cluster         p:INFO  > Cluster long-custom-d2-wrkld1-5dc-dev-db-cluster-454bea32 (\
    AMI: [
        'ami-0cbeb1d253aae6204', 
        'ami-07ba85ecd266cdea7', 
        'ami-0285a208b33f03cb9', 
        'ami-0a8cc93a1aef18f7e', 
        'ami-030a158b040cdac2f'
    ] Type: i3en.large): Set _nemesis_termination_event
...
2024-10-29 01:17:45,819 f:tester.py       l:3047 c:LongevityTest        p:INFO  > ================================= TEST RESULTS =================================
2024-10-29 01:17:45,819 f:tester.py       l:3047 c:LongevityTest        p:INFO  > ================================================================================
2024-10-29 01:17:45,819 f:tester.py       l:3047 c:LongevityTest        p:INFO  > SUCCESS :)

Most probably it is somehow related to the fact that the setup is 5 dcs with 12 DB nodes per each - 60 DB nodes in summary.

soyacz commented 1 day ago

IIRC it's because of health check happens between Started-Finished log lines, and nemesis time sent to Argus is calculated out of disruption method only. In big clusters healthcheck takes longer.