Remove endpoint to host_id mapping when removing host by host_id

sylwiaszunejko commented 5 months ago

To remove a host that is not found in peers metadata, remove_host_by_host_id is used. In most cases we want to remove a host that is a duplicate of a host found in peers metadata, with the same endpoint but a different host_id. In that case the mapping in _host_id_by_endpoint has already been overwritten by the new host found in peers metadata, so we don't want to remove it.

This PR handles the case where the host being removed does not have a duplicate with a different host_id in peers metadata, so we do need to remove its mapping from _host_id_by_endpoint.

Without this change, we sometimes ended up with an endpoint->host_id mapping in Metadata while no host with that host_id existed in Metadata _hosts, which caused a KeyError.
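A minimal sketch of the two removal cases, for readers who want to see the rule in code. This is a simplified model, not the driver's implementation: the class name MetadataSketch, the add_host helper, the optional endpoint argument and the use of plain strings for endpoints are assumptions made for illustration. Only the names _hosts, _host_id_by_endpoint and remove_host_by_host_id come from the description.

```python
import threading
import uuid


class MetadataSketch:
    """Simplified stand-in for the driver's host bookkeeping (illustrative only)."""

    def __init__(self):
        self._hosts = {}                # host_id -> endpoint (a Host object in the driver)
        self._host_id_by_endpoint = {}  # endpoint -> host_id
        self._lock = threading.Lock()

    def add_host(self, host_id, endpoint):
        # Hypothetical helper: a later add for the same endpoint overwrites the mapping.
        with self._lock:
            self._hosts[host_id] = endpoint
            self._host_id_by_endpoint[endpoint] = host_id

    def remove_host_by_host_id(self, host_id, endpoint=None):
        with self._lock:
            # Drop the mapping only if it still points at the host being removed.
            if endpoint is not None and self._host_id_by_endpoint.get(endpoint) == host_id:
                del self._host_id_by_endpoint[endpoint]
            return self._hosts.pop(host_id, None) is not None


endpoint = "127.205.137.78:9042"
random_id, real_id = uuid.uuid4(), uuid.uuid4()

meta = MetadataSketch()
meta.add_host(random_id, endpoint)  # contact point with a randomly chosen host_id
meta.add_host(real_id, endpoint)    # same endpoint from peers metadata overwrites the mapping

# Case 1: removing the duplicate keeps the mapping to the host from peers metadata.
meta.remove_host_by_host_id(random_id, endpoint)
kept_mapping = meta._host_id_by_endpoint.get(endpoint) == real_id

# Case 2: removing a host that has no duplicate also drops its mapping.
meta.remove_host_by_host_id(real_id, endpoint)
mapping_dropped = endpoint not in meta._host_id_by_endpoint
```

Comparing the stored host_id before deleting is what lets one method serve both cases: the caller does not need to know whether a duplicate exists.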

Refs: https://github.com/scylladb/scylladb/issues/17662

kbr-scylla commented 5 months ago

Fixes: https://github.com/scylladb/scylladb/issues/17662

I'll repeat what I said on some other PR

The issue will be fixed only after we release a new driver version and update the driver submodule in ScyllaDB repo to use the new version. Until that happens, the fix will not be used in CI, so the issue will keep occurring. In other words, this Fixes marker is incorrect.

(But having a reference to the issue is still useful.)

kbr-scylla commented 5 months ago

update the driver submodule in ScyllaDB repo

Correction: update the frozen toolchain

kbr-scylla commented 5 months ago

Also @avikivity I think we should consider removing python driver from the frozen toolchain, but make it a submodule instead, this would make it easier to run our test suite against new Python driver fixes even before driver release

kbr-scylla commented 5 months ago

@sylwiaszunejko will this change actually allow the driver to reconnect in https://github.com/scylladb/scylladb/issues/17662? Does the bad mapping prevent driver from progressing and reestablishing a new connection?

kbr-scylla commented 5 months ago

I see you updated the cover letter, but commit message still has Fixes. So it will still close that issue when merged

bhalevy commented 5 months ago

Also @avikivity I think we should consider removing python driver from the frozen toolchain, but make it a submodule instead, this would make it easier to run our test suite against new Python driver fixes even before driver release

MSTM (Makes Sense To ME :))

sylwiaszunejko commented 5 months ago

Does the bad mapping prevent driver from progressing and reestablishing a new connection?

@kbr-scylla bad mapping caused KeyError and that prevented driver from adding new host and reestablishing a new connection

kbr-scylla commented 5 months ago

@avelanarius @Lorak-mmk please review this promptly, it's hurting our CI badly.

sylwiaszunejko commented 5 months ago

@avelanarius you were asking for the details about the scenario when this reproduced:

We are connecting to cluster with contact points: ['127.205.137.78']
Host 127.205.137.78:9042 is now marked up with randomly chosen host_id
Refreshing node list and token map
Added host from peers metadata - _hosts = {UUID('4c489396-537e-4cea-a21d-8adb43ce7dd4'): <Host: 127.205.137.78:9042 datacenter1>, UUID('158147c7-d69d-4d6e-be80-7f2566f3c00c'): <Host: 127.205.137.78:9042 datacenter1>}
We have two of the same endpoint with different host_ids
REMOVE HOST BY ID: 4c489396-537e-4cea-a21d-8adb43ce7dd4 (removing duplicate with random host_id) here we want to remove host with random host_id, but we do not want to remove the mapping endpoint->host_id because it was overwritten by adding host from peers metadata
Found new host to connect to: 127.205.137.86:9042
Found new host to connect to: 127.205.137.45:9042
Found new host to connect to: 127.205.137.30:9042
Found new host to connect to: 127.205.137.93:9042
Found new host to connect to: 127.205.137.54:9042
Found new host to connect to: 127.205.137.67:9042
Host 127.205.137.78:9042 has been marked down
Refreshing node list and token map using preloaded results
Removing host not found in peers metadata: <Host: 127.205.137.67:9042 datacenter1> this host was added before, but it is not present in peers so we want to remove it from _hosts, but also we want to remove the mappings - without my change host is removed from _hosts but in _host_id_by_endpoint there is still mapping from 127.205.137.67:9042->host_id
Host 127.205.137.45:9042 has been marked down
Error attempting to reconnect to 127.205.137.78:9042
Found new host to connect to: 127.205.137.67:9042
Key error due to the fact that we have mapping, so we are trying to access host with endpoint 127.205.137.67:9042 but it is removed
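The last step of this scenario can be reproduced with two plain dictionaries. This is a sketch under assumptions: the dictionaries stand in for _hosts and _host_id_by_endpoint, and the add_host and get_host helpers are hypothetical, since the thread does not say which driver code path raised the KeyError.

```python
import uuid

hosts = {}                # host_id -> endpoint, stand-in for Metadata _hosts
host_id_by_endpoint = {}  # endpoint -> host_id, stand-in for _host_id_by_endpoint


def add_host(host_id, endpoint):
    # Hypothetical helper that records a host in both structures.
    hosts[host_id] = endpoint
    host_id_by_endpoint[endpoint] = host_id


def get_host(endpoint):
    # Hypothetical two-step lookup: endpoint -> host_id -> host entry.
    host_id = host_id_by_endpoint.get(endpoint)
    return hosts[host_id] if host_id is not None else None


endpoint = "127.205.137.67:9042"
host_id = uuid.uuid4()
add_host(host_id, endpoint)

# Behaviour without the change: the host leaves _hosts, its mapping stays behind.
hosts.pop(host_id)

try:
    get_host(endpoint)
    outcome = "found"
except KeyError:
    outcome = "KeyError"  # the stale mapping points at a host_id that is gone

# Once the mapping is removed as well, the lookup reports an unknown host instead.
host_id_by_endpoint.pop(endpoint)
after_cleanup = get_host(endpoint)
```

The point of the sketch is that the two structures have to be updated together: a lookup that trusts the endpoint mapping fails as soon as the mapping outlives the host entry.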

avelanarius commented 5 months ago

@avelanarius @Lorak-mmk please review this promptly, it's hurting our CI badly.

Yes, we are aware of it, but arguably we should actually spend more time reviewing this PR, testing it more thoroughly to avoid introducing new regressions (as the previous PR that introduced the code path responsible for KeyError introduced a regression...)

sylwiaszunejko commented 5 months ago

@avelanarius I added some more logs and confirmed that there is in fact an inconsistency in peers metadata between hosts

Driver connects to 127.155.198.13, this host is used in ControlConnection to refresh node list and token map, from it we get the information about peers metadata
We add next 6 hosts and they connect without any problem
127.155.198.13 is stopped, we close all connections
We try to refresh using 127.155.198.13, it fails, we now use 127.155.198.11 in ControlConnection
We refresh using 127.155.198.11, in this host peer information there is lack of info about one of the hosts (127.155.198.91) - this host was added before, but it is not present in peers so we want to remove it from _hosts, but also we want to remove the mappings - without my change host is removed from _hosts but in _host_id_by_endpoint there is still mapping from 127.155.198.91:9042->host_id

mykaul commented 5 months ago

How is system peers table propagated in the cluster?

kbr-scylla commented 5 months ago

How is system peers table propagated in the cluster?

In raft-topology mode (where I assume this was reproduced), it is updated on two events:

when a topology change Raft command is applied that modifies entries in system.topology (which are host ID based); this recalculates system.peers data (which is IP based)
when we learn about a new IP for a node (e.g. on IP change, or when we learn its IP for the first time)
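The two events above can be pictured with a toy model. Everything in it is an assumption made for illustration and not Scylla's implementation: the recalculate_peers function, the string host IDs, and in particular the rule that a member whose IP is not yet known is left out of the IP-based view. Whether something like this happened in the reported run is the open question here.

```python
def recalculate_peers(topology_host_ids, ip_by_host_id, local_host_id):
    """Derive an IP-keyed peers view from host-ID-keyed topology (toy model)."""
    return {
        ip_by_host_id[host_id]: host_id
        for host_id in topology_host_ids
        # Assumption: skip the local node and any member whose IP is unknown.
        if host_id != local_host_id and host_id in ip_by_host_id
    }


topology = {"A", "B", "C"}  # host IDs, as in the host-ID-based topology
ips_known_to_b = {"A": "127.155.198.13", "B": "127.155.198.11"}

# Event 1: a topology change is applied while node B does not know C's IP yet.
peers_on_b = recalculate_peers(topology, ips_known_to_b, local_host_id="B")
missing_before = "127.155.198.91" not in peers_on_b

# Event 2: B learns C's IP and recalculates its peers view.
ips_known_to_b["C"] = "127.155.198.91"
peers_on_b = recalculate_peers(topology, ips_known_to_b, local_host_id="B")
present_after = peers_on_b.get("127.155.198.91") == "C"
```

In this model a driver that refreshes from node B between the two events would see a peers view without 127.155.198.91, which matches what the driver logs show.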

We refresh using 127.155.198.11, in this host peer information there is lack of info about one of the hosts (127.155.198.91) - this host was added before, but it is not present in peers so we want to remove it from _hosts, but also we want to remove the mappings - without my change host is removed from _hosts but in _host_id_by_endpoint there is still mapping from 127.155.198.91:9042->host_id

Could it be that the node that new control connection was established to, simply didn't catch up with the latest system.peers yet? Like here: https://github.com/scylladb/scylladb/issues/16373#issuecomment-1983349491

I need full logs from all nodes from the entire run to confirm.

kbr-scylla commented 5 months ago

We refresh using 127.155.198.11, in this host peer information there is lack of info about one of the hosts (127.155.198.91) - this host was added before, but it is not present in peers

But is it using "preloaded results" (like you described in previous post), which might potentially be old, or is it actually fetching new system.peers?

I'm trying to understand if there's a problem on Scylla side too.

sylwiaszunejko commented 5 months ago

Could it be that the node that new control connection was established to, simply didn't catch up with the latest system.peers yet? Like here: https://github.com/scylladb/scylladb/issues/16373#issuecomment-1983349491

I think it could be like that. I am not really familiar with how and when system.peers changes, but this is what was seen from the driver side: there was an inconsistency and the driver reacted to it incorrectly. This PR fixes that reaction.

But is it using "preloaded results" (like you described in previous post), which might potentially be old, or is it actually fetching new system.peers?

These "preloaded results" are queried just before _refresh_node_list_and_token_map is called, so I assume they are not old.

avikivity commented 5 months ago

Also @avikivity I think we should consider removing python driver from the frozen toolchain, but make it a submodule instead, this would make it easier to run our test suite against new Python driver fixes even before driver release

This has the downside of possibly releasing with a pre-release driver.

I would like the flexibility of testing CI with a new driver without regenerating the toolchain.

Options:

add a 'driver override' option somewhere (say configure.py) to download a driver during build
always download the latest version of the driver during build (rather than freeze it in the toolchain), developers override it with a URL
always download a pinned version of the driver during build (rather than freeze it in the toolchain), developers override it with a URL
teach people who want to test a driver to update an existing frozen toolchain with a new driver (new docker layer), push it to their personal account, update tools/toolchain/image to point to it, and run CI

mykaul commented 5 months ago

This has the downside of possibly releasing with a pre-release driver.

Releasing, or testing? We don't release with a driver (I hope - perhaps cqlsh is an exception) ?

mykaul commented 5 months ago

@sylwiaszunejko - any progress on this PR? It's hurting us in Scylla master CI.

avelanarius commented 5 months ago

We can merge it as-is, but I'm still not satisfied with the testing we have in Python Driver (it will take more time to improve the situation).

avelanarius commented 5 months ago

As for testing (I got mixed signals who should be responsible for it):

@sylwiaszunejko is running all test.py tests with this PR to make sure there are no regressions
@muzarski will run byo_build_tests_dtest with Python Driver from this PR to make sure there are no regressions in dtests

We have to do this since Python Driver tests themselves are insufficient to detect the types of errors we recently saw in Scylla CI. (But we'll work to make the situation better in the future).

As for merging and releasing, I think this is a reasonable "gating" criteria for now.

roydahan commented 5 months ago

As for merging and releasing, I think this is a reasonable "gating" criteria for now.

I agree.

sylwiaszunejko commented 5 months ago

I executed all tests from scylla test.py and there was no problem (~20 times). I also ran the test that reproduced the problem before and everything worked fine (~300 times).

avikivity commented 5 months ago

This has the downside of possibly releasing with a pre-release driver.

Releasing, or testing? We don't release with a driver (I hope - perhaps cqlsh is an exception) ?

The driver gets bundled, otherwise cqlsh won't work, will it?

mykaul commented 5 months ago

This has the downside of possibly releasing with a pre-release driver.

Releasing, or testing? We don't release with a driver (I hope - perhaps cqlsh is an exception) ?

The driver gets bundled, otherwise cqlsh won't work, will it?

That's up to cqlsh which is:

A sub-module
Perhaps shouldn't be a sub-module, and just get installed with 'pip' ?

muzarski commented 5 months ago

I've scheduled a dtest run with 20 retries (https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/202/) using the driver version from PR. It failed, and here is the list of the failing tests:

FAILED lwt_random_test.py::TestRandomPaxos::test_topology_add_decommission_reboot[15-20] 
FAILED cluster_replacement_test.py::TestClusterReplacement::test_rolling_cluster_replacement_sequentially_dead_nodes_remove_and_add_multi_dc[3-20] 
FAILED lwt_random_test.py::TestRandomPaxos::test_topology_add_decommission_reboot[4-20] 
FAILED concurrent_schema_changes_test.py::TestConcurrentSchemaChanges::test_create_lots_of_alters_concurrently[10-20] 
FAILED lwt_schema_modification_test.py::TestLWTSchemaModification::test_lwt_load[4-20] 
FAILED cluster_replacement_test.py::TestClusterReplacement::test_rolling_cluster_replacement_sequentially_dead_nodes_multi_dc[3-20] 
FAILED data_distribution_balance_test.py::TestDataDistribution::test_data_distribution_balance[LeveledCompactionStrategy-4-20] 
FAILED compaction_additional_test.py::TestCompactionAdditional::test_double_compaction_by_cleanup_and_ongoing_compaction[17-20] 
FAILED repair_based_node_operations_test.py::TestRepairBasedNodeOperations::test_enable_rbno_for_bootstrap[3-16-20] 
FAILED compaction_additional_test.py::TestCompactionAdditional::test_refresh_and_restart_after_compaction_strategy_change[SizeTieredCompactionStrategy-TimeWindowCompactionStrategy-7-20] 
FAILED deletion_test.py::TestRangeDeletion::test_update_by_1ck_range[LeveledCompactionStrategy-8-20] 
FAILED deletion_test.py::TestRangeDeletion::test_update_by_1ck_range[LeveledCompactionStrategy-8-20] 
FAILED cleanup_test.py::TestCleanup::test_cluster_cleanup_no_resurrection[11-20] 
FAILED cleanup_test.py::TestCleanup::test_cluster_cleanup_no_resurrection[11-20] 
FAILED compaction_test.py::TestCompaction::test_compaction_delete_tombstone_gc[timeout-TimeWindowCompactionStrategy-20-20] 
FAILED backup_restore_tests.py::TestBackupRestore::test_restore_snapshot_using_different_smp_setting[7-20] 
FAILED cdc_batch_test.py::TestCDCBatchesSimple::test_preimage_full_delta_batch[map_text_blob-15-20] 
FAILED compaction_additional_test.py::TestCompactionAdditional::test_compaction_delete_with_smp_change[19-20] 
FAILED concurrent_schema_changes_test.py::TestConcurrentSchemaChanges::test_create_lots_of_alters_concurrently[10-20] 
FAILED lwt_schema_modification_test.py::TestLWTSchemaModification::test_lwt_load[4-20] 
FAILED cluster_replacement_test.py::TestClusterReplacement::test_rolling_cluster_replacement_sequentially_dead_nodes_multi_dc[3-20] 
FAILED lwt_random_test.py::TestRandomPaxos::test_topology_add_decommission_reboot[15-20] 
FAILED cluster_replacement_test.py::TestClusterReplacement::test_rolling_cluster_replacement_sequentially_dead_nodes_remove_and_add_multi_dc[3-20] 
FAILED lwt_random_test.py::TestRandomPaxos::test_topology_add_decommission_reboot[4-20] 
FAILED data_distribution_balance_test.py::TestDataDistribution::test_data_distribution_balance[LeveledCompactionStrategy-4-20] 
FAILED compaction_additional_test.py::TestCompactionAdditional::test_double_compaction_by_cleanup_and_ongoing_compaction[17-20] 
FAILED deletion_test.py::TestRangeDeletion::test_update_by_1ck_range[LeveledCompactionStrategy-8-20] 
FAILED deletion_test.py::TestRangeDeletion::test_update_by_1ck_range[LeveledCompactionStrategy-8-20] 
FAILED cleanup_test.py::TestCleanup::test_cluster_cleanup_no_resurrection[11-20] 
FAILED cleanup_test.py::TestCleanup::test_cluster_cleanup_no_resurrection[11-20] 
FAILED compaction_test.py::TestCompaction::test_compaction_delete_tombstone_gc[timeout-TimeWindowCompactionStrategy-20-20] 
FAILED repair_based_node_operations_test.py::TestRepairBasedNodeOperations::test_enable_rbno_for_bootstrap[3-16-20] 
FAILED compaction_additional_test.py::TestCompactionAdditional::test_refresh_and_restart_after_compaction_strategy_change[SizeTieredCompactionStrategy-TimeWindowCompactionStrategy-7-20] 
FAILED backup_restore_tests.py::TestBackupRestore::test_restore_snapshot_using_different_smp_setting[7-20] 
FAILED cdc_batch_test.py::TestCDCBatchesSimple::test_preimage_full_delta_batch[map_text_blob-15-20] 
FAILED compaction_additional_test.py::TestCompactionAdditional::test_compaction_delete_with_smp_change[19-20]

Unfortunately, for some reason there are no detailed logs about the cause of the errors. cc: @fruch

sylwiaszunejko commented 5 months ago

I will try to reproduce these failures locally

Lorak-mmk commented 5 months ago

I'm wondering if those are issues caused by driver or just flaky tests / Scylla issues etc. Do we have a list of failing test with released driver?

avelanarius commented 5 months ago

I'm wondering if those are issues caused by driver or just flaky tests / Scylla issues etc. Do we have a list of failing test with released driver?

I'd guess flaky tests :(, but test_topology_add_decommission_reboot seems concerning.

fruch commented 5 months ago

I'm wondering if those are issues caused by driver or just flaky tests / Scylla issues etc. Do we have a list of failing test with released driver?

Running all tests with 20 repeats each takes longer than the default 4h timeout for the whole job, so some tests were killed midway and no results were published.

I would recommend just running the gating multiple times, that would be enough.

fruch commented 5 months ago

@sylwiaszunejko @kbr-scylla @avelanarius

I gave it one more run (without repeating tests)

https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/203/

roydahan commented 5 months ago

@sylwiaszunejko @kbr-scylla @avelanarius

I gave it one more run (without repeating tests)

https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/203/

It passed successfully. Just to be on the safe side, I started another run with 3 repeats and extended timeout of 8 hours here: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/204/

roydahan commented 5 months ago

Just to be on the safe side, I started another run with 3 repeats and extended timeout of 8 hours here: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/204/

This one also passed successfully. I suggest we merge this.

bhalevy commented 5 months ago

This causes frequent CI failures; let's merge it ASAP.

Lorak-mmk commented 5 months ago

A new driver version (3.26.8) with this change was released (thanks @sylwiaszunejko) and is available on PyPI.

scylladb / python-driver

Remove endpoint to host_id mapping when removing host by host_id #308