redpanda-data / redpanda


CI Failure in ClusterConfigTest.test_restart #6095

Closed: rystsov closed 2 years ago

rystsov commented 2 years ago

https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd31-49ee-8db4-40fa5404a865

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigTest
Method: test_restart
====================================================================================================
test_id:    rptest.tests.cluster_config_test.ClusterConfigTest.test_restart
status:     FAIL
run time:   10.680 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 250, in test_restart
    self._check_restart_clears()
  File "/root/tests/rptest/tests/cluster_config_test.py", line 214, in _check_restart_clears
    assert n['restart'] is True
AssertionError

rystsov commented 2 years ago

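This looks like the same kind of problem as https://github.com/redpanda-data/redpanda/issues/6010

_wait_for_version_sync waits until one node (docker-rp-18 below) reports the new config version for all nodes, but the assertion then queries a different node (docker-rp-21), whose status view hasn't caught up yet and still shows node 3 at config_version 1 with restart False.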

[DEBUG - 2022-08-18 18:27:34,552 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-18:9644/v1/cluster_config/status
[DEBUG - 2022-08-18 18:27:34,554 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 2, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': False, 'config_version': 1, 'invalid': [], 'unknown': []}]
[DEBUG - 2022-08-18 18:27:35,057 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-18:9644/v1/cluster_config/status
[DEBUG - 2022-08-18 18:27:35,058 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 2, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}]
[DEBUG - 2022-08-18 18:27:35,058 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-21:9644/v1/cluster_config
[DEBUG - 2022-08-18 18:27:35,061 - admin - _request - lineno:326]: Response OK, JSON: {'enable_rack_awareness': False, 'metrics_reporter_tick_interval': 60000, 'storage_space_alert_free_threshold_bytes': 0, 'leader_balancer_idle_timeout': 120000, 'partition_autobalancing_mode': 'node_add', 'full_raft_configuration_recovery_pattern': [], 'kafka_qdc_min_depth': 1, 'kafka_qdc_idle_depth': 77, 'health_manager_tick_interval': 180000, 'kafka_qdc_max_latency_ms': 80, 'kafka_qdc_depth_alpha': 0.8, 'kafka_qdc_window_count': 12, 'partition_autobalancing_node_availability_timeout_sec': 900, 'superusers': ['admin'], 'cloud_storage_cache_check_interval': 30000, 'cloud_storage_upload_ctrl_d_coeff': 0.0, 'kafka_qdc_enable': False, 'cloud_storage_upload_ctrl_p_coeff': -2.0, 'cloud_storage_upload_ctrl_update_interval_ms': 60000, 'cloud_storage_max_connection_idle_time_ms': 5000, 'cloud_storage_trust_file': None, 'cloud_storage_api_endpoint_port': 443, 'cloud_storage_disable_tls': False, 'cloud_storage_upload_loop_initial_backoff_ms': 100, 'cloud_storage_reconciliation_interval_ms': 1000, 'cloud_storage_credentials_source': 'config_file', 'cloud_storage_api_endpoint': None, 'cloud_storage_secret_key': None, 'storage_min_free_bytes': 5368709120, 'storage_space_alert_free_threshold_percent': 5, 'cloud_storage_enable_remote_write': False, 'kafka_qdc_window_size_ms': 1500, 'cloud_storage_enable_remote_read': False, 'kafka_rpc_server_stream_recv_buf': None, 'kafka_connections_max_per_ip': None, 'members_backend_retry_ms': 5000, 'compaction_ctrl_max_shares': 1000, 'compaction_ctrl_min_shares': 10, 'compaction_ctrl_d_coeff': 0.2, 'compaction_ctrl_i_coeff': 0.0, 'compaction_ctrl_update_interval_ms': 30000, 'node_management_operation_timeout_ms': 5000, 'leader_balancer_transfer_limit_per_shard': 512, 'kafka_qdc_depth_update_ms': 7000, 'compaction_ctrl_p_coeff': -12.5, 'kafka_mtls_principal_mapping_rules': None, 'kafka_enable_authorization': None, 'leader_balancer_mute_timeout': 300000, 
'id_allocator_batch_size': 1000, 'id_allocator_log_capacity': 100, 'storage_compaction_index_memory': 134217728, 'storage_max_concurrent_replay': 1024, 'cloud_storage_cache_size': 21474836480, 'kafka_rpc_server_tcp_send_buf': None, 'segment_fallocation_step': 33554432, 'partition_autobalancing_tick_interval_ms': 30000, 'cloud_storage_manifest_upload_timeout_ms': 10000, 'storage_read_readahead_count': 10, 'retention_bytes': None, 'join_retry_timeout_ms': 200, 'raft_io_timeout_ms': 10000, 'kafka_max_bytes_per_fetch': 67108864, 'reclaim_stable_window': 10000, 'enable_pid_file': True, 'kafka_group_recovery_timeout_ms': 30000, 'reclaim_growth_window': 3000, 'fetch_session_eviction_timeout_ms': 60000, 'raft_transfer_leader_recovery_timeout_ms': 10000, 'reclaim_max_size': 4194304, 'raft_smp_max_non_local_requests': None, 'raft_timeout_now_timeout_ms': 1000, 'recovery_append_timeout_ms': 5000, 'internal_topic_replication_factor': 3, 'replicate_append_timeout_ms': 3000, 'log_segment_size': 1073741824, 'disable_batch_cache': False, 'election_timeout_ms': 1500, 'wait_for_leader_timeout_ms': 5000, 'kafka_qdc_max_depth': 100, 'cluster_id': 'placeholder', 'transaction_coordinator_delete_retention_ms': 604800000, 'storage_read_buffer_size': 131072, 'metadata_status_wait_timeout_ms': 2000, 'transaction_coordinator_cleanup_policy': 'delete', 'max_kafka_throttle_delay_ms': 60000, 'enable_idempotence': True, 'id_allocator_replication': 1, 'kafka_connections_max': None, 'abort_timed_out_transactions_interval_ms': 10000, 'cloud_storage_upload_ctrl_min_shares': 100, 'default_topic_replications': 1, 'rpc_server_listen_backlog': None, 'group_topic_partitions': 16, 'kafka_connection_rate_limit': None, 'log_compaction_interval_ms': 10000, 'cloud_storage_bucket': None, 'enable_transactions': False, 'cloud_storage_access_key': None, 'tx_timeout_delay_ms': 1000, 'transactional_id_expiration_ms': 604800000, 'fetch_max_bytes': 57671680, 'enable_coproc': False, 'fetch_reads_debounce_timeout': 1, 
'log_compression_type': 'producer', 'log_message_timestamp_type': 'CreateTime', 'kafka_qdc_latency_alpha': 0.002, 'log_cleanup_policy': 'delete', 'reclaim_min_size': 131072, 'cloud_storage_max_connections': 20, 'storage_target_replay_bytes': 10737418240, 'raft_replicate_batch_window_size': 1048576, 'coproc_max_ingest_bytes': 655360, 'alter_topic_cfg_timeout_ms': 5000, 'quota_manager_gc_sec': 30000, 'rm_violation_recovery_policy': 'crash', 'abort_index_segment_size': 50000, 'seq_table_min_size': 1000, 'release_cache_on_segment_roll': False, 'kvstore_flush_interval': 10, 'partition_autobalancing_movement_batch_size_bytes': 5368709120, 'max_compacted_log_segment_size': 5368709120, 'rm_sync_timeout_ms': 10000, 'transaction_coordinator_log_segment_size': 1073741824, 'enable_leader_balancer': True, 'cloud_storage_initial_backoff_ms': 100, 'rpc_server_tcp_recv_buf': None, 'aggregate_metrics': False, 'tm_violation_recovery_policy': 'crash', 'metrics_reporter_url': 'https://m.rp.vectorized.io/v2', 'coproc_max_batch_size': 32768, 'metadata_dissemination_retry_delay_ms': 320, 'transaction_coordinator_replication': 1, 'kafka_connection_rate_limit_overrides': [], 'group_new_member_join_timeout': 30000, 'group_initial_rebalance_delay': 3000, 'group_max_session_timeout_ms': 300000, 'kafka_rpc_server_tcp_recv_buf': None, 'raft_learner_recovery_rate': 104857600, 'cloud_storage_upload_ctrl_max_shares': 1000, 'group_min_session_timeout_ms': 6000, 'zstd_decompress_workspace_bytes': 8388608, 'compaction_ctrl_backlog_size': None, 'append_chunk_size': 16384, 'auto_create_topics_enabled': False, 'disable_public_metrics': False, 'raft_recovery_default_read_size': 524288, 'rpc_server_tcp_send_buf': None, 'cloud_storage_enabled': False, 'delete_retention_ms': 604800000, 'topic_partitions_per_shard': 7000, 'raft_max_recovery_memory': None, 'raft_heartbeat_disconnect_failures': 3, 'cloud_storage_segment_max_upload_interval_sec': None, 'readers_cache_eviction_timeout_ms': 30000, 
'enable_metrics_reporter': False, 'health_monitor_max_metadata_age': 10000, 'cloud_storage_upload_loop_max_backoff_ms': 10000, 'raft_heartbeat_interval_ms': 150, 'controller_backend_housekeeping_interval_ms': 1000, 'tm_sync_timeout_ms': 10000, 'raft_max_concurrent_append_requests_per_follower': 16, 'metadata_dissemination_interval_ms': 3000, 'kafka_connections_max_overrides': [], 'enable_sasl': False, 'kvstore_max_segment_size': 16777216, 'cloud_storage_metadata_sync_timeout_ms': 10000, 'default_window_sec': 1000, 'cloud_storage_segment_upload_timeout_ms': 30000, 'admin_api_require_auth': False, 'default_num_windows': 10, 'topic_partitions_reserve_shard0': 2, 'raft_heartbeat_timeout_ms': 3000, 'metrics_reporter_report_interval': 86400000, 'topic_fds_per_partition': 5, 'default_topic_partitions': 4, 'cloud_storage_readreplica_manifest_sync_timeout_ms': 30000, 'segment_appender_flush_timeout_ms': 1000, 'topic_memory_per_partition': 1048576, 'coproc_max_inflight_bytes': 10485760, 'target_quota_byte_rate': 2147483648, 'create_topic_timeout_ms': 2000, 'metadata_dissemination_retries': 30, 'features_auto_enable': True, 'reclaim_batch_cache_min_free': 67108864, 'partition_autobalancing_max_disk_usage_percent': 80, 'disable_metrics': False, 'coproc_offset_flush_interval_ms': 300000, 'cloud_storage_region': None, 'compacted_log_segment_size': 268435456}
[DEBUG - 2022-08-18 18:27:35,061 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-21:9644/v1/cluster_config/status
[DEBUG - 2022-08-18 18:27:35,062 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 2, 'restart': True, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': False, 'config_version': 1, 'invalid': [], 'unknown': []}]
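
One possible way to close this race, as a rough sketch rather than the actual rptest fix: wait until every node's own view of /v1/cluster_config/status shows the new version, not just the view of whichever node happened to be polled. The names below (all_views_synced, wait_for_version_sync_everywhere, fetch_status) are hypothetical and not part of rptest; fetch_status stands in for whatever issues the GET against a specific host and returns the parsed JSON list seen in the logs above.

```python
import time
from typing import Callable, Dict, Iterable, List

# Shape of one /v1/cluster_config/status response, as seen in the logs.
Status = List[dict]


def all_views_synced(views: Dict[str, Status], version: int) -> bool:
    """True only if every queried host reports every node at >= version."""
    return bool(views) and all(
        entry["config_version"] >= version
        for status in views.values()
        for entry in status
    )


def wait_for_version_sync_everywhere(
    fetch_status: Callable[[str], Status],  # hypothetical: GET status from one host
    hosts: Iterable[str],
    version: int,
    timeout_s: float = 10.0,
    backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Status]:
    """Poll each host until all of their status views agree on `version`."""
    hosts = list(hosts)
    deadline = clock() + timeout_s
    while True:
        # Ask every host, so a later assertion can safely query any of them.
        views = {host: fetch_status(host) for host in hosts}
        if all_views_synced(views, version):
            return views
        if clock() >= deadline:
            raise TimeoutError(f"status views did not reach version {version}")
        sleep(backoff_s)
```

An alternative with the same effect would be to pin both the wait and the assertion to the same node, so the test never reads a view it did not wait on.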

rystsov commented 2 years ago

Another instance: https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd33-49ea-9d2e-56c46232acae