qdrant / qdrant-helm

Apache License 2.0

Consensus fails when enabling p2p tls #222

Open NathanSavageKaimai opened 1 month ago

NathanSavageKaimai commented 1 month ago

Problem

Currently, if config.cluster.p2p.enable_tls is set to true in values.yaml, the pods in the cluster fail to communicate with each other, so consensus cannot be established.

In my logs I have the following:

Starting initializing for pod 0
           _                 _    
  __ _  __| |_ __ __ _ _ __ | |_  
 / _` |/ _` | '__/ _` | '_ \| __| 
| (_| | (_| | | | (_| | | | | |_  
 \__, |\__,_|_|  \__,_|_| |_|\__| 
    |_|                           

Version: 1.10.0, build: 851f03bb
Access web UI at http://localhost:6333/dashboard

2024-08-13T11:16:44.533771Z  INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json    
2024-08-13T11:16:44.686031Z  INFO qdrant: Telemetry reporting enabled, id: 764cb1fe-6733-402a-9d5b-05199641d557    
2024-08-13T11:16:44.696863Z  INFO qdrant::actix: TLS disabled for REST API    
2024-08-13T11:16:44.699864Z  INFO qdrant::actix: Qdrant HTTP listening on 6333    
2024-08-13T11:16:44.704436Z  INFO actix_server::builder: Starting 11 workers
2024-08-13T11:16:44.704480Z  INFO actix_server::server: Actix runtime found; starting in Actix runtime
2024-08-13T11:16:44.711245Z  INFO qdrant::tonic: Qdrant gRPC listening on 6334    
2024-08-13T11:16:44.711306Z  INFO qdrant::tonic: TLS disabled for gRPC API    
2024-08-13T11:16:44.712137Z  INFO qdrant::tonic: TLS enabled for internal gRPC API (TTL not supported)    
2024-08-13T11:16:45.080494Z ERROR qdrant::common::health: GetConsensusCommit request failed: Error in closure supplied to transport channel pool: status: Unavailable, message: "Failed to connect to http://fido-qdrant-2.fido-qdrant-headless:6335/, error: transport error", details: [], metadata: MetadataMap { headers: {} }    
2024-08-13T11:16:45.083961Z ERROR qdrant::common::health: GetConsensusCommit request failed: Error in closure supplied to transport channel pool: status: Unavailable, message: "Failed to connect to http://fido-qdrant-1.fido-qdrant-headless:6335/, error: transport error", details: [], metadata: MetadataMap { headers: {} }    
2024-08-13T11:16:47.198819Z  WARN storage::content_manager::consensus_manager: Failed to send message to http://fido-qdrant-2.fido-qdrant-headless:6335/ with error: Error in closure supplied to transport channel pool: status: Unavailable, message: "Failed to connect to http://fido-qdrant-2.fido-qdrant-headless:6335/, error: transport error", details: [], metadata: MetadataMap { headers: {} }    
2024-08-13T11:16:47.201503Z  WARN storage::content_manager::consensus_manager: Failed to send message to http://fido-qdrant-1.fido-qdrant-headless:6335/ with error: Error in closure supplied to transport channel pool: status: Unavailable, message: "Failed to connect to http://fido-qdrant-1.fido-qdrant-headless:6335/, error: transport error", details: [], metadata: MetadataMap { headers: {} }

Possible cause

The pods are trying to connect to each other over plain http, as the http:// peer URIs in the log show. With TLS enabled for the internal gRPC API, they should be connecting via https.

The bootstrap address and URI are set by startup flags in the ConfigMap template. They always use http, regardless of whether internal TLS is enabled.

Potential Solution

This could be fixed by modifying the template to check whether internal TLS is enabled and to use https instead of http when it is.
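Since Helm templates are Go text/template, the proposed conditional can be tried out in plain Go. This is a minimal sketch, not the actual PR: the template fragment is hypothetical, the values map only mirrors the config.cluster.p2p.enable_tls path from values.yaml, and Fullname is passed in directly where a real chart would use its own fullname helper.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// schemeTmpl is a hypothetical template fragment: it picks the scheme from
// .Values.config.cluster.p2p.enable_tls instead of hard-coding http.
const schemeTmpl = `{{ if .Values.config.cluster.p2p.enable_tls }}https{{ else }}http{{ end }}://{{ .Fullname }}-0.{{ .Fullname }}-headless:6335`

// render executes the fragment against nested maps shaped like Helm's .Values.
func render(enableTLS bool) (string, error) {
	tmpl, err := template.New("uri").Parse(schemeTmpl)
	if err != nil {
		return "", err
	}
	data := map[string]any{
		"Fullname": "fido-qdrant",
		"Values": map[string]any{
			"config": map[string]any{
				"cluster": map[string]any{
					"p2p": map[string]any{"enable_tls": enableTLS},
				},
			},
		},
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	for _, tls := range []bool{false, true} {
		uri, err := render(tls)
		if err != nil {
			panic(err)
		}
		fmt.Printf("enable_tls=%v -> %s\n", tls, uri)
	}
}
```

The same if/else would need to wrap every place the template writes a peer address, so that the bootstrap and URI flags stay consistent.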

I will test whether this fixes the issue shortly and open a PR.

kinoute commented 3 weeks ago

I can confirm that the proposed PR works. It would be nice to have it in the official Helm release!