Michael-JB opened 1 week ago
According to your `GET /collections/my_collection/cluster` response, it is not replicated.

If you changed the replication factor after the collection was already created, you need to follow this: https://qdrant.tech/documentation/guides/distributed_deployment/#creating-new-shard-replicas
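For reference, the guide drives this through the collection cluster endpoint; a minimal sketch (the shard and peer IDs are placeholders -- look up the real ones via `GET /cluster` and `GET /collections/my_collection/cluster`):

```bash
# Create a new replica of shard 0 on peer 222, copying from peer 111.
# The IDs here are placeholders; take the real ones from GET /cluster
# and GET /collections/my_collection/cluster.
curl -X POST 'http://localhost:6333/collections/my_collection/cluster' \
  -H 'Content-Type: application/json' \
  -d '{
    "replicate_shard": {
      "shard_id": 0,
      "from_peer_id": 111,
      "to_peer_id": 222
    }
  }'
```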
Hi @generall, thank you for the quick reply.
To clarify, I am not changing the replication factor after creating the collection. This configuration is set on a fresh installation (no existing collections). I then create a collection, add points, and observe these responses. I configure the `replication_factor` via the helm values, if that's of any significance.

If the shards are replicated, would I expect to see all shards under the `local_shards` key in the response to `GET /collections/my_collection/cluster`?
> If the shards are replicated, would I expect to see all shards under the `local_shards` key in the response to `GET /collections/my_collection/cluster`?

yes
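For context, a healthy two-node response from that endpoint looks roughly like the following (an illustrative sketch with made-up peer and shard IDs; shards held by the queried node appear under `local_shards`, replicas held by other nodes under `remote_shards`):

```json
{
  "result": {
    "peer_id": 111,
    "shard_count": 2,
    "local_shards": [
      { "shard_id": 0, "points_count": 100, "state": "Active" },
      { "shard_id": 1, "points_count": 100, "state": "Active" }
    ],
    "remote_shards": [
      { "shard_id": 0, "peer_id": 222, "state": "Active" },
      { "shard_id": 1, "peer_id": 222, "state": "Active" }
    ],
    "shard_transfers": []
  },
  "status": "ok",
  "time": 0.000057
}
```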
> I configure the `replication_factor` via the helm values, if that's of any significance.

I am not sure if the helm chart `replicaCount` is the same thing as the replication factor in the collection.
@generall To be more precise, I set `config.storage.collection.replication_factor` in the helm chart, which overrides this value in the Qdrant config (via `production.yaml`). I then rely on this to set the `replication_factor` when creating the collection, rather than explicitly specifying `replication_factor` in the collection create request. This seems to work, as the `replication_factor` I specify propagates to the config returned by `GET /collections/my_collection`, but perhaps this value is misleading. Could it be the case that setting `replication_factor` via config behaves differently to setting it in the create request body?
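For clarity, by "setting it in the create request body" I mean something like the following (the host/port are just examples from my setup):

```bash
# Create a collection with replication_factor set explicitly in the request
# body, rather than inherited from the storage.collection config defaults.
curl -X PUT 'http://localhost:8002/collections/my_collection' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": { "size": 128, "distance": "Cosine" },
    "shard_number": 2,
    "replication_factor": 2
  }'
```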
Hi @generall, I created a test for this in a local sandbox and can confirm that my above suspicions hold -- this looks like a bug. TL;DR:

If you configure `replication_factor` via the global config, new collections pick up this configuration but do not actually create shard replicas. If you explicitly specify `replication_factor` in the collection create request, everything works as expected.
Here is a full repro:
1. Install `kind` and create a new cluster: `kind create cluster`
2. Add the Qdrant helm repo: `helm repo add qdrant https://qdrant.github.io/qdrant-helm/ && helm repo update`
3. Create a `values.yaml` file containing:

   ```yaml
   # Create cluster with 2 Qdrant nodes
   replicaCount: 2
   config:
     cluster:
       enabled: true
     # Set default collection replication factor to 2
     storage:
       collection:
         replication_factor: 2
   ```

4. Install the chart: `helm install qdrant qdrant/qdrant -f values.yaml`
5. Port-forward the Qdrant service to `localhost:8002`: `kubectl port-forward service/qdrant 8002:6333`

Then run the following Python script:
```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(host="localhost", port=8002)

collections = ["collection_config_implicit", "collection_config_explicit"]

print("Creating collections...")
client.create_collection(
    collection_name=collections[0],  # replication config implicit
    vectors_config=VectorParams(size=128, distance=Distance.COSINE),
    shard_number=2,
)
client.create_collection(
    collection_name=collections[1],  # replication config explicit
    vectors_config=VectorParams(size=128, distance=Distance.COSINE),
    shard_number=2,
    replication_factor=2,
)

print("Uploading vectors...")
for collection_name in collections:
    vectors = np.random.rand(100, 128)
    client.upsert(
        collection_name,
        points=[
            PointStruct(
                id=idx,
                vector=vector.tolist(),
            )
            for idx, vector in enumerate(vectors)
        ],
    )

print("Running tests...")

# Both collections report the configured replication factor...
collection_replication_factors = [
    client.get_collection(collection_name).config.params.replication_factor
    for collection_name in collections
]
assert collection_replication_factors == [2, 2]

# ...but this assertion fails: collection_config_implicit has only one local shard
collection_cluster_infos = [
    client.http.cluster_api.collection_cluster_info(collection_name).result
    for collection_name in collections
]
local_shard_counts = [
    len(info.local_shards) if info else -1 for info in collection_cluster_infos
]
assert local_shard_counts == [2, 2], f"Expected [2, 2], got {local_shard_counts}"
```
Things to note:
1. If this is a bug for `replication_factor`, it may extend to other fields configured via global config, e.g., `write_consistency_factor`. I haven't tested this (a possible check is sketched after this list).
2. I used the helm chart in this repo as it's a convenient way to spin up a Qdrant cluster. The Qdrant chart sets `replication_factor` in the `production.yaml` config file. I expect this bug is internal to Qdrant, i.e., you'll still see this if you're not using the helm chart/k8s.
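For point 1, a minimal extension of the script above could check whether a globally-configured `write_consistency_factor` propagates the same way (an untested sketch; it assumes the field is exposed on the collection params like `replication_factor` is):

```python
# Untested sketch: check whether a write_consistency_factor set via the
# global config also shows up in the reported collection params.
write_consistency_factors = [
    client.get_collection(collection_name).config.params.write_consistency_factor
    for collection_name in collections
]
print(f"Reported write_consistency_factor: {write_consistency_factors}")
```

Verifying the behavioural side (whether writes actually honour it) would need a deeper test.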
Current Behavior
Hi! I'm setting up a 3-node Qdrant cluster in k8s using the Qdrant helm chart. My deployment is not behaving as I expect (or as documented).
The deployment

I've confirmed that these values are set correctly by querying the `/cluster` and `/collections/:collection_name/cluster` endpoints (see below). I see 12 shards as expected, and that these shards are evenly distributed amongst the nodes. I also see that the replication factor is set to 3 in the collection config, although I've found no way to confirm that these replicas actually exist via the API.

The problem
When I kill a node in the Qdrant cluster (kill a pod managed by the k8s sts), all search queries sent to the Qdrant service return a 500 until the node recovers. The other Qdrant nodes repeatedly log the following while the node is down:
It's as if the collection shards are not replicated across the cluster.
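For anyone reproducing the failure mode, the node kill can be simulated by deleting one of the StatefulSet pods (the pod name below assumes the chart's default naming):

```bash
# Simulate a temporary node failure; Kubernetes recreates the pod shortly after.
kubectl delete pod qdrant-0

# While the pod is down, search requests via the Qdrant service return 500s.
```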
Expected Behavior
Per the documentation here, I would expect all requests to work while the node is temporarily down.
As a side note, it would be nice to have an API to locate shard replicas.
Context (Environment)
My goal is an HA deployment such that Qdrant remains available (1) during upgrades, and (2) during temporary node failures.
Detailed Description
Qdrant version: v1.12.0

`GET /collections/my_collection/cluster`:

`GET /cluster`:

`GET /collections/my_collection`: