Per-collection metrics for Prometheus

generall commented 11 months ago

Is your feature request related to a problem? Please describe.

Currently, all metrics in /metrics are global, meaning that it’s impossible to see differences per collection.

In addition to that, all our metrics should have per-collection granularity to allow better aggregation in Prometheus, including:

point/vector counts
REST/gRPC requests

Describe the solution you'd like

Example:

grpc_responses_min_duration_seconds{endpoint="/qdrant.Points/Upsert"} 0.000046
grpc_responses_min_duration_seconds{endpoint="/qdrant.Points/Upsert",collection="my-collection"} 0.000049
grpc_responses_min_duration_seconds{endpoint="/qdrant.Points/Upsert",collection="my-collection-2"} 0.000046

Describe alternatives you've considered

Create dedicated endpoint for each collection /collections/my-collecton/metrics but feedback from DevOps on this idea was negative.

Additional context

It might be beneficial to allow users to disable per-collection output. It is especially relevant if there are a lot of collections and metric response could become huge. But this is a nice-to-have requirement.

Note for contributors: Please consider this as tracking issue. If you think that it would be beneficial to split the task into multiple smaller PRs, please you are welcome to do so. Bounty will be rewarded for each PR independently

generall commented 11 months ago

/bounty $150

algora-pbc[bot] commented 11 months ago

💎 $150 bounty • Qdrant

Steps to solve:

Start working: Comment /attempt #3322 with your implementation plan
Submit work: Create a pull request including /claim #3322 in the PR body to claim the bounty
Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to qdrant/qdrant!

Add a bounty • Share on socials

Attempt	Started (GMT+0)	Solution
🟢 @guhitb	Jan 5, 2024, 1:25:03 AM	WIP
🔴 @gabrieldemian	Feb 1, 2024, 1:30:46 AM	WIP
🟢 @Shylock-Hg	Mar 23, 2024, 3:54:53 AM	WIP
🟢 @xhjkl	Jun 12, 2024, 1:49:11 PM	#4455

guhitb commented 11 months ago

/attempt #3322

Options

Cancel my attempt

xzfc commented 10 months ago

Currently, the /metrics collection is strongly coupled with the /telemetry collection. The /metrics endpoint is just a filtered-out telemetry metrics. (filtering happens in MetricsProvider::add_metrics trait method). Does the /telemetry endpoint need to be updated as well?

I see multiple options to implement this.

Decouple metrics and telemetry collection. This would be more performant since we can filter out unwanted metrics right away instead of collecting all of the possible data and then filtering it out on each request.
Extend telemetry collection logic and also store collection name. However, to keep REST API stable we need to aggregate it back during the GET /telemetry request.

generall commented 10 months ago

I the original design, the idea was that /metrics is just a different format for /telemetry. So I think extending telemetry is fine. One thing to be careful about - anonymization of the collection names in the telemetry. I think anonymized stats should be aggregated across all collections always

gabrieldemian commented 10 months ago

/attempt #3322

What metrics does make sense to add for single collections?

Algora profile	Completed bounties	Tech	Active attempts	Options
@gabrieldemian	1 Qdrant bounty + 0 bounties from 0 projects	Rust, TypeScript, JavaScript	qdrant/qdrant#3331	Cancel attempt

guhitb commented 10 months ago

@gabrieldemian I made some progress on my attempt, may be useful to you. Don't think I'll be able to figure the rest out. https://github.com/qdrant/qdrant/compare/master...guhitb:qdrant:master

trueleo commented 10 months ago

I was actually looking at this yesterday and wanted to give this a try. I think I can get this working based on @guhitb approach. master...trueleo:qdrant:per-collection

Screenshot_20240201_174732

I will not try to attempt this yet because @gabrieldemian is attempting. There are few problems I have noticed with mine and @guhitb approach.

Merging nested map HashMap<String, HashMap<HttpStatusCode, Arc<Mutex<OperationDurationsAggregator>>>> to HashMap<WebApiTelemetryKey, Arc<Mutex<OperationDurationsAggregator>>> leads to massive simplification of existing code and makes adding collection as optional value much easier.

pub struct WebApiTelemetryKey {
    pub request_key: String,
    pub status_code: u16,
    pub collection: Option<String>,
}

But the main issue with this is that Serde cannot serialize HashMap<Struct, _>. This can be workaround using https://crates.io/crates/serde_with.

This will however change the schema of the TelemetryData JSON entirely. Which may affect telemetry reporter

pub struct TelemetryReporter {
    telemetry_url: String,
    telemetry: Arc<Mutex<TelemetryCollector>>,
}

impl TelemetryReporter {
    fn new(telemetry: Arc<Mutex<TelemetryCollector>>) -> Self {
        let telemetry_url = if cfg!(debug_assertions) {
            "https://staging-telemetry.qdrant.io".to_string()
        } else {
            "https://telemetry.qdrant.io".to_string()
        };

        Self {
            telemetry_url,
            telemetry,
        }
    }

If the telemetry service has JSON schema requirements, then probably it is be better to figure out what the response of telemetry data should look like if we want to attach labels ( method, endpoint, status, collection ) to them. Otherwise three layer of HashMaps to get this implemented will not look good :p

@generall

Shylock-Hg commented 8 months ago

/attempt

Algora profile	Completed bounties	Tech	Active attempts	Options
@Shylock-Hg	1 Qdrant bounty + 4 bounties from 2 projects	C++, C, Shell & more		Cancel attempt

varshith257 commented 6 months ago

@generall Is this issue ready to work on?

xhjkl commented 5 months ago

/attempt #3322

Options

Cancel my attempt

qdrant / qdrant

Per-collection metrics for Prometheus #3322

💎 $150 bounty • Qdrant

Steps to solve: