sentry-kubernetes / charts

Easily deploy Sentry on your Kubernetes Cluster
MIT License

On first install snuba-migrate job takes ages to complete #1495

Open maitredede opened 1 day ago

maitredede commented 1 day ago

Describe the bug (actual behavior)

When installing the sentry chart for the first time, the snuba-migrate job takes a very long time to complete.

Expected behavior

No response

values.yaml

Extract of my values.yaml:

hooks:
  enabled: true
  removeOnSuccess: false
  activeDeadlineSeconds: 3600

Helm chart version

From the develop branch (commit 1782966c):

version: 25.10.0
appVersion: 24.7.1

Steps to reproduce

helm install ... --values values.yaml --wait --timeout=120m

Screenshots

No response

Logs

A small extract of the logs; the timestamps show the "create table" query took about 13s, and on first install there are lots of queries with this kind of delay:

2024-09-30 16:35:34,481 Query: CREATE TABLE IF NOT EXISTS generic_metric_sets_meta_local (org_id UInt64, project_id UInt64, use_case_id LowCardinality(String), metric_id UInt64, tag_key UInt64, timestamp DateTime CODEC (DoubleDelta), retention_days UInt16, count AggregateFunction(sum, Float64)) ENGINE ReplicatedAggregatingMergeTree('/clickhouse/tables/generic_metrics_sets/{shard}/default/generic_metric_sets_meta_local', '{replica}') PRIMARY KEY (org_id, project_id, use_case_id, metric_id, tag_key, timestamp) ORDER BY (org_id, project_id, use_case_id, metric_id, tag_key, timestamp) PARTITION BY toMonday(timestamp) TTL timestamp + toIntervalDay(retention_days) SETTINGS index_granularity=8192, ttl_only_drop_parts=0;
2024-09-30 16:35:34,482 Block "" send time: 0.000054
2024-09-30 16:35:47,092 Query: select count(*) from system.mutations where is_done=0
2024-09-30 16:35:47,092 Block "" send time: 0.000056

Additional context

Storage disks are slow (Ceph RBD). I have moved some components to faster direct LVM disks (using TopoLVM): the clickhouse StatefulSet so far, but I don't know which other volumes I should migrate (zookeeper-clickhouse?) or which components I should scale up (at least for high availability).

maitredede commented 1 day ago

Hello, I have made some improvements to the long installation times:

Enabling parallelism in the Kafka provisioning job made it way faster (I also had to increase the resource limits, since the job was getting OOMKilled):

kafka:
  provisioning:
    # replicationFactor: 3
    parallel: 6
    resources:
      requests:
        cpu: "100m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi"

I switched zookeeper-clickhouse to a faster storage class: snuba-migrate became way faster (less than 2 minutes, down from almost 1 hour).
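For reference, a switch like this can be expressed in values.yaml roughly as below. The key path and the storage class name are assumptions based on the bitnami zookeeper subchart bundled with the sentry chart; verify the actual paths with `helm show values` for your chart version:

```yaml
# Hypothetical sketch: point the ZooKeeper volumes used by ClickHouse
# at a faster (local LVM) storage class. Key path assumed from the
# bitnami zookeeper subchart; "topolvm-fast" is a placeholder class name.
zookeeper:
  persistence:
    storageClass: topolvm-fast
```

Note that the storage class of an existing PVC is immutable, so changing it for a StatefulSet usually means recreating the volumes; this is easiest to apply on a fresh install.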

The db-init job is still slow, since it operates on PostgreSQL.

Since I would like to achieve high availability, I can easily scale the kafka, zookeeper, and clickhouse instances, but what about the other components? For HA I at least have to keep the volumes on a replicated (slow) storage class (Ceph RBD) for postgresql, which contributes to the performance bottleneck...
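As a sketch, the scaling part could look roughly like this in values.yaml (replica key names are assumptions based on the bundled subcharts and may differ between chart versions; check `helm show values` for the actual paths):

```yaml
# Hypothetical HA sketch: run 3 replicas of each stateful backend.
# Key paths are assumptions and may differ between chart versions
# (e.g. newer bitnami kafka charts use controller.replicaCount).
kafka:
  replicaCount: 3
zookeeper:
  replicaCount: 3
clickhouse:
  clickhouse:
    replicas: "3"
```

PostgreSQL HA is a different story: the bundled single-instance chart would likely need to be replaced by an externally managed HA PostgreSQL (the chart exposes an `externalPostgresql` section for this), which is also where the replicated-storage performance cost could be moved out of the cluster.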