uptrace / helm-charts

Uptrace Helm Chart for Kubernetes
https://uptrace.dev/get/install.html
Apache License 2.0

Reloading uptrace UI while using multiple pods produces different results #14

Closed alexandrujieanu closed 1 year ago

alexandrujieanu commented 1 year ago

Hello,

Do you have any idea why this is happening?

Screencast from 2023-05-10 14-04-58.webm

I'm accessing the service and landing on different pods.

kubectl -n uptrace get all

NAME                       READY   STATUS    RESTARTS   AGE
pod/uptrace-0              1/1     Running   0          24h
pod/uptrace-1              1/1     Running   0          24h
pod/uptrace-2              1/1     Running   0          24h
pod/uptrace-postgresql-0   1/1     Running   0          24h

NAME                         TYPE           CLUSTER-IP       EXTERNAL-IP                                                                   PORT(S)                         AGE
service/uptrace              LoadBalancer   10.100.128.181   k8s-uptrace-uptrace-a10XXXX.elb.REGION.amazonaws.com   443:32203/TCP,14317:32604/TCP   24h
service/uptrace-postgresql   LoadBalancer   10.100.247.48    a8e5ff192b5fYYYYY.REGION.elb.amazonaws.com         5432:32391/TCP                  24h

NAME                                  READY   AGE
statefulset.apps/uptrace              3/3     24h
statefulset.apps/uptrace-postgresql   1/1     24h

Uptrace values.yaml:

service:
  type: LoadBalancer
  http_port: 443
  grpc_port: 14317
  loadBalancerSourceRanges:
  %{~ for subnet in allowed_subnets ~}
    - ${ subnet }
  %{~ endfor ~}
  annotations: {}

uptrace:
  config:
    ch:
      addr: clickhouse-headless.clickhouse.svc:9000
      user: ${clickhouse_username}
      password: ${clickhouse_password}
      database: uptrace
      max_execution_time: 30s

    ch_schema:
      compression: ZSTD(3)
      replicated: false
      spans:
        storage_policy: 'default'
        ttl_delete: 14 DAY
      metrics:
        storage_policy: 'default'
        ttl_delete: 14 DAY
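
For reference, the %{~ for ~} and ${ } markers in the values above are Terraform template directives, so this file is presumably rendered with something like templatefile() before being handed to Helm. With a hypothetical allowed_subnets = ["10.0.0.0/16", "10.1.0.0/16"], the service block would render to roughly:

service:
  type: LoadBalancer
  http_port: 443
  grpc_port: 14317
  loadBalancerSourceRanges:
    - 10.0.0.0/16
    - 10.1.0.0/16
  annotations: {}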

Clickhouse:

kubectl -n clickhouse get pods

NAME                     READY   STATUS    RESTARTS   AGE
clickhouse-shard0-0      1/1     Running   0          23h
clickhouse-shard0-1      1/1     Running   0          23h
clickhouse-shard0-2      1/1     Running   0          23h
clickhouse-zookeeper-0   1/1     Running   0          23h
clickhouse-zookeeper-1   1/1     Running   0          23h
clickhouse-zookeeper-2   1/1     Running   0          23h

Thank you.

vmihailenco commented 1 year ago

Are you using any proxies / load balancers in front of ClickHouse?

Also please show:

select * from system.replicas format Vertical;

show tables;
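
One way to run both queries from inside the cluster might be something like the following (the pod name and credentials are placeholders for this particular setup):

# runs the two requested queries against one of the ClickHouse replicas
kubectl -n clickhouse exec -it clickhouse-shard0-0 -- \
  clickhouse-client --user default --password 'REDACTED' --multiquery \
  --query "SELECT * FROM system.replicas FORMAT Vertical; SHOW TABLES FROM uptrace;"
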
vmihailenco commented 1 year ago

Also, it looks like you have replicated: false and cluster: uptrace1 is missing completely. Is that a mistake?

alexandrujieanu commented 1 year ago

Uptrace connects to Clickhouse via the kubernetes service:

addr: clickhouse-headless.clickhouse.svc:9000

A request from pod/uptrace-0 could go to any of clickhouse-shard0-{0,1,2}, but at the ClickHouse level the data should be replicated, so I don't see a problem here. Do you?
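
One quick way to confirm that the headless service indeed resolves to all three replicas is a throwaway DNS lookup, for example (busybox is used here only because it ships nslookup; any image with DNS tools would do):

# should return one A record per clickhouse-shard0-* pod
kubectl -n clickhouse run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup clickhouse-headless.clickhouse.svc.cluster.local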

I am only able to start Uptrace with replicated: false and without cluster:. Are they mandatory for such a setup?


SELECT *
FROM system.clusters
LIMIT 3
FORMAT Vertical

Query id: 5e099aa3-2730-408b-9137-511d98ab503f

Row 1:
──────
cluster:                 default
shard_num:               1
shard_weight:            1
replica_num:             1
host_name:               clickhouse-shard0-0.clickhouse-headless.clickhouse.svc.cluster.local
host_address:            172.23.4.71
port:                    9000
is_local:                1
user:                    default
default_database:        
errors_count:            0
slowdowns_count:         0
estimated_recovery_time: 0

Row 2:
──────
cluster:                 default
shard_num:               1
shard_weight:            1
replica_num:             2
host_name:               clickhouse-shard0-1.clickhouse-headless.clickhouse.svc.cluster.local
host_address:            172.23.0.167
port:                    9000
is_local:                0
user:                    default
default_database:        
errors_count:            0
slowdowns_count:         0
estimated_recovery_time: 0

Row 3:
──────
cluster:                 default
shard_num:               1
shard_weight:            1
replica_num:             3
host_name:               clickhouse-shard0-2.clickhouse-headless.clickhouse.svc.cluster.local
host_address:            172.23.2.133
port:                    9000
is_local:                0
user:                    default
default_database:        
errors_count:            0
slowdowns_count:         0
estimated_recovery_time: 0

3 rows in set. Elapsed: 0.002 sec. 

SELECT *
FROM system.replicas
FORMAT Vertical

Query id: 8f3d17a6-a7f1-46cf-9d83-158f15b8207c

Ok.

0 rows in set. Elapsed: 0.004 sec. 

SHOW TABLES

Query id: 9083038f-c60d-4523-b2d3-f9bcea6e3fc2

Ok.

0 rows in set. Elapsed: 0.002 sec. 
vmihailenco commented 1 year ago

but at Clickhouse level the data should be replicated so I don't see a problem here. Do you?

It will be replicated if you use replicated: true. Then you will see the replicated tables in the system.replicas view.
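
Once replicated: true is in place, a quick sanity check could be a query along these lines (the database name assumes the default uptrace database):

-- each Replicated* table should appear here, with all replicas active
SELECT database, table, is_leader, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'uptrace';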

I am only able to start Uptrace with replicated: false and without cluster:. Are they mandatory for such a setup?

Yes.
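
For a ClickHouse setup like the one above, the ch_schema block would then look roughly like this. This is only a sketch: it assumes cluster sits alongside replicated under ch_schema (as in the chart's default uptrace.yml) and that the value matches the cluster name reported by system.clusters:

uptrace:
  config:
    ch_schema:
      compression: ZSTD(3)
      replicated: true
      # must match the cluster name shown in system.clusters ("default" here)
      cluster: default
      spans:
        storage_policy: 'default'
        ttl_delete: 14 DAY
      metrics:
        storage_policy: 'default'
        ttl_delete: 14 DAY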

alexandrujieanu commented 1 year ago

Okay, then I'm focusing on getting Uptrace working with those two flags.

I believe you can close this issue or #12.

alexandrujieanu commented 1 year ago

Hello,

replicated: true and cluster: default seem to have fixed the issue, combined with creating the ClickHouse database before Uptrace starts. I am using the Bitnami ClickHouse chart, which doesn't accept CLICKHOUSE_DB, so I used an initialization script instead.
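
For reference, such an init script could be wired in through the Bitnami chart's values along these lines. The initdbScripts key and the admin-password environment variable are assumptions here; check the chart's values.yaml for the exact names:

# bitnami/clickhouse values (sketch)
initdbScripts:
  create_uptrace_db.sh: |
    #!/bin/bash
    # create the database Uptrace expects before it starts and runs migrations
    clickhouse-client --user default --password "$CLICKHOUSE_ADMIN_PASSWORD" \
      --query "CREATE DATABASE IF NOT EXISTS uptrace ON CLUSTER default"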

I have a few remarks I want to share:

  1. For some reason Uptrace creates the database itself when these two settings are not used, but doesn't when they are.
  2. Uptrace runs the migrations only on first start; then the pod crashes, the pod is recreated, and the migrations are skipped. In practice, I kept missing the first start and only saw logs like:
error   tracing/span_processor.go:231   ch.Insert failed    {"error": "DB::Exception: Table uptrace.spans_index_buffer_dist doesn't exist", "table": "spans_index"}
  3. I created a dedicated, restricted database user for Uptrace without knowing which grants are needed, because I didn't want to use the default ClickHouse admin user. It would be nice to have the required grants tested and documented.

After a lot of trial and error, I ended up with:

GRANT ALL PRIVILEGES ON ${uptrace_database}.* TO '${uptrace_username}'
GRANT CLUSTER, REMOTE, SOURCES ON *.* TO '${uptrace_username}'
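
Put together as one script (the user name and password are placeholders), the same setup might look like this; ON CLUSTER keeps the user and its grants identical on every replica:

-- run once with an admin user
CREATE USER IF NOT EXISTS uptrace ON CLUSTER default IDENTIFIED BY 'change-me';
GRANT ON CLUSTER default ALL PRIVILEGES ON uptrace.* TO uptrace;
GRANT ON CLUSTER default CLUSTER, REMOTE, SOURCES ON *.* TO uptrace;
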
vmihailenco commented 1 year ago

Thanks for the notes! I definitely will spend some time reflecting on them and making changes.

For some reason Uptrace itself creates the database when these two settings are not used and doesn't when they are.

Will fix.

then the pod crashes

Do you remember why? Is it the Uptrace pod or the ClickHouse pod?

It would be nice to have them tested and documented.

:+1:

alexandrujieanu commented 1 year ago

When I run helm, the pods come up, but since the migrations fail, Uptrace exits with an error, the container crashes, and the kubelet restarts it. The pods come back up, but this time the migrations are not run anymore; Uptrace runs, but with errors like "Insert failed".
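
If the pod has not been rescheduled since the crash, the output of that failed first start should still be retrievable from the previous container instance, e.g.:

# pod name is just an example from the listing above
kubectl -n uptrace logs uptrace-0 --previous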

alexandrujieanu commented 1 year ago

@vmihailenco I suspect Uptrace is handling the replication of its database within Clickhouse. Is this correct?

Context:

I have created a dedicated database and database user, and the default ClickHouse admin user is complaining that it can't access the tables created by the uptrace user. I can see this in the logs:

2023.05.19 07:14:18.572003 [ 507 ] {} uptrace.measure_minutes_buffer_dist.DirectoryMonitor.default: Code: 516. DB::Exception: Received from clickhouse-shard0-1.clickhouse-headless.clickhouse.svc.cluster.local:9000. DB::Exception: default: Authentication failed: password is incorrect, or there is no user with such name.

I have read that writes through Distributed tables are forwarded using the default ClickHouse admin user, and in my case it didn't have access to do so.
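
If I understand correctly, the credentials used for that forwarding come from the cluster definition itself (remote_servers); a sketch of the relevant fragment, with placeholder credentials, looks like this:

<remote_servers>
  <default>
    <shard>
      <internal_replication>true</internal_replication>
      <replica>
        <host>clickhouse-shard0-0.clickhouse-headless.clickhouse.svc.cluster.local</host>
        <port>9000</port>
        <!-- the user/password here are what Distributed-table forwarding authenticates with -->
        <user>default</user>
        <password>placeholder</password>
      </replica>
      <!-- ...one <replica> entry per node... -->
    </shard>
  </default>
</remote_servers>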

I have granted the access, and now I get the impression that both the default and uptrace users are trying to do the replication.

2023.05.19 09:36:07.274130 [ 354 ] {} uptrace.spans_index (1511c250-3707-45b8-8bd2-38e437f731ca) (Replicated OutputStream): Block with ID 20230519_3745753001925908684_4954532985886563788 already exists on other replicas as part 20230519_185_185_0; will write it locally with that name.
2023.05.19 09:36:07.274408 [ 354 ] {} uptrace.spans_index (1511c250-3707-45b8-8bd2-38e437f731ca) (Replicated OutputStream): Part 20230519_185_185_0 is duplicate and it is already written by concurrent request or fetched; ignoring it.
2023.05.19 09:36:14.668583 [ 354 ] {} uptrace.spans_data (bebddfa6-e6d8-4dd3-bb36-70a3fc1a9a9d) (Replicated OutputStream): Block with ID 20230519_6567306983901795181_13630949410134203585 already exists locally as part 20230519_184_184_0; ignoring it.
2023.05.19 09:36:14.799234 [ 350 ] {} uptrace.spans_index (1511c250-3707-45b8-8bd2-38e437f731ca) (Replicated OutputStream): Block with ID 20230519_2645648279967518409_18311755029048189739 already exists on other replicas as part 20230519_186_186_0; will write it locally with that name.
2023.05.19 09:36:14.807786 [ 357 ] {} uptrace.measure_minutes (aa12e815-849d-40bb-b7b1-6fb25f754657) (Replicated OutputStream): Block with ID 20230519_2487656886605356058_6987131945011303779 already exists on other replicas as part 20230519_951_951_0; will write it locally with that name.
2023.05.19 09:36:14.849173 [ 552 ] {} uptrace.spans_index (1511c250-3707-45b8-8bd2-38e437f731ca): auto DB::StorageReplicatedMergeTree::processQueueEntry(ReplicatedMergeTreeQueue::SelectedEntryPtr)::(anonymous class)::operator()(DB::StorageReplicatedMergeTree::LogEntryPtr &) const: Code: 235. DB::Exception: Part 20230519_186_186_0 (state Active) already exists. (DUPLICATE_DATA_PART), Stack trace (when copying this message, always include the lines below)

In this case, I am considering whether I should let Uptrace use the default user (which I initially didn't want to do), or what other options I have.

Thanks.

vmihailenco commented 1 year ago

No ideas here. So far I've only used ClickHouse users to limit query complexity, not to restrict access...

alexandrujieanu commented 1 year ago

FYI, the ClickHouse developers say it's client misbehaviour.

Error Code 235 DB::Exception DUPLICATE_DATA_PART

I am not sure. I haven't seen this while using the default user.

vmihailenco commented 1 year ago

I am not sure either. There are no indications that it was the go-clickhouse client, so I am inclined to agree with the "There is nothing to fix" sentiment :)

caique-franca commented 1 year ago

@alexandrujieanu, I'm having the same problem as you. I talked a little more about it on Telegram (https://t.me/uptrace/1419). Did you find a solution?

vmihailenco commented 1 year ago

I've pushed v1.5.5, which presumably fixes this.