sentry-kubernetes / charts

Easily deploy Sentry on your Kubernetes Cluster

[Sentry 23.10.0] Failing snuba-metrics-consumer #1042

Closed: d-kononov closed this issue 6 months ago

d-kononov commented 1 year ago

Hey guys,

We are on the latest chart version and snuba-metrics-consumer is stuck in a CrashLoopBackOff state because of:

2023-10-18 08:16:16,746 Initializing Snuba...
2023-10-18 08:16:19,080 Snuba initialization took 2.3335538813844323s
2023-10-18 08:16:19,093 Consumer Starting
2023-10-18 08:16:19,093 Checking Clickhouse connections
2023-10-18 08:16:19,103 librdkafka log level: 6
2023-10-18 08:16:19,127 New partitions assigned: {Partition(topic=Topic(name='snuba-metrics'), index=0): 0}
2023-10-18 08:16:19,909 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 291, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 372, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 153, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 107, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 124, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 314, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 274, in join
    raise ClickhouseWriterError(message, code=code, row=row)
snuba.clickhouse.errors.ClickhouseWriterError: Method write is not supported by storage Distributed with more than one shard and no sharding key provided (version 21.8.13.6 (official build))
2023-10-18 08:16:19,911 Closing <arroyo.backends.kafka.consumer.KafkaConsumer object at 0x7f0be9383c10>...
2023-10-18 08:16:19,912 Partitions to revoke: [Partition(topic=Topic(name='snuba-metrics'), index=0)]
2023-10-18 08:16:19,912 Partition revocation complete.
2023-10-18 08:16:19,920 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 279, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 291, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 372, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 153, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 107, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 124, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 314, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 274, in join
    raise ClickhouseWriterError(message, code=code, row=row)
snuba.clickhouse.errors.ClickhouseWriterError: Method write is not supported by storage Distributed with more than one shard and no sharding key provided (version 21.8.13.6 (official build))

Spec:

      containers:
      - command:
        - snuba
        - consumer
        - --storage
        - metrics_raw
        - --consumer-group
        - snuba-metrics-consumers
        - --auto-offset-reset
        - earliest
        - --max-batch-time-ms
        - "750"
        env:
        - name: SNUBA_SETTINGS
          value: /etc/snuba/settings.py
        - name: DEFAULT_BROKERS
          value: sentry-kafka:9092
        envFrom:
        - secretRef:
            name: sentry-snuba-env
        image: 000718100845.dkr.ecr.us-east-1.amazonaws.com/sentry:getsentry-snuba-23.10.0
        imagePullPolicy: IfNotPresent
        name: sentry-snuba
        ports:
        - containerPort: 1218
          protocol: TCP
        resources: {}
        volumeMounts:
        - mountPath: /etc/snuba
          name: config
          readOnly: true

Not sure how to fix it.

Any help appreciated!

sohahm commented 1 year ago

I am having the same issue. I just deployed a fresh install of the latest Sentry chart (20.7.0) and I'm getting this.

adonskoy commented 1 year ago

I just received the same error on one of my installations. I'll look for the cause of the error and create an issue in the sentry/snuba repository

Orpere commented 12 months ago

We have the same issue; it is stopping us from upgrading the Sentry chart from version 20.5.5 to 20.8.1.
The logs on sentry-snuba-metric-consumer show:

    snuba.clickhouse.errors.ClickhouseWriterError: Method write is not supported by storage Distributed with more than one shard and no sharding key provided (version 21.8.13.6 (official build))

adonskoy commented 12 months ago

The Sentry team confirmed that this is a bug. See https://github.com/getsentry/snuba/issues/4897

zacharyarnaise commented 11 months ago

Hi,

A comment on getsentry/snuba suggests there might be a workaround for this issue: https://github.com/getsentry/snuba/issues/4931#issuecomment-1785674614

Could someone have a look and see if there's anything that can be done while we wait for the migration to be fixed?

Thanks!

acjohnson commented 11 months ago

I suspect everyone who deploys 23.10.0 with sentry-kubernetes will run into this, and it would be great to know how to deal with it rather than just leaving it this way:

sentry-snuba-metrics-consumer-6cbbdbfcc7-5ds5b                    0/1     ContainerStatusUnknown   641 (2d3h ago)   4d11h   172.27.156.14    ip-172-27-157-204.ec2.internal   <none>           <none>
sentry-snuba-metrics-consumer-6cbbdbfcc7-8fvs2                    0/1     ContainerStatusUnknown   256 (29h ago)    2d3h    172.27.143.83    ip-172-27-156-165.ec2.internal   <none>           <none>
sentry-snuba-metrics-consumer-6cbbdbfcc7-l8mt2                    0/1     CrashLoopBackOff         341 (5m1s ago)   29h     172.27.140.156   ip-172-27-149-66.ec2.internal    <none>           <none>

According to the (now closed) issue comment https://github.com/getsentry/snuba/issues/4897#issuecomment-1799801423 we need to manually convert the metrics_raw ClickHouse distributed table from v2 to v3. I'm not very familiar with ClickHouse but will give it a shot in the near future.
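In the meantime, a quick way to confirm whether your distributed table is affected (assuming the default database and table names used by the chart) is to check its definition for a sharding key:

    clickhouse-client
    SHOW CREATE TABLE default.metrics_raw_v2_dist;
    -- If the Distributed(...) clause lists only cluster, database and table
    -- (no fourth sharding-key argument), multi-shard writes fail with the
    -- ClickhouseWriterError shown above.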

brogger71 commented 11 months ago

Does anybody know the commands to convert the metrics_raw clickhouse table? I'm also not familiar with clickhouse and honestly, I do not even know where to start. Thanks!

ET-Torsten commented 11 months ago

Does anybody know the commands to convert the metrics_raw clickhouse table? I'm also not familiar with clickhouse and honestly, I do not even know where to start. Thanks!

Same for us, some advice would be greatly appreciated.

jon-walton commented 11 months ago

This is what I did on my cluster. The usual caveats apply:

  • I've never used clickhouse, this was the result of a couple of hours in the documentation
  • I don't know what I'm doing and you don't know me 😉
  • I didn't care about availability or data loss during this procedure (it's a new-ish cluster)

  1. Scale the snuba-metrics-consumer deployment to 0
  2. For each clickhouse instance, run clickhouse-client and execute:

    create or replace table metrics_raw_v2_dist as metrics_raw_v2_local ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)

  3. Scale the snuba-metrics-consumer deployment back up
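For anyone who wants the same procedure spelled out end to end, a rough sketch follows. Namespace, deployment and pod names are assumptions based on a default sentry release, and the cluster name sentry-clickhouse is taken from the chart's ClickHouse config shown further down in this thread:

    # stop the consumer so nothing writes while the table is recreated
    kubectl -n sentry scale deployment sentry-snuba-metrics-consumer --replicas=0

    # repeat for each ClickHouse pod (sentry-clickhouse-0, -1, -2, ...)
    kubectl -n sentry exec -it sentry-clickhouse-0 -- clickhouse-client --query "
      CREATE OR REPLACE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local
      ENGINE = Distributed('sentry-clickhouse', default, metrics_raw_v2_local, timeseries_id)"

    # bring the consumer back
    kubectl -n sentry scale deployment sentry-snuba-metrics-consumer --replicas=1

Note that, as later comments show, CREATE OR REPLACE TABLE only works when the default database uses the Atomic engine.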
brogger71 commented 11 months ago

Thank you @jon-walton, that seems to work!

z0rc commented 11 months ago

That didn't work for my installation:

sentry-clickhouse :) create or replace table metrics_raw_v2_dist as metrics_raw_v2_local ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)

CREATE OR REPLACE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local
ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)

Query id: 713ed0e0-05ec-4631-864f-964a3ced90bf

0 rows in set. Elapsed: 0.008 sec.

Received exception from server (version 21.8.13):
Code: 80. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: CREATE OR REPLACE TABLE query is supported only for Atomic databases.
brogger71 commented 11 months ago

That didn't work for my installation:

sentry-clickhouse :) create or replace table metrics_raw_v2_dist as metrics_raw_v2_local ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)

CREATE OR REPLACE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local
ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)

Query id: 713ed0e0-05ec-4631-864f-964a3ced90bf

0 rows in set. Elapsed: 0.008 sec.

Received exception from server (version 21.8.13):
Code: 80. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: CREATE OR REPLACE TABLE query is supported only for Atomic databases.

I think you need to replace "my-cluster-name" with your cluster name. You will find the cluster name in the sentry-clickhouse-config configmap, under the <remote_servers> section.
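One quick way to pull that section out of the configmap (configmap name taken from above; adjust the namespace to yours) is:

    kubectl -n sentry get configmap sentry-clickhouse-config -o yaml | grep -A 3 '<remote_servers>'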

z0rc commented 11 months ago

@brogger71 here is this section, it's quite default:

    <remote_servers>
        <sentry-clickhouse>
            <shard>
                <replica>
                    <internal_replication>true</internal_replication>
                    <host>sentry-clickhouse-0.sentry-clickhouse-headless.sentry.svc.cluster.local</host>
                    <port>9000</port>
                    <user>default</user>
                    <compression>true</compression>
                </replica>
            </shard>
            <shard>
                <replica>
                    <internal_replication>true</internal_replication>
                    <host>sentry-clickhouse-1.sentry-clickhouse-headless.sentry.svc.cluster.local</host>
                    <port>9000</port>
                    <user>default</user>
                    <compression>true</compression>
                </replica>
            </shard>
            <shard>
                <replica>
                    <internal_replication>true</internal_replication>
                    <host>sentry-clickhouse-2.sentry-clickhouse-headless.sentry.svc.cluster.local</host>
                    <port>9000</port>
                    <user>default</user>
                    <compression>true</compression>
                </replica>
            </shard>
        </sentry-clickhouse>
    </remote_servers>

But changing the cluster name in the query didn't have any effect:

sentry-clickhouse :) create or replace table metrics_raw_v2_dist as metrics_raw_v2_local ENGINE = Distributed('sentry-clickhouse', default, metrics_raw_v2_local, timeseries_id)

CREATE OR REPLACE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local
ENGINE = Distributed('sentry-clickhouse', default, metrics_raw_v2_local, timeseries_id)

Query id: 3085aa1e-905e-4f81-9dec-e186c0b4a610

0 rows in set. Elapsed: 0.003 sec.

Received exception from server (version 21.8.13):
Code: 80. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: CREATE OR REPLACE TABLE query is supported only for Atomic databases.
brogger71 commented 11 months ago

@z0rc I have a default installation here as well and used exactly the same create statement. How did you connect to clickhouse?

    clickhouse-client --host sentry-clickhouse-0.sentry-clickhouse-headless.errortracking.svc.cluster.local --port 9000 --user default

z0rc commented 11 months ago

How did you connect to clickhouse?

Shell into sentry-clickhouse-0 pod and run clickhouse-client -h 127.0.0.1. I didn't want to bother with installing anything else additionally.

brogger71 commented 11 months ago

How did you connect to clickhouse?

Shell into sentry-clickhouse-0 pod and run clickhouse-client -h 127.0.0.1. I didn't want to bother with installing anything else additionally.

try using with "-u default". Maybe you're connected with the wrong user.

z0rc commented 11 months ago

try using with "-u default". Maybe you're connected with the wrong user.

Nope, same error. I don't think it's related to how ClickHouse is accessed, but to the data in it.

I have this:

sentry-clickhouse :) SELECT * FROM system.databases;

SELECT *
FROM system.databases

Query id: 19e60ded-3c09-49a6-9640-0e637245555f

β”Œβ”€name────┬─engine───┬─data_path─────────────────────────┬─metadata_path───────────────────────────────────────────────────────┬─uuid─────────────────────────────────┐
β”‚ default β”‚ Ordinary β”‚ /var/lib/clickhouse/data/default/ β”‚ /var/lib/clickhouse/metadata/default/                               β”‚ 00000000-0000-0000-0000-000000000000 β”‚
β”‚ system  β”‚ Atomic   β”‚ /var/lib/clickhouse/store/        β”‚ /var/lib/clickhouse/store/b5f/b5f009a7-31fc-4c2e-9b53-04a097fe6559/ β”‚ b5f009a7-31fc-4c2e-9b53-04a097fe6559 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This is quite an old installation; maybe something changed for new installations over the course of the last couple of years. My DB just kept running as is.
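For a quicker check of just the default database, the same information can be narrowed down (columns as shown in the system.databases output above):

    SELECT name, engine FROM system.databases WHERE name = 'default';
    -- 'Ordinary' here is what triggers the 'supported only for Atomic databases' error above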

acjohnson commented 11 months ago

I had to connect using the actual pod/container IP address, e.g.

# clickhouse-client -h 192.168.x.x

Apparently it wasn't binding to loopback, only the actual listening interface...
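If anyone else needs to do the same, a quick way to grab the pod IP and connect (namespace and pod name assumed for a default release) is:

    # look up the pod IP, then connect with clickhouse-client from inside the pod
    POD_IP=$(kubectl -n sentry get pod sentry-clickhouse-0 -o jsonpath='{.status.podIP}')
    kubectl -n sentry exec -it sentry-clickhouse-0 -- clickhouse-client -h "$POD_IP"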

ET-Torsten commented 11 months ago

I tried the approach described above and changed the cluster name, but ended up with an error:

CREATE OR REPLACE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local ENGINE = Distributed('sentry-clickhouse', default, metrics_raw_v2_local, timeseries_id);

Syntax error: failed at position 19 ('TABLE'):

CREATE OR REPLACE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local ENGINE = Distributed('sentry-clickhouse', default, metrics_raw_v2_local, timeseries_id);

Expected VIEW

Does anyone have an idea why CREATE OR REPLACE TABLE does not work?

MikeW1901 commented 11 months ago

Thanks all for the assistance with this (I'm running into the same issue). Unfortunately our ClickHouse DB is old enough that it's running with the Ordinary, rather than Atomic, database engine, and converting it seems impossible on a Kubernetes setup (ClickHouse have made it very straightforward via a flag, but when I try to restart the ClickHouse server to apply it, the pod gets killed instead, no matter how much resource I throw at it).

Does anyone have any wisdom about how you'd run

create or replace table metrics_raw_v2_dist as metrics_raw_v2_local ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)

.... 'long form' on database engines that don't support CREATE OR REPLACE syntax?

Thanks!

reneeckstein commented 11 months ago

@MikeW1901 Interesting. I also tried to migrate from Ordinary to Atomic by adding a file called convert_ordinary_to_atomic to the flags directory, as described here: https://kb.altinity.com/engines/altinity-kb-atomic-database-engine/how-to-convert-ordinary-to-atomic/#new-official-way However, it doesn't do anything, probably because our ClickHouse version (21.8.13.6) is too old, even though that's still the Helm chart default.

Do you use a higher Clickhouse version?
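For reference, the "official way" from that KB article amounts to dropping a marker file into ClickHouse's flags directory and restarting the server. On a chart install it would look roughly like the sketch below; pod name, namespace and the flags path are assumptions, and it is only effective on ClickHouse versions new enough to honour the flag:

    # create the conversion flag inside the ClickHouse pod (path assumed from the default data dir)
    kubectl -n sentry exec sentry-clickhouse-0 -- touch /var/lib/clickhouse/flags/convert_ordinary_to_atomic
    # the conversion runs on the next server start
    kubectl -n sentry delete pod sentry-clickhouse-0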

mrouhi13 commented 11 months ago

This is what I did on my cluster. The usual caveats apply:

  • I've never used clickhouse, this was the result of a couple of hours in the documentation
  • I don't know what I'm doing and you don't know me πŸ˜‰
  • I didn't care about availability or data loss during this procedure (it's a new-ish cluster)
  1. Scale the snuba-metrics-consumer deployment to 0
  2. For each clickhouse instance
    clickhouse-client
    create or replace table metrics_raw_v2_dist as metrics_raw_v2_local ENGINE = Distributed('my-cluster-name', default, metrics_raw_v2_local, timeseries_id)
  3. Scale snuba-metrics-consumer deployment back up

Thank you @jon-walton, that worked for me! πŸ‘πŸΌ

michaelniemand commented 11 months ago

Worked for me, too. To find out the name of your cluster, run:

    SELECT * FROM system.clusters LIMIT 2 FORMAT Vertical;

MikeW1901 commented 11 months ago

@MikeW1901 Interesting, I also tried it to migrate from Ordinary to Atomic by adding a file called convert_ordinary_to_atomic to the flags directory as described here https://kb.altinity.com/engines/altinity-kb-atomic-database-engine/how-to-convert-ordinary-to-atomic/#new-official-way However, it does not even do anything, probably because our Clickhouse version (ClickHouse 21.8.13.6) is too old, but it is still the helm chart default.

Do you use a higher Clickhouse version?

I do - the problem I'm having is that restarting Clickhouse Server kills the pod, and thus a new pod starts without the flag.

adonskoy commented 11 months ago

Snuba is being tested on ClickHouse 22.8, but that is still not an officially supported version. You may try updating at your own risk. Alternatively, you can use clickhouse-copier to copy the data to a temporary database, recreate the default database as Atomic, and copy the data back with clickhouse-copier.

I also have an environment where this consumer is scaled to zero, and I don't see any usability issues with Sentry. So this may also be a temporary solution until the migrations are fixed and/or the officially supported ClickHouse version is updated.
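If you go that route, note that a plain kubectl scale gets reverted on the next helm upgrade, so pinning it through chart values is more durable. A sketch, assuming a release named sentry and that your chart version exposes a replica count for the metrics consumer under snuba.metricsConsumer (the key name is an assumption; verify it against your values.yaml):

    # stop the crashing consumer immediately
    kubectl -n sentry scale deployment sentry-snuba-metrics-consumer --replicas=0

    # persist it through the chart (values key assumed; check your chart version)
    helm upgrade sentry sentry/sentry --reuse-values --set snuba.metricsConsumer.replicas=0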

Mokto commented 10 months ago

This issue is stale because it has been open for 30 days with no activity.

z0rc commented 10 months ago

Not stale. (I hate this bot)

Glaaj commented 9 months ago

I'm running into this problem as well; it's pretty annoying that it's a blocker for upgrades.

acjohnson commented 9 months ago

I will say this is not really a blocker for upgrades. Once upgraded, just follow the steps in https://github.com/sentry-kubernetes/charts/issues/1042#issuecomment-1814347637 and you should be good to go. It just feels like a poorly managed database migration to me.

CloudAc commented 9 months ago

This issue is related to this: https://github.com/getsentry/snuba/issues/4897

Mokto commented 8 months ago

This issue is stale because it has been open for 30 days with no activity.

z0rc commented 8 months ago

For people like me whose ClickHouse is using the Ordinary database engine: follow https://github.com/getsentry/snuba/issues/4897#issuecomment-1909077805 to delete and recreate the table.
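For reference, the drop-and-recreate path (the "long form" asked about earlier) boils down to something like the sketch below. This is based on the statements earlier in this thread rather than the linked comment, so double-check against it, and make sure the metrics consumer is scaled down first; the Distributed table itself holds no data, so dropping it is safe:

    -- run on each ClickHouse instance
    DROP TABLE IF EXISTS metrics_raw_v2_dist;
    CREATE TABLE metrics_raw_v2_dist AS metrics_raw_v2_local
    ENGINE = Distributed('sentry-clickhouse', default, metrics_raw_v2_local, timeseries_id);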

Mokto commented 7 months ago

This issue is stale because it has been open for 30 days with no activity.

Mokto commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.