metrico / qryn

⭐️ All-in-One Polyglot Observability with OLAP Storage for Logs, Metrics, Traces & Profiles. Drop-in Grafana Cloud replacement compatible with Loki, Prometheus, Tempo, Pyroscope, OpenTelemetry, Datadog and beyond :rocket:
https://qryn.dev
GNU Affero General Public License v3.0

ClickHouse Overcommit error #578

Open tsearle opened 1 month ago

tsearle commented 1 month ago

qryn threw the following error:

```
Code: 241. DB::Exception: Memory limit (total) exceeded: would use 7.21 GiB (attempt to allocate chunk of 4476253 bytes), maximum: 7.20 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: (while reading column labels): (while reading from part data/3bebf7bf-6abc-4a92-ae16-834c90cbafbc/20240925_0_24268_13230/ in table qryn.time_series (3bebf7bf-6abc-4a92-ae16-834c90cbafbc) located on disk s3WithKeeperDisk of type s3, from mark 0 with max_rows_to_read = 8192): While executing MergeTreeSelect(pool: PrefetchedReadPool, algorithm: Thread). (MEMORY_LIMIT_EXCEEDED) (version 24.5.1.22957 (official build))
```
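To find out which queries are actually consuming the memory, ClickHouse's `system.query_log` can be inspected. A diagnostic sketch, assuming query logging is enabled (the default in recent builds):

```sql
-- Top memory consumers among queries that finished or failed in the last day.
SELECT
    type,
    event_time,
    formatReadableSize(memory_usage) AS mem,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE event_date >= today() - 1
  AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
ORDER BY memory_usage DESC
LIMIT 10;
```

If the qryn query below tops this list, the memory really is spent there rather than in background merges or other workloads sharing the server.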

Digging in the ClickHouse console, the query that failed appears to be:

```sql
WITH idx AS (select fingerprint from qryn.time_series_gin as time_series_gin where ((((key = 'name') and (val = 'kamailio_app_server_dead'))) and (date >= toDate(fromUnixTimestamp(1727294240))) and (date <= toDate(fromUnixTimestamp(1727294540))) and (type in (0,0))) group by fingerprint having (groupBitOr(bitShiftLeft(((key = 'name') and (val = 'kamailio_app_server_dead'))::UInt64, 0)) = 1)), raw AS (select argMaxMerge(last) as value,fingerprint,intDiv(timestamp_ns, 15000000000) * 15000 as timestamp_ms from metrics_15s as metrics_15s where ((fingerprint in (idx)) and (timestamp_ns >= 1727294240000000000) and (timestamp_ns <= 1727294540000000000) and (type in (0,0))) group by fingerprint,timestamp_ms order by fingerprint,timestamp_ms), timeSeries AS (select fingerprint,arraySort(JSONExtractKeysAndValues(labels, 'String')) as labels from qryn.time_series where ((fingerprint in (idx)) and (type in (0,0)))) select any(labels) as stream,arraySort(groupArray((raw.timestamp_ms, raw.value))) as values from raw as raw any left join timeSeries as time_series on time_series.fingerprint = raw.fingerprint group by raw.fingerprint order by raw.fingerprint FORMAT RowBinary
```

This caused the Prometheus query (used for alarm monitoring) to fail: `kamailio_app_server_dead{}`

Am I doing something stupidly wrong?

akvlad commented 1 month ago

Hello @tsearle. Let's run a couple of helper queries to understand where the bottleneck is:

```sql
select count(), uniq(fingerprint) from (SELECT fingerprint
        FROM time_series_gin AS time_series_gin
        WHERE ((key = 'name') AND (val = 'kamailio_app_server_dead')) AND (date >= toDate(fromUnixTimestamp(1727294240))) AND (date <= toDate(fromUnixTimestamp(1727294540))) AND (type IN (0, 0))
        GROUP BY fingerprint
        HAVING groupBitOr(bitShiftLeft(CAST((key = 'name') AND (val = 'kamailio_app_server_dead'), 'UInt64'), 0)) = 1)
```

```sql
WITH
    idx AS
    (
        SELECT fingerprint
        FROM time_series_gin AS time_series_gin
        WHERE ((key = 'name') AND (val = 'kamailio_app_server_dead')) AND (date >= toDate(fromUnixTimestamp(1727294240))) AND (date <= toDate(fromUnixTimestamp(1727294540))) AND (type IN (0, 0))
        GROUP BY fingerprint
        HAVING groupBitOr(bitShiftLeft(CAST((key = 'name') AND (val = 'kamailio_app_server_dead'), 'UInt64'), 0)) = 1
    ),
    raw AS
    (
        SELECT
            argMaxMerge(last) AS value,
            fingerprint,
            intDiv(timestamp_ns, 15000000000) * 15000 AS timestamp_ms
        FROM metrics_15s AS metrics_15s
        WHERE (fingerprint IN (idx)) AND (timestamp_ns >= 1727294240000000000) AND (timestamp_ns <= 1727294540000000000) AND (type IN (0, 0))
        GROUP BY
            fingerprint,
            timestamp_ms
        ORDER BY
            fingerprint ASC,
            timestamp_ms ASC
    )
select count() from raw;
```

Please share the results of both. They should give us a set of numbers describing the overall cardinality you have.

tsearle commented 1 month ago

First query: `count()` = 0, `uniq(fingerprint)` = 0

Second query: `count()` = 0

Are the queries correct?

akvlad commented 1 month ago

@tsearle please try running `USE qryn;` before the queries.
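Alternatively, the database can be qualified explicitly inside the query, so the session's default database doesn't matter. A sketch of the first diagnostic query above with the `qryn.` prefix added (otherwise unchanged):

```sql
SELECT count(), uniq(fingerprint)
FROM
(
    SELECT fingerprint
    FROM qryn.time_series_gin AS time_series_gin  -- database qualified explicitly
    WHERE ((key = 'name') AND (val = 'kamailio_app_server_dead'))
      AND (date >= toDate(fromUnixTimestamp(1727294240)))
      AND (date <= toDate(fromUnixTimestamp(1727294540)))
      AND (type IN (0, 0))
    GROUP BY fingerprint
    HAVING groupBitOr(bitShiftLeft(CAST((key = 'name') AND (val = 'kamailio_app_server_dead'), 'UInt64'), 0)) = 1
);
```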

tsearle commented 1 month ago

To my knowledge this is with `USE qryn` (the qryn db is selected):

*[Screenshot 2024-09-27 at 14 22 02]*

Explicitly adding `USE qryn` doesn't change the result:

*[Screenshot 2024-09-27 at 14 23 30]*

akvlad commented 1 month ago

@tsearle in this case the request you posted should use ~0 MB of RAM and finish in almost no time. There's no way it is the cause.

Can you please recheck the full request in the issue description manually?

akvlad commented 1 month ago

@tsearle I have just understood what's wrong: it's GitHub markdown. Please try these queries instead.

```sql
select count(), uniq(fingerprint) from (SELECT fingerprint
        FROM time_series_gin AS time_series_gin
        WHERE ((key = '__name__') AND (val = 'kamailio_app_server_dead')) AND (date >= toDate(fromUnixTimestamp(1727294240))) AND (date <= toDate(fromUnixTimestamp(1727294540))) AND (type IN (0, 0))
        GROUP BY fingerprint
        HAVING groupBitOr(bitShiftLeft(CAST((key = '__name__') AND (val = 'kamailio_app_server_dead'), 'UInt64'), 0)) = 1)
```

```sql
WITH
    idx AS
    (
        SELECT fingerprint
        FROM time_series_gin AS time_series_gin
        WHERE ((key = '__name__') AND (val = 'kamailio_app_server_dead')) AND (date >= toDate(fromUnixTimestamp(1727294240))) AND (date <= toDate(fromUnixTimestamp(1727294540))) AND (type IN (0, 0))
        GROUP BY fingerprint
        HAVING groupBitOr(bitShiftLeft(CAST((key = '__name__') AND (val = 'kamailio_app_server_dead'), 'UInt64'), 0)) = 1
    ),
    raw AS
    (
        SELECT
            argMaxMerge(last) AS value,
            fingerprint,
            intDiv(timestamp_ns, 15000000000) * 15000 AS timestamp_ms
        FROM metrics_15s AS metrics_15s
        WHERE (fingerprint IN (idx)) AND (timestamp_ns >= 1727294240000000000) AND (timestamp_ns <= 1727294540000000000) AND (type IN (0, 0))
        GROUP BY
            fingerprint,
            timestamp_ms
        ORDER BY
            fingerprint ASC,
            timestamp_ms ASC
    )
select count() from raw;
```

A kind request for the future: please use triple backticks when you paste a query, so it's treated as code. GitHub's markdown processor eats parts of the text, like the double underscores in `__name__`.
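To illustrate: pasted as plain text, GitHub renders `__name__` as bold **name**, silently dropping the underscores (which is why the original queries compared `key = 'name'`). Inside a fenced block the text survives verbatim:

````
```sql
SELECT fingerprint FROM qryn.time_series_gin WHERE key = '__name__';
```
````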

Sorry for the inconvenience.

gabrielgontariu commented 1 month ago

@akvlad attached results: *[broken image upload: Screenshot 2024-09-30 at 11.07.25.png…]*

lmangani commented 1 month ago

@gabrielgontariu @tsearle the screenshot didn't work unfortunately

tsearle commented 1 month ago

Screenshot_20240930-102326.png

Screenshot_20240930-102411.png

Roman2dot0 commented 1 month ago

https://clickhouse.com/docs/en/operations/settings/query-complexity#settings-max_bytes_before_external_group_by
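The linked setting makes large `GROUP BY` stages spill to disk instead of aborting with `MEMORY_LIMIT_EXCEEDED`. A possible mitigation sketch; the ~2 GiB thresholds are illustrative, not recommendations, and should be tuned to the server's available RAM:

```sql
-- Allow memory-heavy GROUP BY / ORDER BY stages to spill to disk
-- once they exceed the threshold, trading speed for stability.
SET max_bytes_before_external_group_by = 2147483648;
SET max_bytes_before_external_sort = 2147483648;
```

These can also be set per profile in `users.xml` (or the equivalent config) so qryn's queries pick them up without changing the application.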