metrico / qryn

⭐️ All-in-One Polyglot Observability with OLAP Storage for Logs, Metrics, Traces & Profiles. Drop-in Grafana Cloud replacement compatible with Loki, Prometheus, Tempo, Pyroscope, Opentelemetry, Datadog and beyond :rocket:
https://qryn.dev
GNU Affero General Public License v3.0

Qryn crash under load - ERR_STREAM_PREMATURE_CLOSE #516

Open jpsfs opened 3 months ago

jpsfs commented 3 months ago

Hi!

I'm facing an issue while using PromQL through qryn.

The query is the following:

histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{service_namespace=~"$environment",service_name=~"$component",  instance=~"$instance",http_route=~"$route", http_request_method=~"$method"}[$__rate_interval])) by (le))
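For readers less familiar with the PromQL side, here is a minimal sketch of how a quantile is estimated from cumulative `le` buckets, which is what the query above asks for after `sum(rate(...)) by (le)`. It assumes classic Prometheus histograms and linear interpolation inside the matching bucket. The function name, the dict-based input and the sample numbers are illustrative assumptions, not qryn code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    `buckets` maps each upper bound (the `le` label) to a cumulative
    value and must contain math.inf, mirroring the `+Inf` bucket.
    """
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        return math.nan
    total = buckets[math.inf]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead.
                return bounds[-2] if len(bounds) > 1 else math.nan
            if count == prev_count:
                # Empty bucket (only reachable for q == 0).
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second rates summed by `le`.
sample = {1.0: 2.0, 2.0: 6.0, 4.0: 9.0, math.inf: 10.0}
print(histogram_quantile(0.50, sample))
```

With these sample values the rank is 5, which lands in the `le="2"` bucket, so the estimate is interpolated between 1 and 2.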

It translates to this ClickHouse query:

WITH idx AS (select `fingerprint` from `qryn`.`time_series_gin` as `time_series_gin` where ((((`key` = 'service_namespace') and (match(val, '.+') = 1)) or ((`key` = 'service_name') and (match(val, '.+') = 1)) or ((`key` = 'instance') and (match(val, '.+') = 1)) or ((`key` = 'http_route') and (match(val, '.+') = 1)) or ((`key` = 'http_request_method') and (match(val, '.+') = 1)) or ((`key` = '__name__') and (`val` = 'http_server_request_duration_seconds_bucket'))) and (`date` >= toDate(fromUnixTimestamp(1718310540))) and (`date` <= toDate(fromUnixTimestamp(1718312340))) and (`type` in (0,0))) group by `fingerprint` having (groupBitOr(bitShiftLeft(((`key` = 'service_namespace') and (match(val, '.+') = 1))::UInt64, 0)+bitShiftLeft(((`key` = 'service_name') and (match(val, '.+') = 1))::UInt64, 1)+bitShiftLeft(((`key` = 'instance') and (match(val, '.+') = 1))::UInt64, 2)+bitShiftLeft(((`key` = 'http_route') and (match(val, '.+') = 1))::UInt64, 3)+bitShiftLeft(((`key` = 'http_request_method') and (match(val, '.+') = 1))::UInt64, 4)+bitShiftLeft(((`key` = '__name__') and (`val` = 'http_server_request_duration_seconds_bucket'))::UInt64, 5)) = 63)), raw AS (select argMaxMerge(last) as `value`,`fingerprint`,intDiv(timestamp_ns, 15000000000) * 15000 as `timestamp_ms` from `metrics_15s` as `metrics_15s` where ((`fingerprint` in (idx)) and (`timestamp_ns` >= 1718310540000000000) and (`timestamp_ns` <= 1718312340000000000) and (`type` in (0,0))) group by `fingerprint`,`timestamp_ms` order by `fingerprint`,`timestamp_ms`), timeSeries AS (select `fingerprint`,arraySort(JSONExtractKeysAndValues(labels, 'String')) as `labels` from `qryn`.`time_series` where ((`fingerprint` in (idx)) and (`type` in (0,0)))) select any(labels) as `stream`,arraySort(groupArray((raw.timestamp_ms, raw.value))) as `values` from raw as `raw` any left join timeSeries as time_series on `time_series`.`fingerprint` = raw.fingerprint group by `raw`.`fingerprint` order by `raw`.`fingerprint`
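The `idx` part of the generated SQL can be hard to read. As far as the query text shows, each of the six label conditions gets its own bit, `groupBitOr` ORs those bits per fingerprint, and only fingerprints whose mask equals 63 (all six bits set) are kept. The sketch below reproduces that idea in Python under stated assumptions: `rows` stands in for `(fingerprint, key, val)` rows of `time_series_gin`, `re.search` stands in for ClickHouse `match`, and all helper names are hypothetical.

```python
import re
from collections import defaultdict

METRIC = "http_server_request_duration_seconds_bucket"


def non_empty(val):
    # Stand-in for ClickHouse `match(val, '.+') = 1`.
    return re.search(r".+", val) is not None


# One condition per bit, in the same order as the SQL (bit 0 .. bit 5).
CONDITIONS = [
    ("service_namespace", non_empty),
    ("service_name", non_empty),
    ("instance", non_empty),
    ("http_route", non_empty),
    ("http_request_method", non_empty),
    ("__name__", lambda val: val == METRIC),
]
ALL_BITS = (1 << len(CONDITIONS)) - 1  # 0b111111 == 63


def matching_fingerprints(rows):
    """Return fingerprints for which every condition matched some row."""
    masks = defaultdict(int)
    for fingerprint, key, val in rows:
        for bit, (cond_key, predicate) in enumerate(CONDITIONS):
            if key == cond_key and predicate(val):
                # Same role as groupBitOr(bitShiftLeft(cond, bit)).
                masks[fingerprint] |= 1 << bit
    return sorted(fp for fp, mask in masks.items() if mask == ALL_BITS)
```

A series that lacks any one of the labels (for example `http_route`) ends up with a mask below 63 and is dropped, which is how the `having ... = 63` clause turns the OR-ed `where` conditions into an AND across labels.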

And after a few seconds it crashes with the following error:

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at Gunzip.onclose (node:internal/streams/end-of-stream:154:30)
    at Gunzip.emit (node:events:531:35)
    at emitCloseNT (node:internal/streams/destroy:147:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
Emitted 'error' event on Readable instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ERR_STREAM_PREMATURE_CLOSE'
}
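
The stack trace shows the error surfacing on a Gunzip stream that closed before it had finished. As a loose analogue only (Python's `gzip` module, not Node.js streams and not qryn code), the sketch below shows a decompressor complaining when compressed input stops short. Whether the ClickHouse response is actually being cut off in this issue is not established by the thread. The payload and helper name are made up for illustration.

```python
import gzip

# Hypothetical compressed payload standing in for a gzip-encoded response.
payload = gzip.compress(b"row\n" * 10_000)


def try_decompress(data):
    """Return the decompressed size, or a message if the input ends early."""
    try:
        return len(gzip.decompress(data))
    except EOFError as exc:
        return f"EOFError: {exc}"


print(try_decompress(payload))                      # complete stream
print(try_decompress(payload[: len(payload) // 2]))  # stream cut short
```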

When I run this query directly in ClickHouse, it returns in around 150 ms with a total result size of 50 MiB. Any pointers on what I should do to overcome this?

Best, José

lmangani commented 3 months ago

Thanks for the report, @jpsfs. Do you see any logs or errors from ClickHouse when this query fails?

jpsfs commented 3 months ago

Thank you for the follow-up, @lmangani! I only see normal ClickHouse logs. The query itself doesn't seem to fail on ClickHouse, and if I execute it manually it succeeds quite fast.

I forgot to mention that this was tested in the latest version (released today) as well as in the previous two versions.

If the database is smaller (less data) the query succeeds on qryn as well.

Best, José

akvlad commented 3 months ago

@jpsfs do you see a "timeout" error message in Grafana when you request histogram_quantile(0.50, ....)?

jpsfs commented 3 months ago

Yes, qryn crashes and the request in Grafana times out.

Best,


lmangani commented 3 months ago

@jpsfs could you please retest using the latest release with the following environment variable set, and share any feedback?

EXPERIMENTAL_PROMQL_OPTIMIZE=1

akvlad commented 3 months ago

Hello @jpsfs. Version 3.2.24 has been released.

Please set the env var EXPERIMENTAL_PROMQL_OPTIMIZE=1 before use. Please share your experience with the sum and rate functions (as in your histogram_quantile.... request) so we can decide whether further optimizations are worth doing.