metrico / qryn

Polyglot All-in-One Observability w/ ClickHouse Storage. Drop-in compatible with Loki, Prometheus, Tempo, Pyroscope, Opentelemetry, Datadog & more! WASM powered ⭐️ Star to Support
https://qryn.dev
GNU Affero General Public License v3.0
1.06k stars 64 forks

Trace search doesn't work #529

Open KhafRuslan opened 1 week ago

KhafRuslan commented 1 week ago

After deleting the database, I updated two components, metrico-otel-collector and qryn, to versions 0.0.3 and 3.2.25 respectively. qryn created the databases from scratch (it has full access to create databases) and created the following tables (screenshot attached).

I generated the traces with the following command:

  /root/go/bin/telemetrygen traces --otlp-endpoint qryn-host:4318 --otlp-http --traces 1 --otlp-insecure true --size 1 --workers 3

Part of the configuration related to trace reception:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 10000
    timeout: 5s
  memory_limiter:
    check_interval: 2s
    limit_mib: 1800
    spike_limit_mib: 500
  resource:
    attributes:
    - action: upsert
      key: service.name
      value: serviceName
  resourcedetection/system:
    detectors:
    - system
    system:
      hostname_sources:
      - os
exporters:
  debug: {}
  otlp:
    endpoint: localhost:4317
    tls:
      insecure: true
  qryn:
    dsn: http://chpoxy-host.somedomain.group:9012/qryn?username=qryn&password=somepassword
    logs:
      format: raw
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      queue_size: 1000
    timeout: 10s

service:
  pipelines:
    traces:
      exporters: [qryn]
      processors: [memory_limiter, resourcedetection/system, resource, batch]
      receivers: [otlp, jaeger]
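For context, the fragments above presumably fit together into a collector configuration shaped like this (a minimal sketch built only from values quoted in this report; options such as retries and the sending queue are omitted):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  qryn:
    dsn: http://chpoxy-host.somedomain.group:9012/qryn?username=qryn&password=somepassword
    logs:
      format: raw

service:
  pipelines:
    traces:
      # the pipeline also lists a jaeger receiver, whose
      # definition was not included in the pasted config
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection/system, resource, batch]
      exporters: [qryn]
```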

Next, I queried the generated traces via the Grafana Tempo data source and also checked the qryn view, and got the following results (screenshots attached).

I see the following error in the qryn logs:

{"level":50,"time":1720172285671,"pid":19,"hostname":"8ab58f9931b3","name":"qryn","msg":"rawRequest error: select bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0) as `cond`,count() as `count` from `qryn`.`tempo_traces_attrs_gin` as `traces_idx` where (((`date` >= '2024-07-05') and (`date` <= '2024-07-05') and (`traces_idx`.`timestamp_ns` >= 1720150685000000000) and (`traces_idx`.`timestamp_ns` < 1720172285000000000))) and () group by bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0) FORMAT JSON"}
{"level":50,"time":1720172285671,"pid":19,"hostname":"8ab58f9931b3","name":"qryn","msg":"Code: 42. DB::Exception: Function tuple requires at least one argument.: In scope SELECT bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0) AS cond, count() AS count FROM qryn.tempo_traces_attrs_gin AS traces_idx WHERE ((date >= '2024-07-05') AND (date <= '2024-07-05') AND (traces_idx.timestamp_ns >= 1720150685000000000) AND (traces_idx.timestamp_ns < 1720172285000000000)) AND tuple() GROUP BY bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0). (NUMBER_OF_ARGUMENTS_DOESNT_MATCH) (version 24.3.2.23 (official build))\n"}
{"level":50,"time":1720172285671,"pid":19,"hostname":"8ab58f9931b3","name":"qryn","reqId":"req-gm","err":"Error: Request failed with status code 500\nResponse: [500] undefined\nAxiosError: Request failed with status code 500\n    at settle (/app/node_modules/axios/dist/node/axios.cjs:1966:12)\n    at IncomingMessage.handleStreamEnd (/app/node_modules/axios/dist/node/axios.cjs:3065:11)\n    at IncomingMessage.emit (node:events:531:35)\n    at endReadableNT (node:internal/streams/readable:1696:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n    at Axios.request (/app/node_modules/axios/dist/node/axios.cjs:3876:41)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async rawRequest (/app/lib/db/clickhouse.js:1386:12)\n    at async evaluateComplexity (/app/traceql/index.js:61:20)\n    at async search (/app/traceql/index.js:33:20)\n    at async searchV2 (/app/lib/handlers/tempo_search.js:74:20)\n    at async Object.handler (/app/lib/handlers/tempo_search.js:25:12)","msg":"Request failed with status code 500"}
WITH sel_a AS (select `samples`.`string` as `string`,`samples`.`fingerprint` as `fingerprint`,samples.timestamp_ns as `timestamp_ns` from qryn.samples_v3_dist as `samples` where ((`samples`.`timestamp_ns`   between 1720171998408000000 and 1720172298408000000) and (`samples`.`type` in (0,0))) and (samples.fingerprint IN (select `sel_1`.`fingerprint` from (select `fingerprint` from `qryn`.`time_series_gin` where ((`key` = 'appname') and (`val` != 'kube-logs'))) as `sel_1`)) order by `timestamp_ns` desc limit 100) select JSONExtractKeysAndValues(time_series.labels, 'String') as `labels`,sel_a.* from sel_a left any join `qryn`.`time_series_dist` AS time_series on `sel_a`.`fingerprint` = time_series.fingerprint order by `labels` desc,`timestamp_ns` desc

And it's usually in the database logs:

DB::Exception: Function tuple requires at least one argument.: In scope SELECT bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0) AS cond, count() AS count FROM qryn.tempo_traces_attrs_gin AS traces_idx WHERE ((date >= '2024-07-05') AND (date <= '2024-07-05') AND (traces_idx.timestamp_ns >= 1720150538000000000) AND (traces_idx.timestamp_ns < 1720172138000000000)) AND tuple() GROUP BY bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0). (NUMBER_OF_ARGUMENTS_DOESNT_MATCH) (version 24.3.2.23 (official build)) (from host) (in query: select bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0) as cond,count() as count from qryn.tempo_traces_attrs_gin as traces_idx where (((date >= '2024-07-05') and (date <= '2024-07-05') and (traces_idx.timestamp_ns >= 1720150538000000000) and (traces_idx.timestamp_ns < 1720172138000000000))) and () group by bitShiftLeft(toUInt64(traces_idx.duration > 1000000), 0) FORMAT JSON)
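The failing part is the empty group at the end of the generated WHERE clause, `... and ()`, which ClickHouse rewrites to `tuple()` and then rejects. As a rough illustration of how such SQL can be produced (hypothetical helper names, not qryn's actual code), a query builder that joins an empty tag-filter list emits exactly this shape unless it guards against empty groups:

```python
# Rough illustration (hypothetical helpers, not qryn's actual code) of how
# an empty tag-filter list can leak an empty "()" group into generated SQL.
# ClickHouse parses the empty group as tuple() and fails with
# "Function tuple requires at least one argument".

def build_where(time_conds, tag_conds):
    # Naive builder: always emits the tag group, even when it is empty.
    return "({}) and ({})".format(
        " and ".join(time_conds), " and ".join(tag_conds))

def build_where_fixed(time_conds, tag_conds):
    # Guarded builder: drops any group that has no conditions.
    groups = [g for g in (time_conds, tag_conds) if g]
    return " and ".join("({})".format(" and ".join(g)) for g in groups)

time_conds = ["date >= '2024-07-05'", "timestamp_ns >= 1720150685000000000"]

print(build_where(time_conds, []))        # trailing "and ()", invalid in ClickHouse
print(build_where_fixed(time_conds, []))  # empty group omitted
```

This matches the situation in the report: a trace search with no tag conditions leaves the tag group empty.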

The data in the database is stored in the following form (screenshot attached).

Can you tell me what the error is related to?

lmangani commented 1 week ago

@KhafRuslan thanks for the report. Do other queries succeed or is the issue related to parametric queries? Does the same issue happen for any timerange size, or specifically for large ones?

KhafRuslan commented 1 week ago

> Do other queries succeed or is the issue related to parametric queries? Does the same issue happen for any timerange size, or specifically for large ones?

This is independent of the time range. The problem occurs only with the trace data type; queries against other data types work.

akvlad commented 1 week ago

@KhafRuslan I'm not sure if querying without any tags is supported. Please try specifying the service name or span name or any other tag.

KhafRuslan commented 1 week ago

I've also tried specifying a tag; it doesn't fix the problem. But after a while the search worked inside the qryn view, though not in Grafana. The data source is added correctly (screenshots attached). I'm also confused by the way the data is displayed within ClickHouse; I added a screenshot to the first post. The first three tables contain data in a strange encoding. Is this how it should be?

lmangani commented 1 week ago

@KhafRuslan which version of Grafana are you using? We're solid with 10.x (stable), while 11.x is still considered experimental.

akvlad commented 1 week ago

@KhafRuslan regarding the tag request from Grafana that you mentioned in the previous response: can you show the logs of the qryn instance?

KhafRuslan commented 1 week ago

I'm using version 10 (screenshot attached).

The qryn logs:

{"level":50,"time":1720189867156,"pid":18,"hostname":"8ab58f9931b3","name":"qryn","msg":"rawRequest error: WITH index_search AS (select `trace_id` as `trace_id`,span_id as `span_id`,any(duration) as `duration`,any(timestamp_ns) as `timestamp_ns` from `qryn`.`tempo_traces_attrs_gin` as `traces_idx` where (((`date` >= '2024-07-05') and (`date` <= '2024-07-05') and (`traces_idx`.`timestamp_ns` >= 1720168266000000000) and (`traces_idx`.`timestamp_ns` < 1720189866000000000))) and (((key = 'service.name') and (val = 'serviceName')) or ((key = 'service.name') and (val = 'serviceName'))) group by `trace_id`,`span_id` having (bitAnd(groupBitOr(bitShiftLeft(toUInt64((key = 'service.name') and (val = 'serviceName')), 0)+bitShiftLeft(toUInt64((key = 'service.name') and (val = 'serviceName')), 1)) as bsCond, 1) != 0) order by `timestamp_ns` desc), index_grouped AS (select `trace_id` as `trace_id`,groupArray(100)(span_id) as `span_id` from index_search group by `trace_id` order by max(index_search.timestamp_ns) desc limit 20), trace_ids AS (select `trace_id` from index_grouped), trace_span_ids AS (select `trace_id`,`span_id` from index_grouped array join `span_id` ) select lower(hex(traces.trace_id)) as `trace_id`,arrayMap(x -> lower(hex(x)), groupArrayIf(traces.span_id, (traces.trace_id, traces.span_id) IN trace_span_ids)) as `span_id`,groupArrayIf(traces.duration_ns, (traces.trace_id, traces.span_id) IN trace_span_ids) as `duration`,groupArrayIf(traces.timestamp_ns, (traces.trace_id, traces.span_id) IN trace_span_ids) as `timestamp_ns`,min(traces.timestamp_ns) as `start_time_unix_nano`,toFloat64(max(traces.timestamp_ns + traces.duration_ns) - min(traces.timestamp_ns)) / 1000000 as `duration_ms`,argMin(traces.name, traces.timestamp_ns) as `root_service_name`,groupArrayIf(base64Encode(traces.payload), (traces.trace_id, traces.span_id) IN trace_span_ids) as `payload`,groupArrayIf(traces.payload_type, (traces.trace_id, traces.span_id) IN trace_span_ids) as `payload_type` from 
`qryn`.`tempo_traces_dist` as `traces` where ((traces.trace_id in (trace_ids))) group by `traces`.`trace_id` order by `start_time_unix_nano` desc limit 20 FORMAT JSON"}
{"level":50,"time":1720189867156,"pid":18,"hostname":"8ab58f9931b3","name":"qryn","msg":"Code: 8. DB::Exception: Cannot find column `groupArrayIf(__table26.span_id, in(tuple(__table26.trace_id, __table26.span_id), __set_4824305632648500369_4093749546618302003))` in source stream, there are only columns: [__table26.trace_id, groupArrayIf(__table26.span_id, in(tuple(__table26.trace_id, __table26.span_id), __set_17215724413377854956_89883972281876497)), groupArrayIf(__table26.duration_ns, in(tuple(__table26.trace_id, __table26.span_id), __set_17215724413377854956_89883972281876497)), groupArrayIf(__table26.timestamp_ns, in(tuple(__table26.trace_id, __table26.span_id), __set_17215724413377854956_89883972281876497)), min(__table26.timestamp_ns), max(plus(__table26.timestamp_ns, __table26.duration_ns)), argMin(__table26.name, __table26.timestamp_ns), groupArrayIf(base64Encode(__table26.payload), in(tuple(__table26.trace_id, __table26.span_id), __set_17215724413377854956_89883972281876497)), groupArrayIf(__table26.payload_type, in(tuple(__table26.trace_id, __table26.span_id), __set_17215724413377854956_89883972281876497))]. (THERE_IS_NO_COLUMN) (version 24.3.2.23 (official build))\n"}
{"level":50,"time":1720189867156,"pid":18,"hostname":"8ab58f9931b3","name":"qryn","reqId":"req-a5ff","err":"Error: Request failed with status code 400\nResponse: [400] undefined\nAxiosError: Request failed with status code 400\n    at settle (/app/node_modules/axios/dist/node/axios.cjs:1966:12)\n    at IncomingMessage.handleStreamEnd (/app/node_modules/axios/dist/node/axios.cjs:3065:11)\n    at IncomingMessage.emit (node:events:531:35)\n    at endReadableNT (node:internal/streams/readable:1696:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n    at Axios.request (/app/node_modules/axios/dist/node/axios.cjs:3876:41)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async rawRequest (/app/lib/db/clickhouse.js:1386:12)\n    at async processSmallResult (/app/traceql/index.js:124:20)\n    at async search (/app/traceql/index.js:41:11)\n    at async searchV2 (/app/lib/handlers/tempo_search.js:74:20)\n    at async Object.handler (/app/lib/handlers/tempo_search.js:25:12)","msg":"Request failed with status code 400"}
{"level":50,"time":1720189867803,"pid":18,"hostname":"8ab58f9931b3","name":"qryn","msg":"rawRequest error: WITH index_search AS (select `trace_id` as `trace_id`,span_id as `span_id`,any(duration) as `duration`,any(timestamp_ns) as `timestamp_ns` from `qryn`.`tempo_traces_attrs_gin` as `traces_idx` where (((`date` >= '2024-07-05') and (`date` <= '2024-07-05') and (`traces_idx`.`timestamp_ns` >= 1720168267000000000) and (`traces_idx`.`timestamp_ns` < 1720189867000000000))) and (((key = 'service.name') and (val = 'serviceName')) or ((key = 'service.name') and (val = 'serviceName'))) group by `trace_id`,`span_id` having (bitAnd(groupBitOr(bitShiftLeft(toUInt64((key = 'service.name') and (val = 'serviceName')), 0)+bitShiftLeft(toUInt64((key = 'service.name') and (val = 'serviceName')), 1)) as bsCond, 1) != 0) order by `timestamp_ns` desc), index_grouped AS (select `trace_id` as `trace_id`,groupArray(100)(span_id) as `span_id` from index_search group by `trace_id` order by max(index_search.timestamp_ns) desc limit 20), trace_ids AS (select `trace_id` from index_grouped), trace_span_ids AS (select `trace_id`,`span_id` from index_grouped array join `span_id` ) select lower(hex(traces.trace_id)) as `trace_id`,arrayMap(x -> lower(hex(x)), groupArrayIf(traces.span_id, (traces.trace_id, traces.span_id) IN trace_span_ids)) as `span_id`,groupArrayIf(traces.duration_ns, (traces.trace_id, traces.span_id) IN trace_span_ids) as `duration`,groupArrayIf(traces.timestamp_ns, (traces.trace_id, traces.span_id) IN trace_span_ids) as `timestamp_ns`,min(traces.timestamp_ns) as `start_time_unix_nano`,toFloat64(max(traces.timestamp_ns + traces.duration_ns) - min(traces.timestamp_ns)) / 1000000 as `duration_ms`,argMin(traces.name, traces.timestamp_ns) as `root_service_name`,groupArrayIf(base64Encode(traces.payload), (traces.trace_id, traces.span_id) IN trace_span_ids) as `payload`,groupArrayIf(traces.payload_type, (traces.trace_id, traces.span_id) IN trace_span_ids) as `payload_type` from 
`qryn`.`tempo_traces_dist` as `traces` where ((traces.trace_id in (trace_ids))) group by `traces`.`trace_id` order by `start_time_unix_nano` desc limit 20 FORMAT JSON"}
{"level":50,"time":1720189867803,"pid":18,"hostname":"8ab58f9931b3","name":"qryn","msg":"Code: 8. DB::Exception: Cannot find column `groupArrayIf(__table26.span_id, in(tuple(__table26.trace_id, __table26.span_id), __set_12986401009218267427_7863460299579552761))` in source stream, there are only columns: [__table26.trace_id, groupArrayIf(__table26.span_id, in(tuple(__table26.trace_id, __table26.span_id), __set_1020460723764415165_9711705387046830144)), groupArrayIf(__table26.duration_ns, in(tuple(__table26.trace_id, __table26.span_id), __set_1020460723764415165_9711705387046830144)), groupArrayIf(__table26.timestamp_ns, in(tuple(__table26.trace_id, __table26.span_id), __set_1020460723764415165_9711705387046830144)), min(__table26.timestamp_ns), max(plus(__table26.timestamp_ns, __table26.duration_ns)), argMin(__table26.name, __table26.timestamp_ns), groupArrayIf(base64Encode(__table26.payload), in(tuple(__table26.trace_id, __table26.span_id), __set_1020460723764415165_9711705387046830144)), groupArrayIf(__table26.payload_type, in(tuple(__table26.trace_id, __table26.span_id), __set_1020460723764415165_9711705387046830144))]. (THERE_IS_NO_COLUMN) (version 24.3.2.23 (official build))\n"}
{"level":50,"time":1720189867803,"pid":18,"hostname":"8ab58f9931b3","name":"qryn","reqId":"req-a5g5","err":"Error: Request failed with status code 400\nResponse: [400] undefined\nAxiosError: Request failed with status code 400\n    at settle (/app/node_modules/axios/dist/node/axios.cjs:1966:12)\n    at IncomingMessage.handleStreamEnd (/app/node_modules/axios/dist/node/axios.cjs:3065:11)\n    at IncomingMessage.emit (node:events:531:35)\n    at endReadableNT (node:internal/streams/readable:1696:12)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n    at Axios.request (/app/node_modules/axios/dist/node/axios.cjs:3876:41)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async rawRequest (/app/lib/db/clickhouse.js:1386:12)\n    at async processSmallResult (/app/traceql/index.js:124:20)\n    at async search (/app/traceql/index.js:41:11)\n    at async searchV2 (/app/lib/handlers/tempo_search.js:74:20)\n    at async Object.handler (/app/lib/handlers/tempo_search.js:25:12)","msg":"Request failed with status code 400"}
akvlad commented 1 week ago

@KhafRuslan and the version of your clickhouse?

KhafRuslan commented 1 week ago

ClickHouse server version 24.3.2.23 (official build).

akvlad commented 1 week ago

@KhafRuslan Apparently v24.3.2.23 with a cluster setup is not supported, and the compatibility tests have not been run for a while.

KhafRuslan commented 1 week ago

I'll try installing a different version and report back. Can you tell me what the encoding of the first columns is related to? Is it normal?

KhafRuslan commented 4 days ago

Rolled back to an older version (23.12.6). Errors now occur on any request (both logs and traces); sometimes it returns data, sometimes not. Example of an error:

{
    "level": 50,
    "time": 1720426091025,
    "pid": 19,
    "hostname": "bb2a43ba4fe0",
    "name": "qryn",
    "reqId": "req-14wa",
    "err": "Error: Request failed with status code 500\nResponse: [500] {\"exception\": \"Code: 10. DB::Exception: Not found column intDiv(__table1.timestamp_ns, 1000000_UInt32) in block. There are only columns: intDiv(timestamp_ns, 1000000), arrayFilter(lambda(tuple(x), notIn(tupleElement(x, 1), '__error__')), JSONExtractKeysAndValues(labels, 'String')): While executing Remote. (NOT_FOUND_COLUMN_IN_BLOCK) (version 24.3.2.23 (official build))\"}\n{\"exception\": \"Code: 10. DB::Exception: Not found column intDiv(__table1.timestamp_ns, 1000000_UInt32) in block. There are only columns: intDiv(timestamp_ns, 1000000), arrayFilter(lambda(tuple(x), notIn(tupleElement(x, 1), '__error__')), JSONExtractKeysAndValues(labels, 'String')): While executing Remote. (NOT_FOUND_COLUMN_IN_BLOCK) (version 24.3.2.23 (official build))\"}\n\nAxiosError: Request failed with status code 500\n    at settle (/app/node_modules/axios/dist/node/axios.cjs:1966:12)\n    at RedirectableRequest.handleResponse (/app/node_modules/axios/dist/node/axios.cjs:3013:9)\n    at RedirectableRequest.emit (node:events:519:28)\n    at RedirectableRequest._processResponse (/app/node_modules/follow-redirects/index.js:398:10)\n    at RedirectableRequest._onNativeResponse (/app/node_modules/follow-redirects/index.js:91:12)\n    at Object.onceWrapper (node:events:634:26)\n    at ClientRequest.emit (node:events:519:28)\n    at HTTPParser.parserOnIncomingClient (node:_http_client:698:27)\n    at HTTPParser.parserOnHeadersComplete (node:_http_common:119:17)\n    at Socket.socketOnData (node:_http_client:540:22)\n    at Axios.request (/app/node_modules/axios/dist/node/axios.cjs:3876:41)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async queryFingerprintsScan (/app/lib/db/clickhouse.js:385:19)\n    at async Object.handler (/app/lib/handlers/query_range.js:41:22)",
    "msg": "Request failed with status code 500"
}

The corresponding query from the database logs:

WITH sel_a AS (select samples.string as string,samples.fingerprint as fingerprint,samples.timestamp_ns as timestamp_ns from qryn.samples_v3_dist as samples where ((samples.timestamp_ns between 1720339669452000000 and 1720426069452000000) and (samples.type in (0,0))) and (samples.fingerprint in (select fingerprint from (select DISTINCT fingerprint from qryn.time_series where ((JSONHas(labels, 'context') = 1) and (JSONExtractString(labels, 'context') = '[3ba2a383-494d-3b89-b015-f3a8bb9a25df]'))))) order by timestamp_ns desc limit 10) select JSONExtractKeysAndValues(time_series.labels, 'String') as labels,sel_a.* from sel_a left any join qryn.time_series_dist AS time_series on sel_a.fingerprint = time_series.fingerprint order by labels desc,timestamp_ns desc FORMAT JSONEachRow