metrico / qryn

⭐️ All-in-One Polyglot Observability with OLAP Storage. Drop-in Grafana Cloud replacement compatible with Loki, Prometheus, Tempo, Pyroscope, OpenTelemetry, Datadog & beyond!
https://qryn.dev
GNU Affero General Public License v3.0

premature close when querying logql #533

Closed: KhafRuslan closed this issue 1 month ago

KhafRuslan commented 1 month ago

When querying in Grafana (version 10) I get inconsistent results: qryn (Loki-compatible) is selected as the data source, and the same query returns a different number of rows on different runs. The same happens in the qryn view.

(image)

qryn config (docker-compose):

services:
  otel-collector:
    container_name: otel-collector
    image: some-registry/ghcr.io-cache/metrico/qryn-otel-collector:0.0.3
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
#      - "24224:24224"
#      - "8086:8086"
      - "14250:14250"
      - "6831:6831"
      - "14268:14268"
      - "3200:3200"
      - "3100:3100"
      - "4318:4318"
      - "8888:8888"
#      - "11800:11800"
#      - "12800:12800"
#      - "8088:8088"
      - "5514:5514"
#      - "5514:3100"
      - "9411:9411"
      - "9200:9200"   # loki receiver
#      - "9300:9300"   # Elasticsearch
  qryn:
    image:  some-registry/project-cache/qxip/qryn:3.2.25
    container_name: qryn
    restart: unless-stopped
    expose:
      - 3101
    ports:
      - "80:3100"
    environment:
      CLICKHOUSE_SERVER: "chproxy-server"
      CLICKHOUSE_PORT: "8123"
      CLICKHOUSE_DB: "qryn"
      CLICKHOUSE_AUTH: "somecred"
      CLUSTER_NAME: "qryn_cluster"
      LOG_LEVEL: "debug"
      FASTIFY_BODYLIMIT: "1048576"
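
Since qryn reads the cluster through chproxy and CLUSTER_NAME is set, a quick sanity check is to confirm the qryn schema actually exists on every replica. A minimal sketch, assuming clickhouse-client access to any node (cluster name as configured above):

-- one row per replica; a host with a lower table count is missing part of the schema
SELECT
    hostName() AS host,
    count() AS qryn_tables
FROM clusterAllReplicas('qryn_cluster', system.tables)
WHERE database = 'qryn'
GROUP BY host
ORDER BY host;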

Qryn logs:

{"level":30,"time":1721035800909,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1k","req":{"method":"GET","url":"/loki/api/v1/query_range?direction=backward&end=1721035800562000000&limit=10&query=%7Bappname%3D%22acs1-server%22%7D&start=1721014200562000000&step=21600000ms","hostname":"qryn-test.somedomain.group","remoteAddress":"someip","remotePort":47542},"msg":"incoming request"}
{"level":20,"time":1721035800909,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1k","msg":"GET /loki/api/v1/query_range"}
{"level":20,"time":1721035800909,"pid":19,"hostname":"be0837b9417c","name":"qryn","msg":"Scanning Fingerprints..."}
WITH sel_a AS (select `samples`.`string` as `string`,`samples`.`fingerprint` as `fingerprint`,samples.timestamp_ns as `timestamp_ns` from qryn.samples_v3_dist as `samples` where ((`samples`.`timestamp_ns`   between 1721014200561000000 and 1721035800561000000) and (`samples`.`type` in (0,0))) and (`samples`.`fingerprint` in (select `fingerprint` from (select DISTINCT `fingerprint` from qryn.time_series where ((JSONHas(labels, 'appname') = 1) and (JSONExtractString(labels, 'appname') = 'acs1-server'))))) order by `timestamp_ns` desc limit 10) select JSONExtractKeysAndValues(time_series.labels, 'String') as `labels`,sel_a.* from sel_a left any join `qryn`.`time_series_dist` AS time_series on `sel_a`.`fingerprint` = time_series.fingerprint order by `labels` desc,`timestamp_ns` desc
{"level":20,"time":1721035800924,"pid":19,"hostname":"be0837b9417c","name":"qryn","msg":"Scanning Fingerprints..."}
{"level":30,"time":1721035801753,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1k","res":{"statusCode":200},"responseTime":844.1129480004311,"msg":"request completed"}
{"level":30,"time":1721035801841,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1j","res":{"statusCode":200},"msg":"stream closed prematurely"}
{"level":50,"time":1721035801842,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1j","err":"premature close\nError: premature close\n    at onclosenexttick (/app/node_modules/end-of-stream/index.js:54:86)\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)","msg":"premature close"}

In ClickHouse, this query always returns 16 rows. I have withheld the data. (image)
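
One way to narrow this down is to repeat a plain count with the same filter directly in clickhouse-client, bypassing Grafana and qryn entirely. A rough sketch, with the filter copied from the generated query in the logs above; if this count itself fluctuates between runs or between nodes, the inconsistency is in the storage layer rather than in qryn:

-- count of raw log lines for {appname="acs1-server"} in the queried window
SELECT count()
FROM qryn.samples_v3_dist AS samples
WHERE (samples.timestamp_ns BETWEEN 1721014200561000000 AND 1721035800561000000)
  AND (samples.type IN (0, 0))
  AND samples.fingerprint IN (
      SELECT DISTINCT fingerprint
      FROM qryn.time_series
      WHERE (JSONHas(labels, 'appname') = 1)
        AND (JSONExtractString(labels, 'appname') = 'acs1-server')
  );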

lmangani commented 1 month ago

Thanks for the report @KhafRuslan. Have you tried bypassing chproxy to make sure it's not involved?

KhafRuslan commented 1 month ago

I tried bypassing chproxy as well; it didn't solve the problem.

lmangani commented 1 month ago

OK, please provide any details that would help us reproduce this. Since we have no other similar reports, this might be due to something specific to your setup.

KhafRuslan commented 1 month ago

The operating system on all servers is CentOS 8. Scheme: (image)

OTel collector config:

receivers:
  tcplog:
    listen_address: "0.0.0.0:5514"
    operators:
    - type: json_parser
      timestamp:
        parse_from: attributes.time
        layout_type: gotime
        layout: '2006-01-02 15:04:05.999'
    - type: move
      from: attributes.newmessage
      to: body
    - type: remove
      id: time_remove
      field: attributes.time

extensions:
  health_check: []
  pprof: []
  zpages: []

processors:
  batch/logs:
    send_batch_size: 500
    timeout: 20s
  memory_limiter/logs:
    check_interval: 2s
    limit_mib: 18000
    spike_limit_mib: 500
  batch/trace:
    send_batch_size: 10000
    timeout: 0s
  memory_limiter/trace:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800

exporters:
  qryn:
    dsn: http://chproxy-host.somedomain.group:9000/qryn?username=qryn&password=somepassword
    logs:
      format: raw
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      queue_size: 1200
    timeout: 10s

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    logs:
      exporters: [qryn]
      processors: [memory_limiter/logs, batch/logs]
      receivers: [tcplog]
  telemetry:
    logs:
      level: "debug"
    metrics:
      address: 0.0.0.0:8888
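
To confirm the tcplog pipeline is actually registering the expected stream, the series labels can be listed directly; a sketch against the distributed series table referenced in the generated query above:

-- should return one labels JSON per distinct stream with appname=acs1-server
SELECT DISTINCT labels
FROM qryn.time_series_dist
WHERE JSONHas(labels, 'appname') = 1
  AND JSONExtractString(labels, 'appname') = 'acs1-server';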

HAProxy config:

global
    log 127.0.0.1 local2
    maxconn 1000
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    user   haproxy
    group  haproxy
    daemon
    stats socket /var/lib/haproxy/stats
defaults
    log global
    mode tcp
    retries 2
    timeout connect 4s

listen otel_write
    balance roundrobin
    bind *:9000
    default-server inter 10s fall 3 rise 3 on-marked-down shutdown-sessions
      server qryn-db1 10.80.172.125:9000
      server qryn-db2 10.80.172.126:9000
      server qryn-db3 10.80.172.127:9000
      server qryn-db4 10.80.172.124:9000

Chproxy config:

log_debug: true
hack_me_please: true

caches:
  - name: "default_cache"
    mode: "file_system"
    file_system:
      max_size: 2Gb
      dir: "/data/cache"
    max_payload_size: 2Gb
    shared_with_all_users: true
    expire: 1h

server:
  http:
    listen_addr: "0.0.0.0:8123"

users:
  - name: "qryn"
    password: "somepassword"
    to_cluster: "qryn_cluster"
    to_user: "qryn"
    max_concurrent_queries: 1000
    max_execution_time: 10m
    cache: "default_cache"

clusters:
  - name: "qryn_cluster"
    nodes: ["qryn-db1:8123", "qryn-db2:8123", "qryn-db3:8123", "qryn-db4:8123"]
    users:
      - name: "qryn"
        password: "somepassword"

ClickHouse settings cluster.xml:

<clickhouse>
     <remote_servers replace="1">
       <qryn_cluster> <!-- you need to give the cluster a name -->
           <!-- Inter-server per-cluster secret for Distributed queries default: no secret (no authentication will be performed) -->
           <secret>SuperSecretValue</secret>
           <shard>
               <internal_replication>true</internal_replication>
               <replica>
                   <host>qryn-db1</host>
                   <port>9000</port>
               </replica>
               <replica>
                   <host>qryn-db2</host>
                   <port>9000</port>
               </replica>
           </shard>
           <shard>
               <internal_replication>true</internal_replication>
               <replica>
                   <host>qryn-db3</host>
                   <port>9000</port>
               </replica>
               <replica>
                   <host>qryn-db4</host>
                   <port>9000</port>
               </replica>
           </shard>
       </qryn_cluster>
    </remote_servers>
</clickhouse>
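
With two shards of two replicas each and internal_replication enabled, both replicas within a shard should hold identical data. A minimal sketch for comparing per-replica row counts over the problematic time window, assuming the local table behind samples_v3_dist is named samples_v3:

-- counts should match pairwise within each shard; a mismatch indicates replica desync
SELECT
    hostName() AS host,
    count() AS rows_in_window
FROM clusterAllReplicas('qryn_cluster', qryn.samples_v3)
WHERE timestamp_ns BETWEEN 1721014200561000000 AND 1721035800561000000
GROUP BY host
ORDER BY host;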

database_atomic_delay_before_drop_table.xml:

<clickhouse>
    <database_atomic_delay_before_drop_table_sec>1</database_atomic_delay_before_drop_table_sec>
</clickhouse>

disable_drop_restrictions.xml:

<clickhouse>
    <max_table_size_to_drop>0</max_table_size_to_drop>
    <max_partition_size_to_drop>0</max_partition_size_to_drop>
</clickhouse>

disable_unused_ports.xml:

<clickhouse>
    <!-- To reduce exposure to your ClickHouse environment, unused ports should be disabled. -->
    <mysql_port remove="1" />
    <postgresql_port remove="1" />
</clickhouse>

listen.xml:

<clickhouse>
    <!--ipv4-->
    <listen_host>0.0.0.0</listen_host>
</clickhouse>

logging.xml:

<clickhouse>
  <logger replace="1">
    <level>information</level>
    <log>/var/log/clickhouse-server/clickhouse-server.log</log>
    <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
    <size>300M</size>
    <count>5</count>
  </logger>
</clickhouse>

macros.xml:

<clickhouse>
     <!--
        That macros are defined per server,
        and they can be used in DDL, to make the DB schema cluster/server neutral
    -->
    <macros>
        <cluster>qryn_cluster</cluster>
        <shard>01</shard>
        <replica>urt-qryn-db1</replica>
    </macros>
</clickhouse>

zookeeper.xml:

<clickhouse>
     <zookeeper>
        <node>
            <host>qryn-zookeper.somedomain.group</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>

ZooKeeper conf:

tickTime=2000
initLimit=20
syncLimit=5
dataDir=/u/zookeeper/data
clientPort=2181
maxClientCnxns=0
autopurge.snapRetainCount=5
autopurge.purgeInterval=1
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
4lw.commands.whitelist=stat
server.1=qryn-zookeper:2888:3888
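
Since replication state for the whole cluster depends on this single ZooKeeper node, it is worth verifying that every ClickHouse node can reach it. A sketch using the system.zookeeper table:

-- throws an exception on any node that cannot reach ZooKeeper
SELECT name
FROM system.zookeeper
WHERE path = '/';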

lmangani commented 1 month ago

Thanks for the detailed overview. Have you noticed any potentially related logs or errors in ClickHouse while this issue manifests?

KhafRuslan commented 1 month ago

No

KhafRuslan commented 1 month ago

I had been seeing errors like this before:

2024.07.16 14:41:58.926544 [ 15764 ] {c73cd6c6-4513-493b-8f66-0cfbb7c88939} HedgedConnectionsFactory: Connection failed at try №1, reason: Code: 210. DB::NetException: I/O error: Broken pipe, while writing to socket (db3:40594 -> db1:9000). (NETWORK_ERROR) (version 23.12.6.19 (official build))

But it seems to be related to something else.
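
To check whether those Broken pipe errors line up with the failing queries, the per-node error counters can be inspected; a sketch, assuming the system.errors columns available in recent ClickHouse releases:

-- NETWORK_ERROR counter and last occurrence on each node of the cluster
SELECT
    hostName() AS host,
    name,
    value,
    last_error_time,
    last_error_message
FROM clusterAllReplicas('qryn_cluster', system.errors)
WHERE name = 'NETWORK_ERROR';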

KhafRuslan commented 1 month ago

There was a desynchronization between the databases; it was resolved by recreating the databases several times. I'll see how it behaves over the next 24 hours. I also upgraded to the latest version. (image)
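
For catching a desynchronization like this earlier, replication health can be inspected per replica; a minimal sketch using system.replicas:

-- non-zero absolute_delay or a large queue_size points at a lagging replica
SELECT
    hostName() AS host,
    database,
    table,
    is_readonly,
    absolute_delay,
    queue_size
FROM clusterAllReplicas('qryn_cluster', system.replicas)
WHERE database = 'qryn';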

lmangani commented 1 month ago

Thanks for the update @KhafRuslan. This info is particularly useful, as we haven't been able to reproduce this so far.