metrico / qryn

⭐️ All-in-One Polyglot Observability with OLAP Storage for Logs, Metrics, Traces & Profiles. Drop-in Grafana Cloud replacement compatible with Loki, Prometheus, Tempo, Pyroscope, Opentelemetry, Datadog
https://qryn.dev
GNU Affero General Public License v3.0

Slow log request performance #539

Open deathalt opened 1 month ago

deathalt commented 1 month ago

log format:

{"body":"body","traceid":"1","spanid":"1","severity":"Information","flags":1,"attributes":{"AccountId":1,"ActionId":"UUID","ActionName":"Controller","Stat":"1","Id":"2","ConnectionId":"123","GId":0,"ParentId":"UUID","RequestId":"UUID","RequestPath":"/path","SpanId":"1323aff603f45021","TraceId":"03016d03700e03ec8b2892038bacfd92","{OriginalFormat}":"INIT"},"resources":{"deployment.environment":"Stage","service.instance.id":"UUID","service.name":"Service","team":"Team","telemetry.sdk.language":"dotnet","telemetry.sdk.name":"opentelemetry","telemetry.sdk.version":"1.9.0"},"instrumentation_scope":{"name":"Controller"}}

Loki request:

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= 85577686 | json body="body", Environment="resources[\"deployment.environment\"]", Scope="instrumentation_scope[\"name\"]" | Environment = Stage

It takes more than 30 seconds and then crashes.

{level=~"WARN|INFO|ERROR|FATAL", job="Service"}

works fine.

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} | json body="body"

7-day range -> 15.2 seconds response time, "Logs volume" panel in Grafana returns 502

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} | json body="body", Environment="resources[\"deployment.environment\"]", Scope="instrumentation_scope[\"name\"]" | Environment =Stage``

7-day range -> 30 seconds response time -> 502 "Client.Timeout exceeded while awaiting headers"

Any chance to speed it up?

akvlad commented 1 month ago

Hello @deathalt

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` [1d]))

Can you run this count over the last 7 days?

akvlad commented 1 month ago

Another question: I see no use for

body="body", Scope="instrumentation_scope[\"name\"]"

Do you really need to extract these two labels?

deathalt commented 1 month ago

@akvlad

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= 85577686 [1d]))

It took 1.83 minutes and then failed.

(screenshot)

Another question: I see no use for body="body", Scope="instrumentation_scope[\"name\"]"

I can't say for sure, because it's one of our devs' use cases; I think he just wants to get this label into the log body.

akvlad commented 1 month ago

I have to know the amount of data we deal with so I can reproduce it locally.

@deathalt what about

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} [1d]))

?
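
For reference, a similar volume check can also be run directly against ClickHouse. A minimal sketch, assuming qryn's default samples_v3 table with a timestamp_ns column in nanoseconds (names are illustrative, adjust to your schema):

    -- Count rows per day over the last 7 days for a rough volume estimate.
    SELECT
        toDate(toDateTime(intDiv(timestamp_ns, 1000000000))) AS day,
        count() AS rows
    FROM samples_v3
    WHERE timestamp_ns >= toUnixTimestamp(now() - INTERVAL 7 DAY) * 1000000000
    GROUP BY day
    ORDER BY day;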

deathalt commented 1 month ago

@akvlad

(screenshot)

1.59s

akvlad commented 1 month ago

@deathalt ClickHouse is definitely unable to scan about 2.5B rows in 1 minute. Please increase the timeouts until

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` [1d]))

starts passing. Then please try something like:

{level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` |~ `"deployment.environment":\\s*"Stage"`

Thanks for the information. I'll try to ingest a comparable amount of data and try the request.
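
If the ClickHouse side is what is cutting the query off, relaxing its limits might look like the sketch below. Which setting actually matters depends on your deployment; the names and values here are only illustrative:

    -- Per-session sketch: relax common per-query limits (values are illustrative).
    SET max_execution_time = 600;   -- per-query execution limit, 0 = unlimited
    SET send_timeout = 600;         -- network send timeout, seconds
    SET receive_timeout = 600;      -- network receive timeout, seconds

    -- Or persist for a user (requires SQL-driven access control):
    ALTER USER default SETTINGS max_execution_time = 600;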

deathalt commented 1 month ago

@akvlad

I don't understand what the problem is. My settings are:

  FASTIFY_BODYLIMIT: 15242880
  FASTIFY_REQUESTTIMEOUT: 0
  FASTIFY_MAXREQUESTS: 0

but I still get

Get “http://qryn-read.qryn.svc:3100/loki/api/v1/query_range?
net/http: timeout awaiting response headers (Client.Timeout exceeded while awaiting headers)

~1.5 minutes after the request

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` [1d]))

lmangani commented 1 month ago

@deathalt no errors in the qryn logs preceding this response? Just to be sure could you also confirm you have no proxy between the client and qryn?

deathalt commented 1 month ago

@lmangani

qryn-read-5bcf8468c5-mvqf4 {"level":30,"time":1722422596334,"pid":19,"hostname":"qryn-read-5bcf8468c5-mvqf4","name":"qryn","reqId":"req-1w","res":{"statusCode":200},"msg":"stream closed prematurely"}
qryn-read-5bcf8468c5-mvqf4 {"level":50,"time":1722422596334,"pid":19,"hostname":"qryn-read-5bcf8468c5-mvqf4","name":"qryn","reqId":"req-1w","err":"premature close\nError: premature close\n    at onclosenexttick (/app/node_modules/end-of-stream/index.js:54:86)\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)","msg":"premature close"}

only this

clickhouse -> consul -> qryn

no proxy at all

lmangani commented 1 month ago

Is consul providing the connectivity when the queries run? I'm not familiar with it. If it is involved in providing networking, are you sure it doesn't introduce any session timers between client and server?

deathalt commented 1 month ago

OK, the problem was the default ClickHouse timeouts.

Here is the result of

sum by (job) (count_over_time({level=~"WARN|INFO|ERROR|FATAL", job="Service"} |= `85577686` [1d]))

(screenshot)

lmangani commented 1 month ago

@deathalt thanks for the update. Which ClickHouse setting was affecting this? I'll add it to the configuration notes. Did this positively affect any of the previously reported failures as a side-effect?

akvlad commented 1 month ago

@deathalt so you're trying to filter about 350 rows out of 3.3B. Mmmm... There are not a lot of ways to improve on the raw power of grep. Maybe you can manually configure skip indexes on the samples_v3 table. They can speed up the LIKE '%...%' lookup behind the |= 85577686 part.

It would be an interesting experiment to import 1B rows and compare skip indexes against no index at all.
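
A minimal sketch of such a skip index, assuming the raw log line is stored in samples_v3's string column (the index name and bloom-filter parameters are illustrative and would need tuning):

    -- Token bloom-filter index to help substring filters like LIKE '%85577686%'.
    ALTER TABLE samples_v3
        ADD INDEX string_tokens_idx string TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4;

    -- Build the index for already-existing parts (can be heavy on a big table).
    ALTER TABLE samples_v3 MATERIALIZE INDEX string_tokens_idx;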