grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Loki Query Slowness #5803

Open singh9vj opened 2 years ago

singh9vj commented 2 years ago

Hi,

We are designing a centralized logging system using Loki. We currently ingest 10 TB/day into Loki, using DynamoDB and S3 as the backend, and search operations are very slow. Are there any best practices for speeding up queries?

We plan to ingest 100 TB of data per day in the future, so the current performance will not scale; this is our main concern.

Can anyone please suggest a solution?

Thanks in advance.

liguozhong commented 2 years ago

ref PR: https://github.com/grafana/loki/pull/5455
ref issue: https://github.com/grafana/loki/issues/5405

cyriltovena commented 2 years ago

How do you deploy Loki? For 10 TB/day you need the distributed deployment.

singh9vj commented 2 years ago

Yes, we have followed the distributed deployment architecture and run Loki on EC2 instances using the binary.

Below is the architecture:

- 2 boxes for the distributor
- 3 boxes for ingester + ruler
- 2 boxes for querier + query-frontend

All of the above boxes are in AWS Auto Scaling groups and scale out as CPU usage increases.

Attached is the Loki configuration: loki-conf.txt

cyriltovena commented 2 years ago

Can you share statistics of a query?

splitice commented 2 years ago

I can second PR #5455 as very helpful for cases like this.

It's not perfect, but it works in a pinch.

We're not using it at the 100 TB/day level though (we're aiming for 500 GB/day).

ankitnayan commented 2 years ago

@singh9vj which types of queries are you running?

Any details on the queries you run would be helpful in answering.

singh9vj commented 2 years ago

We run all types of queries: string searches over one-week ranges of logs, plus avg, sum, rate, and other analysis queries. These are very slow on nginx logs, which are pushed at a rate of 30k to 60k log lines per second.

ankitnayan commented 2 years ago

> We run all types of queries: string searches over one-week ranges of logs, plus avg, sum, rate, and other analysis queries. These are very slow on nginx logs, which are pushed at a rate of 30k to 60k log lines per second.

@singh9vj can you provide an example log line and the search filters used, assuming you are not running string search, rate, or sum over all of the data ingested in that period? E.g., filtering by serviceName, api, statusCode, etc. This would help in checking how well the filters can work to reduce the set of log lines to aggregate over. Without filters, around 50K log lines/s works out to roughly 30 billion log lines for a week; using filters can shrink this search space drastically.
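
For illustration, here is a hedged LogQL sketch of the kind of narrowing being suggested; the label names (`env`), the filter string, and the pattern fields are assumptions and depend on the actual labels and log format:

```logql
# Unfiltered: every nginx line in the range feeds the aggregation.
sum(count_over_time({job="nginx"} [1m]))

# Narrowed: tighter label selectors and a cheap line filter shrink the data
# before the expensive parser and aggregation stages run.
sum by (status_code) (
  count_over_time(
    {job="nginx", env="prod"} |= "checkout"
      | pattern `<_> <_> <status_code> <_>`
    [1m]
  )
)
```

Stream selectors are resolved from the index and line filters run before parsers, so each stage discards data before the next, more expensive one sees it.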

singh9vj commented 2 years ago

Hi, below is a sample query log.

level=info ts=2022-06-29T06:50:07.438992096Z caller=metrics.go:122 component=frontend org_id=xyz latency=slow query="sum by (http_status_code) (count_over_time({job=\"nginx\"}\n| pattern <syslog_timestamp> <timestamp8601> <http_status_code> <program>[<pid>]: <client_ip>:<client_port> [<accept_date>] <frontend_name> <backend_name> <some more pattern>\n#| __error__=\"\" [1s]))" query_type=metric range_type=range length=24h0m0s step=1m0s duration=56.723002396s status=200 limit=1023 returned_lines=0 throughput=836MB total_bytes=47GB queue_time=0s subqueries=289
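
Decoding that line: the query parses 24 hours of {job="nginx"} with the pattern parser and no line filter, scanning 47 GB in ~57 s (~836 MB/s) across 289 subqueries, with a [1s] range vector sampled at a 1m step. A hedged variant for comparison (the line-filter string is a placeholder, and switching the range to [1m] only applies if per-step counts, rather than a 1-second sample each minute, are what is wanted):

```logql
sum by (http_status_code) (
  count_over_time(
    # a cheap line filter before the parser reduces the 47 GB that gets
    # parsed, but only if the string actually excludes lines (placeholder)
    {job="nginx"} |= "placeholder-filter"
      # keep only the field the aggregation needs: the third space-separated
      # token, as in the original pattern expression
      | pattern `<_> <_> <http_status_code> <_>`
    [1m]
  )
)
```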

ankitnayan commented 2 years ago

Throughput doesn't seem bad. How many CPUs and how much memory are allocated to the querier?

sakirma commented 2 years ago

> Yes, we have followed the distributed deployment architecture and run Loki on EC2 instances using the binary.
>
> Below is the architecture:
>
> - 2 boxes for the distributor
> - 3 boxes for ingester + ruler
> - 2 boxes for querier + query-frontend
>
> All of the above boxes are in AWS Auto Scaling groups and scale out as CPU usage increases.
>
> Attached is the Loki configuration: loki-conf.txt

@singh9vj could you explain why split_queries_by_interval is set to 0? I believe this can cause problems with how your queriers end up working.

singh9vj commented 2 years ago

OK, that was updated later; its value is currently 5m.
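
For context, a hedged sketch of where this setting lives and the related fan-out knobs (the exact section varies by Loki version: older 2.x releases put split_queries_by_interval under query_range, newer ones under limits_config; the values below are illustrative, not recommendations):

```yaml
limits_config:
  # Each range query is split into sub-queries of this length and fanned
  # out across queriers; 0 disables splitting entirely.
  split_queries_by_interval: 30m
  # Cap on how many split/sharded sub-queries of one query run in parallel.
  max_query_parallelism: 32

querier:
  # Sub-queries a single querier processes concurrently.
  max_concurrent: 8

frontend:
  # Per-tenant queue depth on the query-frontend.
  max_outstanding_per_tenant: 2048
```

With splitting disabled (0) or very fine (5m over week-long ranges), a query is either not parallelised at all or shattered into a very large number of tiny sub-queries, so the interval is usually tuned against the typical query range.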

Moep90 commented 1 year ago

@singh9vj any update regarding the "slowness"?

khanh96le commented 1 year ago

hi @singh9vj, do you have any update on this issue?

SusannahZ commented 10 months ago

Any update on this? I'm very interested in the 10 TB of daily logs: how is the Loki configuration set up/tuned for that capacity?

omers commented 10 months ago

👀

splitice commented 10 months ago

In our case, the slowness we experienced was largely due to low parallelisation on large, dense (high events/s) chunks.

While large chunks are necessary for good fetch performance, they result in poor resource utilisation because the blocks within a chunk are largely processed sequentially. A large chunk of, say, 256 MB covering a whole day is great: the querier easily eliminates blocks and lines that are not of interest, reducing the work without needing expensive line filters. But a large chunk of, say, 256 MB covering only 3 hours can be slow to query on just a single core (series processing is sequential).
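
For reference, the chunk size and time span being discussed are governed by the ingester settings; a hedged sketch of the relevant knobs (values shown are illustrative, not recommendations):

```yaml
ingester:
  # Target compressed chunk size before a new chunk is cut; bigger chunks
  # fetch efficiently from object storage but are processed largely
  # sequentially at query time.
  chunk_target_size: 1572864   # bytes (~1.5 MB), the default
  # A chunk is flushed no later than this, bounding the time span (and so
  # the density) a single chunk can cover.
  max_chunk_age: 2h
  # Flush chunks for streams that have stopped receiving data.
  chunk_idle_period: 30m
  chunk_encoding: snappy
```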

Initially, to resolve this, we used liguozhong's PR #5455 to parallelise querying within a chunk. This worked well; however, the implementation was not without its issues.

Now we are experimenting with changing our label structure (https://github.com/vectordotdev/vector/issues/19246) to create more parallelisation only when the chunks would otherwise be too dense. This also helps balance the ingestion work a lot. It does reduce the compression rate a bit and increases the number of chunks created.

This isn't a solution that will suit everyone, as it's external to Loki.

For reference, many of our queries need to scan 50 GB+ of chunks (typically one week's worth of data). We have 9 querier nodes running on a mix of 2- and 4-core VMs. We have an S3 HTTP cache, but we still need to keep the number of chunks fetched and stored at appropriate levels to avoid rate limiting by the backing store. The rates we have chosen in our Lua script mean a typical peak creates about 6x more chunks during those peaks (typically there is already some sharding taking place). We intend to increase our chunk size again once this new data structure is fully verified, to further reduce the number of chunks generated for "slow" streams; we previously had to limit chunk size to help balance load during peaks.
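
As a rough illustration of the label-sharding approach (splitice does this with a rate-aware Lua script; the sketch below is a simplified random version, and the source name, shard count, endpoint, and labels are all assumptions):

```yaml
transforms:
  shard_dense_streams:
    type: lua
    version: "2"
    inputs: ["nginx_logs"]          # hypothetical upstream source
    hooks:
      process: |-
        function (event, emit)
          -- Add a small, bounded shard key so one very dense stream becomes
          -- a handful of Loki streams, each with its own chunks that can be
          -- queried in parallel. (The real logic would shard only above a
          -- certain event rate.)
          event.log.shard = tostring(math.random(0, 3))
          emit(event)
        end

sinks:
  loki:
    type: loki
    inputs: ["shard_dense_streams"]
    endpoint: "http://loki-distributor:3100"   # hypothetical endpoint
    encoding:
      codec: text
    labels:
      job: nginx
      shard: "{{ shard }}"          # the shard key becomes a stream label
```

Keeping the shard label bounded to a handful of values is what keeps the extra stream cardinality, and the extra chunks, manageable.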

Loki could probably implement sharding like this transparently, through automatic chunk sharding at a configurable event rate at the distributor level. This would potentially result in more consistent performance than the current default behaviour.