grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

[performance] label: takes 25 seconds when cardinality >= 10w (~100k) #6243

Open liguozhong opened 2 years ago

liguozhong commented 2 years ago

Describe the bug

/loki/api/v1/label?start=1653399718498000000&end=1653403318498000000

Loki's slow label HTTP handler makes for a poor experience in Grafana's Loki Explore: it takes 25 seconds to load the label prompt box on the left.
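For context, the request can be timed from the client side. Below is a minimal, illustrative timing harness, not Loki code: the endpoint, time range, and tenant ID come from this report, while the localhost address is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Endpoint and time range from this report; the address is assumed.
	url := "http://localhost:3100/loki/api/v1/label?start=1653399718498000000&end=1653403318498000000"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Multi-tenant Loki reads the tenant ID from the X-Scope-OrgID header.
	req.Header.Set("X-Scope-OrgID", "1662_qamopdln")

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	// On the affected tenant this elapsed time is reportedly ~25s.
	fmt.Printf("status=%d elapsed=%s bytes=%d\n", resp.StatusCode, time.Since(start), len(body))
}
```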

level=debug ts=2022-05-25T03:16:47.274713123Z caller=series_index_store.go:95 org_id=1662_qamopdln traceID=46b14075ea76f47d series-ids=64129

level=debug ts=2022-05-25T03:16:47.275711308Z caller=series_index_store.go:390 org_id=1662_qamopdln traceID=46b14075ea76f47d msg="post intersection" matchers=1 ids=64191

Code path:

pkg/querier/querier.go:352
    func (q *SingleTenantQuerier) Label(ctx context.Context, req *logproto.LabelRequest) (*logproto.LabelResponse, error) { ... }

pkg/storage/stores/series/series_index_store.go:220
    func (c *indexStore) LabelNamesForMetricName(ctx context.Context, userID string, from, through model.Time, metricName string) ([]string, error) { ... }

pkg/storage/stores/series/series_index_store.go:502
    func (c *indexStore) lookupEntriesByQueries(ctx context.Context, queries []index.Query) ([]index.Entry, error) {
        err := c.index.QueryPages(ctx, queries, func(query index.Query, resp index.ReadBatchResult) bool {
            ...
        })
        ...
    }
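The fragment above elides the body of lookupEntriesByQueries. As a rough illustration of why the handler degrades with cardinality, the self-contained sketch below uses stand-in types and a hypothetical lookupEntries helper (not Loki's index.Query / index.Entry) to mimic that shape: every index query is paged through and every matching row is appended to one slice, so a request that touches ~64k series collects ~64k entries before any label names can be returned.

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in types for illustration only; the real code uses index.Query and
// index.Entry from Loki's series index store.
type query struct{ hashValue string }
type entry struct{ hashValue, rangeValue string }

// lookupEntries mimics the shape of lookupEntriesByQueries: each query is
// paged through and every row is appended to a single slice, so the work and
// the intermediate result both grow linearly with series cardinality.
func lookupEntries(queries []query, pages func(q query) [][]entry) []entry {
	var (
		mu      sync.Mutex
		entries []entry
		wg      sync.WaitGroup
	)
	for _, q := range queries {
		wg.Add(1)
		go func(q query) {
			defer wg.Done()
			for _, page := range pages(q) { // results arrive page by page
				mu.Lock()
				entries = append(entries, page...)
				mu.Unlock()
			}
		}(q)
	}
	wg.Wait()
	return entries
}

func main() {
	// Fake index returning ~64k rows for a single label query, roughly the
	// series count shown in the debug logs above.
	pages := func(q query) [][]entry {
		page := make([]entry, 64000)
		for i := range page {
			page[i] = entry{hashValue: q.hashValue, rangeValue: fmt.Sprintf("series-%d", i)}
		}
		return [][]entry{page}
	}
	got := lookupEntries([]query{{hashValue: "1662_qamopdln:logs"}}, pages)
	fmt.Println("entries collected:", len(got))
}
```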


To Reproduce Steps to reproduce the behavior:

  1. Started Loki (SHA or version)
  2. Started Promtail (SHA or version) to tail '...'
  3. Query: {} term

Expected behavior

Environment:

Screenshots, Promtail config, or terminal output

(Screenshots attached in the original issue, including one showing the slow label request.)

honganan commented 2 years ago

Our scenario does not have that many labels, but the label scan is still the performance bottleneck. When querying a big tenant, one query shard needs to scan over 20k chunk IDs from Cassandra, which takes several seconds, and CPU usage momentarily spikes to 100%.

I am wondering whether we could split by time shards, then compress and store the IDs for every stream.
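As a rough illustration of the "split by time shards, then compress and store IDs per stream" idea, here is a generic delta-plus-varint encoding sketch. compressIDs and decompressIDs are hypothetical helpers, not Loki code, and real chunk IDs are strings, so they would first need a numeric or dictionary mapping.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// compressIDs sorts a list of numeric IDs and stores the gaps between
// consecutive IDs as varints. For dense, mostly sequential IDs this packs a
// 20k-entry list into a few bytes per ID.
func compressIDs(ids []uint64) []byte {
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	buf := make([]byte, 0, len(ids))
	tmp := make([]byte, binary.MaxVarintLen64)
	var prev uint64
	for _, id := range ids {
		n := binary.PutUvarint(tmp, id-prev) // store the gap, not the absolute value
		buf = append(buf, tmp[:n]...)
		prev = id
	}
	return buf
}

// decompressIDs reverses compressIDs.
func decompressIDs(buf []byte) []uint64 {
	var (
		ids  []uint64
		prev uint64
	)
	for len(buf) > 0 {
		delta, n := binary.Uvarint(buf)
		buf = buf[n:]
		prev += delta
		ids = append(ids, prev)
	}
	return ids
}

func main() {
	// 20k roughly sequential IDs, like the chunk IDs one query shard scans.
	ids := make([]uint64, 20000)
	for i := range ids {
		ids[i] = 1_000_000 + uint64(i)*3
	}
	packed := compressIDs(ids)
	fmt.Printf("raw: %d bytes, packed: %d bytes, restored: %d ids\n",
		len(ids)*8, len(packed), len(decompressIDs(packed)))
}
```

Whether this helps in practice would depend on how the per-stream lists are keyed by time shard and on the ID-to-integer mapping, but it shows where the expected space and scan savings would come from.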

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly go through closed issues that have a stale label, sorted by thumbs-up count.

We may also:

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.