grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

query_ingesters_within misconfiguration may cause query results to leave out logs #6043

Open honganan opened 2 years ago

honganan commented 2 years ago

Describe the bug I have a log line which has several labels, like application=myapp, source=error.log, and it contains some keywords:

2022-04-28 06:11:31.239, TID:221a40a1baa748cbbeeea45b2af4cbb2.393.16511262905704693, ERROR, com.xxx.xxx.xxx.dao.BaseDAO, BaseDAO.java, 223, add id clash id=6131027,trynum=3,e=java.sql.SQLIntegrityConstraintViolationException,....

When I query with only the label matchers and no pipeline, the line is returned. But when I add a pipeline filter with a keyword contained in the log line, the result is empty.

To Reproduce Steps to reproduce the behavior:

  1. Started Loki (2.4.2)
  2. Config write component with:
    chunk_idle_period: 3h 
    chunk_retain_period: 30s 
    max_chunk_age: 2h
  3. Config querier and query-frontend component with:
    query_store_only: false
    query_ingesters_within: 15m
  4. Execute 2 queries:
    1st: {application="myapp",source="error.log"}
    2nd: {application="myapp",source="error.log"} |= "SQLIntegrityConstraintViolationException"

Expected behavior Both queries are expected to return the log line, but only the 1st did. After about 2 hours, both of them returned the log line.

Environment:

honganan commented 2 years ago

I guess it's because the stream produces logs slowly and the chunk has not been flushed yet when I query, while query_ingesters_within is configured to query the ingesters only for the most recent 15 minutes. So logs between now-2h and now-15m cannot be queried. I want to confirm: is my understanding right?

If that's the problem, should we add some checks on the config values and emit a warning?
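
To illustrate the gap with the values above (just a sketch of the relevant settings from the reproduction steps, grouped under their usual sections):

    ingester:
      chunk_idle_period: 3h        # slow streams can sit unflushed this long
      max_chunk_age: 2h            # a chunk may stay in the ingester up to ~2h before flush
    querier:
      query_ingesters_within: 15m  # but ingesters are only queried for the last 15m
    # => logs older than 15m that are still only in ingester memory are invisible to queries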

cyriltovena commented 2 years ago

cc @sandeepsukhani

sandeepsukhani commented 2 years ago

I guess it's because the stream produces logs slowly and the chunk has not been flushed yet when I query, while query_ingesters_within is configured to query the ingesters only for the most recent 15 minutes. So logs between now-2h and now-15m cannot be queried. I want to confirm: is my understanding right?

If that's the problem, should we add some checks on the config values and emit a warning?

Yes, you got it right. query_ingesters_within should be at least max_chunk_age, which defaults to 2h. We should add some validation for this and other similar configs where Loki operators can make mistakes that could be avoided by checks.
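
As a minimal sketch of that relationship (illustrative values, assuming the usual ingester/querier sections):

    ingester:
      max_chunk_age: 2h        # longest a chunk can stay in memory before being flushed
    querier:
      # keep this >= max_chunk_age (plus some slack) so not-yet-flushed chunks stay queryable
      query_ingesters_within: 3h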

duj4 commented 7 months ago

Hi @sandeepsukhani ,

It seems I have hit a similar issue. I am running Loki on Kubernetes in SSD mode, with max_chunk_age left at the default 2h and query_ingesters_within set to 3h. I ingested a log snippet that was generated on March 1st:

2024-03-01 02:52:27,703+0800 jira-stats-0 INFO [c.a.j.util.stats.JiraStats] [JIRA-STATS] [NODE-START] snapshot stats: duration=PT1H0.007S, statsOverhead=n/a, data={"_statsName":"NODE-START","_statsType":"snapshot","_time":"2024-02-29T18:52:27.702Z","_timestamp":1709232747702,"_duration":"PT1H0.007S","_invocations":0,"_statsOverhead":"n/a","configuration":{},"checkIndexOnStart":{"result":"NOT_RUN","timeInSeconds":0},"getIndexBy":{"rebuildLocalIndex":{"result":"NOT_RUN","timeInSeconds":0},"pickIndexSnapshotFromSharedHome":{"result":"NOT_RUN","timeInSeconds":0},"requestIndexSnapshotFromAnotherNode":{"result":"NOT_RUN","timeInSeconds":0},"performFullForegroundReindex":{"result":"NOT_RUN","timeInSeconds":0}},"ensureFreshIndexSnapshot":{"result":"NOT_RUN","snapshotExisted":false,"snapshotCreated":false,"timeInSeconds":0}}
2024-03-01 02:52:27,704+0800 jira-stats-0 INFO [c.a.j.util.stats.JiraStats] [JIRA-STATS] [NODE-START] total stats: duration=PT205H6M19.18S, statsOverhead=n/a, data={"_statsName":"NODE-START","_statsType":"total","_time":"2024-02-29T18:52:27.702Z","_timestamp":1709232747702,"_duration":"PT205H6M19.18S","_invocations":0,"_statsOverhead":"n/a","configuration":{},"checkIndexOnStart":{"result":"NOT_RUN","timeInSeconds":0},"getIndexBy":{"rebuildLocalIndex":{"result":"NOT_RUN","timeInSeconds":0},"pickIndexSnapshotFromSharedHome":{"result":"NOT_RUN","timeInSeconds":0},"requestIndexSnapshotFromAnotherNode":{"result":"NOT_RUN","timeInSeconds":0},"performFullForegroundReindex":{"result":"NOT_RUN","timeInSeconds":0}},"ensureFreshIndexSnapshot":{"result":"NOT_RUN","snapshotExisted":false,"snapshotCreated":false,"timeInSeconds":0}}
2024-03-01 02:52:29,578+0800 plugin-transaction-0 INFO [c.a.jira.plugin.PluginTransactionListener] [JIRA-STATS] [plugin-transaction] numberStartEvents:627, numberEndEvents:627, numberSendEvents:278, numberEventsInTransactions:12196, numberOfPluginEnableEvents:276
2024-03-01 02:52:43,445+0800 Caesium-1-2 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Start to check for index consistency.
2024-03-01 02:52:43,446+0800 Caesium-1-2 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Index consistency check finished. indexConsistent: true; timeToCheckIndexConsistency: 0 ms
2024-03-01 02:52:43,446+0800 Caesium-1-2 INFO ServiceRunner [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 1.343 ms

When I tried to query it from Grafana, it gave me an error (screenshot attached in the original issue).

But if I ingest a log snippet generated just now, everything looks normal:

2024-03-06 16:32:14,374+0800 Caesium-1-1 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Start to check for index consistency.
2024-03-06 16:32:14,380+0800 Caesium-1-1 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Index consistency check finished. indexConsistent: true; timeToCheckIndexConsistency: 2 ms
2024-03-06 16:32:14,380+0800 Caesium-1-1 INFO ServiceRunner [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 5.981 ms
2024-03-06 16:33:14,375+0800 Caesium-1-4 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Start to check for index consistency.
2024-03-06 16:33:14,379+0800 Caesium-1-4 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Index consistency check finished. indexConsistent: true; timeToCheckIndexConsistency: 0 ms
2024-03-06 16:33:14,380+0800 Caesium-1-4 INFO ServiceRunner [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 3.970 ms

Per the doc:

query_ingesters_within: Maximum lookback beyond which queries are not sent to ingester

So it means that if the query time range (7 days) is outside the scope of query_ingesters_within (3h), Loki will not search the ingesters at all, only the backend storage. However, since the data is still within the window of max_chunk_age (2h) or chunk_idle_period (30min), the logs in memory have not been flushed to storage yet, so I cannot find the logs in either memory or storage.

Once I set query_ingesters_within to 0, the older logs became searchable again.
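
For reference, the workaround I applied (as I read the docs, 0 disables the lookback limit so queries are always also sent to the ingesters):

    querier:
      # 0 = no lookback limit; every query is also sent to the ingesters
      query_ingesters_within: 0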

Please correct me if there is any misunderstanding.

Thanks.

honganan commented 3 months ago

So it means that if the query time range (7 days) is outside the scope of query_ingesters_within (3h), Loki will not search the ingesters at all, only the backend storage. However, since the data is still within the window of max_chunk_age (2h) or chunk_idle_period (30min), the logs in memory have not been flushed to storage yet, so I cannot find the logs in either memory or storage.

Once I set query_ingesters_within to 0, the older logs became searchable again.

Please correct me if there is any misunderstanding.

The max_chunk_age config is like a time window: any logs outside [now - max_chunk_age/2, now] will be discarded. This behavior can be observed via the loki_discarded_samples_total metric.

So I think your problem is that the logs generated on March 1st are outside that window and were discarded, not the query config.
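
A rough sketch of that arithmetic, following the description above (illustrative only, not verified against the code):

    ingester:
      max_chunk_age: 2h   # default
    # acceptance window described above ~ max_chunk_age / 2 = 1h behind "now"
    # entries stamped 2024-03-01 but pushed on 2024-03-06 are ~5 days outside that window
    # => rejected at ingestion and counted in loki_discarded_samples_total,
    #    so no query-side setting can bring them back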