moxious / triage

testing triage actions for issues

Error Feedback for LOKI3.0 #348

Open tonypowa opened 3 months ago

tonypowa commented 3 months ago

What happened?

I sometimes get timeouts when querying recent logs from the last three hours. I'm monitoring a single log file with the LOKI 3.0 logging stack (promtail, loki, and grafana); the file generates approximately 50MB of data per day and does not exceed 50,000 records. Sometimes I can query a day's worth of records, but other times I can't even query an hour's worth. How can I resolve this issue?

Loki.yaml:

auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /opt/app/loki

schema_config:
  configs:

storage_config:
  filesystem:
    directory: /opt/app/loki/chunks

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 72h

table_manager:
  retention_deletes_enabled: true
  retention_period: 30d

promtail.yaml:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/log/positions.yaml # This location needs to be writeable by promtail.
  sync_period: 60s
  # ignore_invalid_yaml: true # This line is commented out; uncomment it if needed.

client:
  url: http://localhost:3100/loki/api/v1/push
  url: http://172.16.16.157:3100/loki/api/v1/push
  tenant_id: tenant1

scrape_configs:

The time range I searched for is from 12:00 to 12:10 on May 6, 2024. This query range failed to retrieve results, but when I searched for the past 6 hours, including the mentioned time period, the query was successful! Why is that? How can this be resolved?
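One way to compare the two ranges outside Grafana is to call Loki's `/loki/api/v1/query_range` HTTP endpoint directly and time each request. The following is only a rough sketch: the selector `{job="app"}` is a hypothetical placeholder, the base URL is taken from the promtail client URL above, the times are assumed to be UTC, and the 6-hour window is assumed to end at 12:10 rather than at "now".

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

LOKI_URL = "http://localhost:3100"  # assumption: Loki listens where promtail pushes
QUERY = '{job="app"}'               # hypothetical selector; replace with real labels


def to_ns(moment):
    """Convert a timezone-aware datetime to a Unix timestamp in nanoseconds."""
    return int(moment.timestamp()) * 1_000_000_000


def build_query_range_url(base_url, query, start, end, limit=100):
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urllib.parse.urlencode({
        "query": query,
        "start": to_ns(start),
        "end": to_ns(end),
        "limit": limit,
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"


def time_query(url, timeout=60.0):
    """Run one query and return (outcome, elapsed seconds)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            outcome = f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        outcome = f"HTTP {exc.code}"
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        outcome = f"failed: {exc}"
    return outcome, time.monotonic() - started


def compare_windows():
    """Time the failing 10-minute window against a 6-hour window that contains it."""
    end = datetime(2024, 5, 6, 12, 10, tzinfo=timezone.utc)  # assumption: UTC
    windows = {
        "10 minutes": (end - timedelta(minutes=10), end),
        "6 hours": (end - timedelta(hours=6), end),
    }
    for name, (start, stop) in windows.items():
        url = build_query_range_url(LOKI_URL, QUERY, start, stop)
        outcome, elapsed = time_query(url)
        print(f"{name}: {outcome} in {elapsed:.1f}s")
```

Calling `compare_windows()` on the Loki host prints the outcome and latency for each window. If the short window also fails here, the problem is probably on the Loki side rather than in Grafana or the browser.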

What did you expect to happen?

Time Out

1

Did this work before?

grafana 10.0+

How do we reproduce it?

1.

2.

3.

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

Grafana:

OS:

Browser:

Grafana platform?

None

Datasource(s)?

No response

tonypowa commented 3 months ago

This issue is a test copy of an issue in another repo. Original issue: https://github.com/grafana/grafana/issues/#87383

moxious commented 3 months ago

Summary: User tonypowa is experiencing intermittent timeouts when querying recent logs within a three-hour window using LOKI 3.0, despite a reasonable log file size. Queries for longer periods sometimes work, while shorter ones may fail, and the user provided their configuration files for assistance with resolving the issue.

moxious commented 3 months ago

Hello @tonypowa, it seems like you're experiencing an issue with querying logs in Loki. To better assist you, I've directed your issue to the appropriate team. The Observability Logs Project seems to be the most suitable home for this issue because it relates to problems with the Loki logging system, which falls under the observability domain in Grafana.

/team Observability Logs

A member of the team will follow up soon. Meanwhile, you might want to check if there are known issues with the query ranges in the logs system or consider adjusting the query timeouts/configurations.

moxious commented 3 months ago

Elaboration:

Thank you for the detailed report. To effectively address your timeouts with the LOKI 3.0 system, several more pieces of information would be helpful. Could you please provide the following details to assist in troubleshooting your issue further?

  1. Grafana, OS, and Browser Versions:

    • You've mentioned that Grafana 10.0+ worked previously. Can you confirm the current version of Grafana you're using?
    • What is the operating system, and its version, where LOKI and Grafana are deployed?
    • Which browser and which version of it are you using when you encounter these issues?
  2. Reproduction Steps:

    • You've left the reproduction steps blank. Can you provide step-by-step instructions on how to reproduce this issue?
  3. Error Messages and Logs:

    • Can you provide server-side logs from LOKI around the time when the timeout occurs?
    • Are there any proxy or network logs available that could provide insights into the communication between Grafana and LOKI?
  4. Dashboard Panel Issues:

    • It appears the timeout error screenshot links may be broken. Can you confirm they are correct or re-upload the images?
    • Can you tell us if there are any particular dashboard panels that consistently show the time-out issue, or if this occurs in different panels?
  5. Resource Usage Metrics:

    • What is the CPU and memory usage of the LOKI service when the timeouts occur?
  6. Configuration Detail Validation:

    • It is important to validate if the configurations are correctly applied. Can you verify that the loki.yaml and promtail.yaml configurations you posted are indeed the ones being used by the services?
  7. Query Patterns:

    • Are there specific queries or patterns you've noticed that tend to cause these timeouts more frequently than others?

Once you provide this additional information, it will be much easier to investigate the problems you're facing more thoroughly and offer potential solutions.
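For the resource usage question in item 5, one possible way to collect numbers is to read the process metrics that Loki exposes in Prometheus text format on its HTTP port. This is a minimal sketch, assuming the `/metrics` endpoint is reachable at `localhost:3100` (the port from the posted loki.yaml) and that the standard `process_cpu_seconds_total` and `process_resident_memory_bytes` metrics are present. The helper names are made up for illustration.

```python
import urllib.request

METRICS_URL = "http://localhost:3100/metrics"  # assumption: port from the posted loki.yaml
WANTED = ("process_cpu_seconds_total", "process_resident_memory_bytes")


def parse_process_metrics(text, wanted=WANTED):
    """Extract unlabelled samples for the wanted metric names from Prometheus text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted:
            values[parts[0]] = float(parts[1])
    return values


def fetch_process_metrics(url=METRICS_URL, timeout=10.0):
    """Download the metrics page and return the parsed process metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_process_metrics(response.read().decode("utf-8"))


def cpu_cores_used(first, second, seconds):
    """Average CPU cores used between two samples of the cumulative CPU counter."""
    delta = second["process_cpu_seconds_total"] - first["process_cpu_seconds_total"]
    return delta / seconds
```

Because the CPU metric is a cumulative counter, it takes two samples to get a usage figure: call `fetch_process_metrics()` once before and once after a slow query, then pass both results and the elapsed seconds to `cpu_cores_used`. The memory value can be read directly from either sample.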