grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki 3.0: Empty chunk causes 'Invalid chunk checksum' when querying #13634

Open honganan opened 1 month ago

honganan commented 1 month ago

Describe the bug

After upgrading the write and read paths to 3.0, one of my queries fails with an 'Invalid chunk checksum' error.

level=error ts=2024-07-24T01:53:18.385746393Z caller=batch.go:747 org_id=fake traceID=67f0762cd67062a2 msg="error fetching chunks" err="invalid chunk checksum"

I added debug logging to print the S3 key of the invalid chunk and found that the chunk object is 0 bytes:

aws s3 ls s3://xx-xxxx-xx/fake/6ee23ab9f52a8eb1/190e1f1fa71:190e1f9627b:13453489 --human-read --summarize
2024-07-24 07:42:18    0 Bytes 190e1f1fa71:190e1f9627b:13453489

Total Objects: 1
   Total Size: 0 Bytes

This is a sub-case of #8564, but I think this particular case is clear-cut and we can fix it first by skipping 0-byte chunks. What do you think?

The other piece of work is to figure out how the write path created 0-byte chunks in the first place.
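A minimal sketch of the read-path workaround proposed above, in case it helps the discussion. It is not Loki's actual code: the `fetchedChunk` type, the `dropEmptyChunks` helper, and the example keys are all hypothetical, and the only assumption is that the fetched payloads are available as byte slices before they are decoded and checksummed.

```go
package main

import "fmt"

// fetchedChunk is a hypothetical stand-in for an object returned by the
// object store; it is not a Loki type.
type fetchedChunk struct {
	Key  string
	Data []byte
}

// dropEmptyChunks separates chunks that carry a payload from zero-byte ones.
// The keys of skipped chunks are returned so they can be logged and the
// write path investigated later.
func dropEmptyChunks(in []fetchedChunk) (kept []fetchedChunk, skipped []string) {
	for _, c := range in {
		if len(c.Data) == 0 {
			skipped = append(skipped, c.Key)
			continue
		}
		kept = append(kept, c)
	}
	return kept, skipped
}

func main() {
	// Made-up keys and payloads, for illustration only.
	chunks := []fetchedChunk{
		{Key: "fake/example-empty", Data: nil},
		{Key: "fake/example-ok", Data: []byte{0x01, 0x02}},
	}
	kept, skipped := dropEmptyChunks(chunks)
	fmt.Printf("kept=%d skipped=%v\n", len(kept), skipped)
}
```

Returning the skipped keys rather than dropping them silently keeps the evidence needed to track down how the empty objects were written.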

To Reproduce Steps to reproduce the behavior:

  1. Run Loki 3.0 (the error occurs only occasionally and is hard to reproduce)

honganan commented 1 month ago

Update: After skipping the empty chunks in our version, I still saw Invalid chunk checksum errors. I added logging to print the chunk's key and downloaded the chunk. When I inspected it with the chunks-inspect tool, it reported an error: error while reading metadata bytes: unexpected EOF. It seems the metadata length that gets read is incorrect.

I cannot share the chunk here because it comes from our production environment, but I am happy to help investigate in any other way (perhaps by providing the problem chunk privately?).

ravishankar15 commented 2 weeks ago

The error from chunks-inspect is a consequence of the zero-byte chunk, since the ReadFull call is expected to return an error there: https://github.com/grafana/loki/blob/main/cmd/chunks-inspect/header.go#L42. Maybe we are trying to read chunks that were created before your patch.
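For reference, a small sketch of how Go's standard `io.ReadFull` reports short input. This only demonstrates standard-library behavior; `readHeader` is a hypothetical helper, and the sketch has not been checked against what chunks-inspect actually does at that line. `io.ReadFull` returns `io.EOF` when no bytes at all could be read, and `io.ErrUnexpectedEOF` ("unexpected EOF") when some bytes were read but fewer than requested, so it may be worth checking whether the failing object is truly empty or is non-empty but shorter than the length being read.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// readHeader is a hypothetical helper: it tries to read exactly n bytes
// from data and classifies the outcome.
func readHeader(data []byte, n int) string {
	buf := make([]byte, n)
	_, err := io.ReadFull(bytes.NewReader(data), buf)
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.ErrUnexpectedEOF):
		// Some bytes were read, but fewer than n.
		return "truncated"
	case errors.Is(err, io.EOF):
		// No bytes were read at all.
		return "empty"
	default:
		return "error: " + err.Error()
	}
}

func main() {
	fmt.Println(readHeader(nil, 4))              // zero-byte input
	fmt.Println(readHeader([]byte{1, 2}, 4))     // shorter than requested
	fmt.Println(readHeader([]byte{1, 2, 3, 4}, 4)) // exactly enough
}
```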

When you say skipping, have you tried handling the case in both the write and read paths?

honganan commented 1 week ago

> When you say skipping, have you tried handling the case in both the write and read paths?

I only handled it in the read path, to make the query work.