MartinEmrich opened this issue 4 months ago
Would be great if someone could look into it.
Would that also drop all other log events contained in that request, not only the streams Loki won't accept? Currently it looks that way.
You are correct, the entire request is dropped. Only dropping the entry with the invalid timestamp would be a good improvement. It looks like the response from Loki does include the invalid entry, so seemingly Vector could drop just that entry and retry.
I don't know when we'll get to implementing this though. PRs would be welcome.
Thinking about it for a few days, I formed somewhat of an opinion:
I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.
In my specific case, the offending log events were a day old; I am still trying to wrap my head around how a log event could survive that long within just two Vector instances without persistence (one as a DaemonSet on the K8s cluster collecting with the kubernetes_logs source, and a second aggregator stage writing them to Loki, both without disk buffers).
For more normal cases, Loki actually has a max-age window of 2h, so it will accept events up to ±1h out of order.
IMHO Loki should drop too-old events and keep the rest; this should not be the job of the submitting component (Vector).
I now wonder what the best short-term fix would be.
Will do some experiments next...
I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.
That assumption is accurate: a change like this could be quite invasive. We have a similar need for the Elasticsearch sink to handle partial failures: https://github.com/vectordotdev/vector/issues/140.
- What is the Loki protocol exactly; would disabling batching make it explode? If it keeps persistent connections with pipelining and a lean binary protocol, that could work. But if it is HTTP/JSON based, sending HTTP request headers (bigger than the log event itself) with each event would most likely be prohibitively expensive.
By default it sends protobuf messages over HTTP, but can be configured to send JSON per https://vector.dev/docs/reference/configuration/sinks/loki/#request-encoding
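A minimal, untested sketch of a Loki sink config touching both the request encoding and the batching question above (component names, endpoint, and label are placeholders; my reading of the linked page is that the request body is protobuf with the default snappy compression and JSON otherwise):

```toml
[sinks.loki]
type = "loki"
inputs = ["my_upstream"]        # placeholder: name of the source/transform feeding this sink
endpoint = "http://loki:3100"   # placeholder endpoint
labels.agent = "vector"         # Loki needs at least one stream label
encoding.codec = "json"         # how each individual log line is serialized
# Per the linked docs page, the request body is protobuf when compression is
# "snappy" (the default) and JSON otherwise:
compression = "gzip"
# Rather than disabling batching entirely, smaller batches at least limit how
# many events are lost per rejected request, at the cost of more HTTP overhead:
batch.max_events = 100
```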
- I think about dropping too-old events (or re-timestamping and marking them) within VRL... Are there some best practices in place?
That's a good thought, you could do something like that with a remap transform and VRL. First check whether the timestamp is too old (compare it with now()) and then either reset it or drop the event.
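A minimal, untested sketch of such a transform (the upstream component name is a placeholder, the event timestamp is assumed to live in `.timestamp`, and the one-hour cutoff is arbitrary):

```toml
[transforms.drop_stale]
type = "remap"
inputs = ["my_source"]  # placeholder: name of the upstream source/transform
drop_on_abort = true    # default; aborted events are dropped from the stream
source = '''
# Event age in seconds, falling back to now() if .timestamp is missing or
# not a timestamp.
age = to_unix_timestamp(now()) - to_unix_timestamp(timestamp(.timestamp) ?? now())
if age > 3600 {
  # Either drop the event...
  abort
  # ...or re-stamp and mark it instead:
  # .original_timestamp = .timestamp
  # .timestamp = now()
  # .late = true
}
'''
```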
A note for the community
Problem
I noticed a quite sudden drop in log volume. Looking at both the Vector and Loki logs, I noticed tons of "400 Bad Request" responses in Vector's Loki sink log, and tons of "too late timestamp" messages in the Loki log.
So for reasons unknown, Vector tries to deliver some logs with yesterday's timestamp (way beyond expectations for any reordering within Vector).
But my impression is: Vector buffers and sends logs in big chunks. I assume Loki will reject such a chunked request if one single log line violates its timestamp monotonicity requirements. Now the whole request will be discarded:
Would that also drop all other log events contained in that request, not only the streams Loki won't accept? Currently it looks that way.
Configuration
Version
timberio/vector:0.39.0-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
Related topic: https://github.com/vectordotdev/vector/issues/5024