vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Loki sink: out-of-order logs might cause loss of other in-order logs? #20706

Open MartinEmrich opened 4 months ago

MartinEmrich commented 4 months ago

Problem

I noticed a quite sudden drop in log volume. Looking at both the Vector and Loki logs, I noticed tons of "400 Bad Request" errors in Vector's Loki sink log and tons of "too late timestamp" messages in the Loki log.

So for reasons unknown, Vector tries to deliver some logs with yesterday's timestamp (way beyond what I would expect from any reordering within Vector).

But my impression is: Vector buffers and sends logs in big chunks. I assume Loki will reject such a chunked request if a single log line violates its timestamp ordering requirements, and that the whole request will then be discarded:

2024-06-21T09:05:52.140971Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}:request{request_id=2592}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=Server responded with an error: 400 Bad Request internal_log_rate_limit=true
2024-06-21T09:05:52.141070Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}:request{request_id=2592}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] has been suppressed 1 times.
2024-06-21T09:05:52.141073Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}:request{request_id=2592}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(ServerError { code: 400 }) request_id=2592 error_type="request_failed" stage="sending" internal_log_rate_limit=true

Would that also drop all the other log events contained in that request, not only those from the streams Loki won't accept? Currently it looks that way.

Configuration

sinks:
  loki:
    type: loki
    inputs:
      - input1
      - input2
    encoding:
      codec: json
    out_of_order_action: accept
    labels:
      "*": "{{ loki_labels }}"
      "platform": "{{ platform }}"
    remove_label_fields: true
    endpoint: ...
    batch:
      max_bytes: 1048576
      max_events: 1000
      timeout_secs: 1
    buffer:
      when_full: block
      type: disk
      max_size: 1073741824
    healthcheck:
      enabled: true
      path: ...
      port: ...

Version

timberio/vector:0.39.0-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

Related topic: https://github.com/vectordotdev/vector/issues/5024

daddyrex1 commented 4 months ago

Would be great if someone could look into it.

jszwedko commented 4 months ago

Would that also drop all the other log events contained in that request, not only those from the streams Loki won't accept? Currently it looks that way.

You are correct, the entire request is dropped. Only dropping the entries with invalid timestamps would be a good improvement. It looks like the response from Loki does include the invalid entry, so Vector could seemingly drop just that entry and retry the rest.

I don't know when we'll get to implementing this though. PRs would be welcome.

MartinEmrich commented 4 months ago

After thinking about it for a few days, I have formed somewhat of an opinion:

I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.

In my specific case, the offending log events were a day old. I am still trying to wrap my head around how a log event could survive that long within just two Vector instances without persistence (one as a DaemonSet on the K8s cluster collecting with the kubernetes_logs source, and a second aggregator stage writing them to Loki, both without disk buffers).

For more normal cases, Loki actually has a max-age window of 2h, so it will accept events up to ±1h out of order.
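
As a side note, the relevant knobs on the Loki side (as far as I understand them; option names are from Loki's limits_config, and the values below are only illustrative) look roughly like this:

limits_config:
  # Reject samples older than this outright (Loki's default is 168h).
  reject_old_samples: true
  reject_old_samples_max_age: 24h
  # With unordered writes enabled, out-of-order entries are accepted within
  # roughly half of the ingester's max_chunk_age (about 1h with the 2h default).
  unordered_writes: true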

IMHO Loki should drop the too-old events and keep the rest; this should not be the job of the submitting component (Vector).

I now wonder what the best short-term fix is:

  • What is the Loki protocol exactly, and would disabling batching make it explode? If it keeps persistent connections with pipelining and a lean binary protocol, that could work. But if it is HTTP/JSON based, sending HTTP request headers (bigger than the log event itself) with each event would most likely be prohibitively expensive.
  • I am thinking about dropping too-old events (or re-timestamping and marking them) within VRL... Are there some best practices in place?

Will do some experiments next...

jszwedko commented 4 months ago

I would assume Vector re-uses the buffering/batching code across all similar sinks, and it is most probably not a data structure optimized for selective deletion. So the obvious fix (deleting exactly the unaccepted events from the batch and submitting it again, all while further events are queuing up) might be quite expensive.

That assumption is accurate: a change like this could be quite invasive. We have a similar need for the Elasticsearch sink to handle partial failures: https://github.com/vectordotdev/vector/issues/140.

  • What is the Loki protocol exactly, and would disabling batching make it explode? If it keeps persistent connections with pipelining and a lean binary protocol, that could work. But if it is HTTP/JSON based, sending HTTP request headers (bigger than the log event itself) with each event would most likely be prohibitively expensive.

By default it sends protobuf messages over HTTP, but can be configured to send JSON per https://vector.dev/docs/reference/configuration/sinks/loki/#request-encoding
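
As a rough illustration (please treat this as an assumption to verify against the docs for your Vector version: as far as I know, the push payload format follows the compression setting, with snappy, the default, implying protobuf):

sinks:
  loki:
    type: loki
    inputs:
      - input1
    endpoint: ...
    # Assumption: `compression: snappy` (the default) sends protobuf push requests;
    # `none` or `gzip` should make the sink use Loki's JSON push format instead.
    compression: none
    encoding:
      codec: json   # this codec shapes each event's log line, not the push request format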

  • I am thinking about dropping too-old events (or re-timestamping and marking them) within VRL... Are there some best practices in place?

That's a good thought; you could do something like that with a remap transform and VRL: first check whether the timestamp is too old (compare it with now()) and then either reset it or drop the event.
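
A minimal sketch of such a guard transform (the component name and the one-hour cutoff are just placeholders, so adjust them to your setup); with out_of_order_action: accept already set, dropping or re-stamping the stragglers before the sink should keep a single day-old event from poisoning a whole batch:

transforms:
  drop_old_events:        # hypothetical name; wire it between your inputs and the loki sink
    type: remap
    inputs:
      - input1
    drop_on_abort: true   # aborted events are removed from the stream
    source: |
      # Fall back to now() if the event has no usable timestamp.
      ts = timestamp(.timestamp) ?? now()
      if to_unix_timestamp(now()) - to_unix_timestamp(ts) > 3600 {
        # Either drop the event outright...
        abort
        # ...or comment out `abort` above and re-stamp/mark the event instead:
        # .original_timestamp = .timestamp
        # .timestamp = now()
        # .late_arrival = true
      }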