vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

amqp sink silently drops messages #19190

Open vnagendra opened 10 months ago

vnagendra commented 10 months ago

Problem

I have a docker-compose setup with a Vector instance and an AMQP sink (RabbitMQ in my case).

My configuration says "healthcheck: True", so at startup Vector correctly establishes a connection and checks its health. When I send messages to the sink, everything works as expected.

If I restart only the RabbitMQ container after the first few messages, wait for it to come back up correctly (within a few seconds; 3 seconds in my case), and then send more messages, I observe the following behavior

My hypothesis is that the channel health is not checked after the clone call here: https://github.com/vectordotdev/vector/blob/master/src/sinks/amqp/sink.rs#L118

The Lapin library already seems to implement this here: https://github.com/amqp-rs/lapin/blob/main/src/connection_status.rs#L66. Maybe we just need to check it?
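The idea of checking the channel state before publishing can be sketched roughly as below. This is a minimal, std-only illustration and not Vector's or Lapin's actual code: the `AmqpChannel` trait, `PublishError`, `publish_checked`, and `FakeChannel` are all hypothetical names standing in for the real channel and its status check.

```rust
/// Hypothetical stand-in for an AMQP channel. This is NOT lapin's API;
/// it only models "ask for the connection state, then publish".
trait AmqpChannel {
    /// Whether the underlying connection is still believed to be open.
    fn is_connected(&self) -> bool;
    /// Publish a payload; may fail if the broker went away.
    fn publish(&mut self, payload: &[u8]) -> Result<(), PublishError>;
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum PublishError {
    /// The state check failed before anything was sent.
    NotConnected,
    /// The publish itself failed.
    Broker(String),
}

/// Check the channel state first, so that a dead channel produces an
/// explicit error instead of a silently dropped event.
fn publish_checked<C: AmqpChannel>(
    channel: &mut C,
    payload: &[u8],
) -> Result<(), PublishError> {
    if !channel.is_connected() {
        return Err(PublishError::NotConnected);
    }
    channel.publish(payload)
}

/// In-memory fake used only to exercise the check.
struct FakeChannel {
    connected: bool,
    sent: Vec<Vec<u8>>,
}

impl AmqpChannel for FakeChannel {
    fn is_connected(&self) -> bool {
        self.connected
    }

    fn publish(&mut self, payload: &[u8]) -> Result<(), PublishError> {
        if !self.connected {
            return Err(PublishError::Broker("connection aborted".to_string()));
        }
        self.sent.push(payload.to_vec());
        Ok(())
    }
}
```

With a fake whose `connected` flag is flipped to `false` (simulating the RabbitMQ restart), `publish_checked` returns `Err(PublishError::NotConnected)` rather than pretending the event was delivered. A check like this only surfaces the failure; it does not by itself restore the connection.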

I don't see any reason this would be different in AWS ECS or any other topology. I am happy to test it if you feel it is necessary to replicate it somewhere else.

Configuration

sinks:
  console:
    inputs:
      - "*"
    type: "console"
    encoding:
      codec: "json"

  rabbit:
    inputs:
      - "*"
    type: "amqp"
    connection_string: "amqp://guest:guest@rabbit:5672/%2f?timeout=10"
    exchange: "vector.incoming"
    routing_key: "{{ source_type }}"
    encoding:
      codec: "json"

Version

vector 0.34.1 (x86_64-unknown-linux-musl 86f1c22 2023-11-16 14:59:10.486846964)

StephenWakely commented 10 months ago

I'm not seeing this behaviour regarding silent errors. When I restart RabbitMQ, Vector very clearly errors with:

2023-11-21T12:35:29.232549Z ERROR lapin::io_loop: Socket was readable but we read 0. This usually means that the connection is half closed this mark it as broken
2023-11-21T12:35:29.232628Z ERROR lapin::io_loop: error doing IO error=IOError(Kind(ConnectionAborted))
2023-11-21T12:35:29.232653Z ERROR lapin::channels: Connection error error=IO error: connection aborted
message
2023-11-21T12:35:52.444637Z ERROR sink{component_kind="sink" component_id=out component_type=amqp}:request{request_id=2}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(DeliveryFailed { error: InvalidChannelState(Error) }) request_id=2 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2023-11-21T12:35:52.444731Z ERROR sink{component_kind="sink" component_id=out component_type=amqp}:request{request_id=2}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true

In vector top, the error count is also increasing.

I'm setting things up via this wiki page. Perhaps you have a different setup that makes a difference?

Visible errors or not, though, it is definitely a problem that Vector does not reconnect when the connection drops. You are correct that we should implement some logic to attempt to reconnect, and potentially to reattempt delivery of a message.
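As a rough illustration of "reconnect, then reattempt delivery", here is a minimal std-only sketch. It is not the sink's implementation: the `Connection` trait, `publish_with_reconnect`, and `FakeConn` are hypothetical names, the `connect` closure stands in for whatever actually opens a new AMQP connection, and a real version would presumably add backoff between attempts.

```rust
/// Hypothetical connection handle; a real sink would wrap its AMQP
/// client types here. Not an existing API.
trait Connection {
    fn is_connected(&self) -> bool;
    fn publish(&mut self, payload: &[u8]) -> Result<(), String>;
}

/// Try to publish up to `max_attempts` times. Before each attempt, a
/// missing or broken connection is replaced by calling `connect`.
/// Returns the last error if every attempt fails.
fn publish_with_reconnect<C, F>(
    slot: &mut Option<C>,
    mut connect: F,
    payload: &[u8],
    max_attempts: usize,
) -> Result<(), String>
where
    C: Connection,
    F: FnMut() -> Result<C, String>,
{
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        // Drop a connection that is known to be broken.
        if slot.as_ref().map_or(false, |c| !c.is_connected()) {
            *slot = None;
        }
        // (Re)connect if there is no usable connection.
        if slot.is_none() {
            match connect() {
                Ok(conn) => *slot = Some(conn),
                Err(e) => {
                    last_err = e;
                    continue;
                }
            }
        }
        let result = slot
            .as_mut()
            .expect("connection was just established")
            .publish(payload);
        match result {
            Ok(()) => return Ok(()),
            Err(e) => {
                // Treat a failed publish as a broken connection and
                // force a reconnect on the next attempt.
                last_err = e;
                *slot = None;
            }
        }
    }
    Err(last_err)
}

/// In-memory fake used only to exercise the retry loop.
struct FakeConn {
    alive: bool,
    sent: usize,
}

impl Connection for FakeConn {
    fn is_connected(&self) -> bool {
        self.alive
    }

    fn publish(&mut self, _payload: &[u8]) -> Result<(), String> {
        if !self.alive {
            return Err("connection aborted".to_string());
        }
        self.sent += 1;
        Ok(())
    }
}
```

Starting from a stale `FakeConn { alive: false, .. }`, one call to `publish_with_reconnect` replaces the connection once and delivers the event. If `connect` keeps failing, the caller gets the last error back and can decide whether to drop or buffer the event.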

vnagendra commented 10 months ago

I will try to reproduce after the break. My guess is that the following happened (going by memory of the code; my setup is not much different from the wiki and I am using a standard vector/alpine container).

In your case:

  1. You were sending data continuously.
  2. You restarted Rabbit, which put the system into an "error" state, so it continued to throw errors.

In my case, what happened is the following:

  1. I sent a bunch of events. The system was happy as the rabbits were happy.
  2. Then I stopped sending events and restarted Rabbit.
  3. Vector had no idea that the rabbits were unhappy for a brief period because there were no events to process (none at all: not just the rabbit sink, the whole system was getting no events).
  4. The connection was lost but the system never knew. When I sent events again AFTER the Rabbit restart, the system couldn't figure out there was a connection problem and silently dropped messages.

This is just a guess, but I'll try to reproduce. Either way, I am glad we agree that we should handle it. I don't think it is too complex to handle. Any chance we can sneak this fix in soon? :)

StephenWakely commented 10 months ago

I need to look into it closer, but it's possible a connection pool might help here. https://crates.io/crates/deadpool-lapin