amqp sink silently drops messages

vnagendra commented 12 months ago

Problem

I have a docker-compose setup with a Vector instance and an AMQP sink (RabbitMQ in my case).

My configuration says "healthcheck: True", so at startup Vector correctly establishes a connection and checks its health. When I send messages to the sink, everything works as expected.

If I restart only the RabbitMQ container after the first few messages, wait for it to come back up correctly (within a few seconds, 3 seconds in my case), and then send more messages, I observe the following behavior:

Vector correctly receives the messages, as evidenced by the stdout output
There are no errors on the console or anywhere
However the messages get dropped silently and don't get delivered

My hypothesis is that the channel's health is not checked after the clone call here: https://github.com/vectordotdev/vector/blob/master/src/sinks/amqp/sink.rs#L118

The Lapin library already seems to track this state here: https://github.com/amqp-rs/lapin/blob/main/src/connection_status.rs#L66. Maybe we just need to check it?
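To make the idea concrete, here is a minimal sketch of "check the state after the clone, before publishing". Everything in it is a hypothetical stand-in (`MockChannel`, `is_connected`, `publish_checked`, `PublishError`); it is not Vector's sink code and not Lapin's API, only an illustration of the check I have in mind.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an AMQP channel handle (not lapin's real type).
#[derive(Clone)]
struct MockChannel {
    // The flag is shared across clones, so a clone sees the same broken connection.
    connected: Arc<AtomicBool>,
}

#[derive(Debug, PartialEq)]
enum PublishError {
    ChannelClosed,
}

impl MockChannel {
    fn new() -> Self {
        Self { connected: Arc::new(AtomicBool::new(true)) }
    }

    /// Simulates the broker going away, for example a container restart.
    fn mark_broken(&self) {
        self.connected.store(false, Ordering::SeqCst);
    }

    fn is_connected(&self) -> bool {
        self.connected.load(Ordering::SeqCst)
    }
}

/// Clones the channel (mirroring the clone call mentioned above), then checks
/// its state and returns an explicit error instead of publishing blindly.
/// On success it returns the number of bytes that would have been published.
fn publish_checked(channel: &MockChannel, payload: &[u8]) -> Result<usize, PublishError> {
    let cloned = channel.clone();
    if !cloned.is_connected() {
        return Err(PublishError::ChannelClosed);
    }
    Ok(payload.len())
}
```

With a check like this, a publish attempted after the broker restart would surface as an error that the sink can log or act on, rather than disappearing.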

I don't see any reason this would be different in AWS ECS or any other topology. I am happy to test it if you feel it is necessary to replicate it somewhere else.

Configuration

sinks:
  console:
    inputs:
      - "*"
    type: "console"
    encoding:
      codec: "json"

  rabbit:
    inputs:
      - "*"
    type: "amqp"
    connection_string: "amqp://guest:guest@rabbit:5672/%2f?timeout=10"
    exchange: "vector.incoming"
    routing_key: "{{ source_type }}"
    encoding:
      codec: "json"

Version

vector 0.34.1 (x86_64-unknown-linux-musl 86f1c22 2023-11-16 14:59:10.486846964)

StephenWakely commented 11 months ago

I'm not seeing this behaviour regarding silent errors. When I restart rabbitmq Vector very clearly errors with:

2023-11-21T12:35:29.232549Z ERROR lapin::io_loop: Socket was readable but we read 0. This usually means that the connection is half closed this mark it as broken
2023-11-21T12:35:29.232628Z ERROR lapin::io_loop: error doing IO error=IOError(Kind(ConnectionAborted))
2023-11-21T12:35:29.232653Z ERROR lapin::channels: Connection error error=IO error: connection aborted
message
2023-11-21T12:35:52.444637Z ERROR sink{component_kind="sink" component_id=out component_type=amqp}:request{request_id=2}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(DeliveryFailed { error: InvalidChannelState(Error) }) request_id=2 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2023-11-21T12:35:52.444731Z ERROR sink{component_kind="sink" component_id=out component_type=amqp}:request{request_id=2}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true

Viewing through vector top, errors are also increasing.

I'm setting things up via this wiki page. Perhaps you have a different setup that makes a difference?

Visible errors or not, though, it is definitely a problem that Vector does not reconnect when the connection drops. You are correct that we should implement some logic to attempt to reconnect, and potentially to reattempt delivery of a message.
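As a rough sketch of what "reconnect and reattempt delivery" could look like, assuming a bounded number of attempts per message: the types below (`MockConnection`, `ReconnectingSink`, `SendError`) are hypothetical and do not correspond to Vector's or Lapin's actual code, and the mock reconnect always succeeds, which a real one would not.

```rust
/// Hypothetical broker connection; `alive` models whether the socket still works.
struct MockConnection {
    alive: bool,
}

#[derive(Debug, PartialEq)]
enum SendError {
    ConnectionLost,
}

impl MockConnection {
    fn publish(&self, _msg: &str) -> Result<(), SendError> {
        if self.alive { Ok(()) } else { Err(SendError::ConnectionLost) }
    }
}

/// Hypothetical sink wrapper that reconnects and retries on failure.
struct ReconnectingSink {
    conn: MockConnection,
    max_attempts: u32,
    reconnects: u32,
    delivered: Vec<String>,
}

impl ReconnectingSink {
    fn new(max_attempts: u32) -> Self {
        Self {
            conn: MockConnection { alive: true },
            max_attempts,
            reconnects: 0,
            delivered: Vec::new(),
        }
    }

    /// Stand-in for re-dialing the broker. A real implementation would open a
    /// new connection and channel, and that step could itself fail.
    fn reconnect(&mut self) {
        self.conn = MockConnection { alive: true };
        self.reconnects += 1;
    }

    /// Tries to publish up to `max_attempts` times, reconnecting after each failure.
    /// If every attempt fails, the error is returned rather than swallowed.
    fn send(&mut self, msg: &str) -> Result<(), SendError> {
        let mut last_err = SendError::ConnectionLost;
        for _ in 0..self.max_attempts {
            match self.conn.publish(msg) {
                Ok(()) => {
                    self.delivered.push(msg.to_string());
                    return Ok(());
                }
                Err(e) => {
                    last_err = e;
                    self.reconnect();
                }
            }
        }
        Err(last_err)
    }
}
```

With `max_attempts` set to 2, a message sent after the broker restarted while the sink was idle fails once, triggers a reconnect, and is delivered on the second attempt. With `max_attempts` set to 1 the same message is reported as an error.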

vnagendra commented 11 months ago

I will try to reproduce after the break. My guess is that the following happened (just going by memory of the code - my setup is not much different from the wiki and I am using a standard vector/alpine container):

You were sending data continuously
You restarted Rabbit, which put the system into an "error" state, and hence the system continued to throw errors

In my case, what happened is the following:

I sent a bunch of events - the system was happy as rabbits were happy
Then I stopped sending events and restarted Rabbit
Vector had no idea that rabbits were unhappy for a brief period because there were no events to process (none at all - not just the rabbit sink, the whole system was getting no events)
The connection was lost but the system never knew. When I sent events again AFTER the Rabbit restart, the system couldn't figure out that there was a connection problem and silently dropped the messages.

This is just a guess, but I'll try to reproduce. Either way, I am glad we agree that we should handle it. I don't think it is too complex to handle. Any chance we can sneak this fix in soon? :)

StephenWakely commented 11 months ago

I need to look into it more closely, but it's possible a connection pool might help here: https://crates.io/crates/deadpool-lapin
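For illustration only, this is the general idea behind a pool that validates connections on checkout. The names (`Pool`, `PooledConn`, `get`, `put_back`) are hypothetical and this is not deadpool-lapin's API; it just shows how a stale connection could be discarded and replaced instead of reused.

```rust
use std::collections::VecDeque;

/// Hypothetical pooled connection; `healthy` models the result of a status check.
struct PooledConn {
    id: u32,
    healthy: bool,
}

/// Hypothetical pool that health-checks idle connections before handing them out.
struct Pool {
    idle: VecDeque<PooledConn>,
    next_id: u32,
}

impl Pool {
    fn new() -> Self {
        Self { idle: VecDeque::new(), next_id: 0 }
    }

    /// Stand-in for dialing the broker; a real pool would open a connection here.
    fn create(&mut self) -> PooledConn {
        let conn = PooledConn { id: self.next_id, healthy: true };
        self.next_id += 1;
        conn
    }

    /// Returns an idle connection only if it still passes the health check.
    /// Stale connections are dropped, and a new one is created if none remain.
    fn get(&mut self) -> PooledConn {
        while let Some(conn) = self.idle.pop_front() {
            if conn.healthy {
                return conn;
            }
            // Unhealthy connection: fall through and let it be dropped.
        }
        self.create()
    }

    /// Returns a connection to the idle queue for later reuse.
    fn put_back(&mut self, conn: PooledConn) {
        self.idle.push_back(conn);
    }
}
```

In the idle-then-restart scenario described above, the connection sitting in the pool would fail the check on the next checkout, so the first message after the restart would go out on a fresh connection.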
