Open vnagendra opened 12 months ago
I'm not seeing this behaviour regarding silent errors. When I restart RabbitMQ, Vector very clearly errors with:
2023-11-21T12:35:29.232549Z ERROR lapin::io_loop: Socket was readable but we read 0. This usually means that the connection is half closed this mark it as broken
2023-11-21T12:35:29.232628Z ERROR lapin::io_loop: error doing IO error=IOError(Kind(ConnectionAborted))
2023-11-21T12:35:29.232653Z ERROR lapin::channels: Connection error error=IO error: connection aborted
message
2023-11-21T12:35:52.444637Z ERROR sink{component_kind="sink" component_id=out component_type=amqp}:request{request_id=2}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(DeliveryFailed { error: InvalidChannelState(Error) }) request_id=2 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2023-11-21T12:35:52.444731Z ERROR sink{component_kind="sink" component_id=out component_type=amqp}:request{request_id=2}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
Viewing through vector top, errors are also increasing.
I'm setting things up via this wiki page. Perhaps you have a different setup that makes a difference?
Visible errors or not, though, it is definitely a problem that Vector does not reconnect when the connection drops. You are correct that we should implement some logic to attempt to reconnect, and potentially reattempt delivery of a message.
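To make the reconnect-and-redeliver idea concrete, here is a minimal, std-only sketch. It is not Vector's or lapin's code: the `Transport` trait, `publish_with_reconnect`, and the `FlakyTransport` test double are all hypothetical names invented for illustration, and a real sink would wrap the lapin connection and channel behind something similar.

```rust
/// Hypothetical abstraction over an AMQP connection plus channel.
/// A real implementation would wrap lapin here; this trait is an assumption.
trait Transport: Sized {
    type Error;
    /// Open a fresh connection.
    fn connect() -> Result<Self, Self::Error>;
    /// Publish a single payload.
    fn publish(&mut self, payload: &[u8]) -> Result<(), Self::Error>;
}

/// Publish `payload`; on failure, reconnect and retry up to `max_retries` times.
/// Returns the last error seen if every attempt fails.
fn publish_with_reconnect<T: Transport>(
    transport: &mut T,
    payload: &[u8],
    max_retries: usize,
) -> Result<(), T::Error> {
    // First attempt on the existing connection.
    let mut last_err = match transport.publish(payload) {
        Ok(()) => return Ok(()),
        Err(e) => e,
    };
    for _ in 0..max_retries {
        match T::connect() {
            Ok(fresh) => {
                // Replace the broken connection, then re-attempt delivery.
                *transport = fresh;
                match transport.publish(payload) {
                    Ok(()) => return Ok(()),
                    Err(e) => last_err = e,
                }
            }
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

/// Test double: starts out broken, and every reconnect yields a healthy transport.
struct FlakyTransport {
    healthy: bool,
}

impl Transport for FlakyTransport {
    type Error = String;

    fn connect() -> Result<Self, String> {
        Ok(FlakyTransport { healthy: true })
    }

    fn publish(&mut self, _payload: &[u8]) -> Result<(), String> {
        if self.healthy {
            Ok(())
        } else {
            Err("invalid channel state".to_string())
        }
    }
}
```

With `max_retries` set to 0 this behaves like the sink described in this thread (the event is dropped on the first failure). Any real version would probably also want a backoff delay between reconnect attempts, which is left out here to keep the sketch short.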
I will try to reproduce after the break. My guess is that the following happened (just going by memory of the code - my setup is not much different than the wiki and I am using a standard vector/alpine container)
This is just a guess, but I'll try to reproduce. Either way I am glad we agree that we should handle it. I don't think it is too complex to handle. Any chance we can sneak this fix in soon? :)
I need to look into it closer, but it's possible a connection pool might help here. https://crates.io/crates/deadpool-lapin
Problem
I have a docker-compose which has a vector instance and an AMQP sink (RabbitMQ in my case).
My configuration says "healthcheck: True", so correctly at start Vector establishes a connection and checks the health of the connection. When I send messages to the sink, everything works as expected.
If I restart only the RabbitMQ container after the first few messages, and the RabbitMQ container comes back up correctly (within a few seconds - in my case that is 3 seconds) and I send messages after, I have observed the following behavior
My hypothesis is that the channel health is not checked after the clone call here? https://github.com/vectordotdev/vector/blob/master/src/sinks/amqp/sink.rs#L118
The library Lapin seems to implement this already here https://github.com/amqp-rs/lapin/blob/main/src/connection_status.rs#L66. Maybe we just need to check it?
I don't see any reason this would be different in AWS ECS or any other topology. I am happy to test it if you feel it is necessary to replicate it somewhere else.
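The hypothesis above, checking the channel's health right after cloning it, could look roughly like the following. This is a std-only sketch under stated assumptions: `FakeChannel`, `checked_clone`, and `PublishError` are hypothetical stand-ins, not lapin's or Vector's types. The only thing borrowed from the linked lapin source is the idea of a `connected()`-style status query.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a channel handle (not lapin's API).
/// Clones share one flag, so every clone sees the same connection state.
#[derive(Clone)]
struct FakeChannel {
    connected: Arc<AtomicBool>,
}

impl FakeChannel {
    fn new() -> Self {
        FakeChannel {
            connected: Arc::new(AtomicBool::new(true)),
        }
    }

    /// Status query, modelled on the kind of check the linked lapin code suggests.
    fn connected(&self) -> bool {
        self.connected.load(Ordering::SeqCst)
    }

    /// Simulate the broker going away (e.g. a RabbitMQ restart).
    fn break_connection(&self) {
        self.connected.store(false, Ordering::SeqCst);
    }
}

#[derive(Debug, PartialEq)]
enum PublishError {
    ChannelNotConnected,
}

/// Clone the channel for a request, but refuse to hand out a clone
/// whose underlying connection is already gone.
fn checked_clone(channel: &FakeChannel) -> Result<FakeChannel, PublishError> {
    let cloned = channel.clone();
    if cloned.connected() {
        Ok(cloned)
    } else {
        Err(PublishError::ChannelNotConnected)
    }
}
```

A caller that gets `ChannelNotConnected` back would then know to rebuild the connection before publishing, rather than failing later with an invalid channel state.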
Configuration
Version
vector 0.34.1 (x86_64-unknown-linux-musl 86f1c22 2023-11-16 14:59:10.486846964)