Open hamburml opened 3 months ago
/cc @alesj (kafka), @cescoffier (kafka), @matejvasek (amazon-lambda), @ozangunalp (kafka), @patriot1burke (amazon-lambda)
AFAIK, kafka clients have no way of suspend/resume to support CRac. There is KIP-921 for that. In theory, suspend should close connections (not the client itself) and resume would need to reconnect to a node.
AFAIK, kafka clients have no way of suspend/resume to support CRac
That's sad but not what I mean. When my resumed Lambda sends a message via Kafka, I simply have an exception in the log, the Kafka client reconnects and tries the same message again. Nothing is lost. What is more annoying is the fact that the Lambda is often shutdown, so it starts with SnapStart, loads the state into memory and then we have the exception again. It feels dirty and unclean. I only need a simple method which tells kafka client to close all connections, I would call it in the beforeCheckpoint and maybe in afterRestore tell kafka to connection again.
btw AWS SnapStart only uses the interfaces for CRaC but is not a CRaC implementation. It is just very similar.
edit
In short:
In theory, suspend should close connections (not the client itself) and resume would need to reconnect to a node.
would be helpful :D
@ozangunalp What's the status here? Did we documented the pause/resume mechanism we discussed?
Describe the bug
Hi,
we use snapstart on our quarkus lambdas. Some of them use smallrye-messaging to write or receive messages from a kafka. This works as expected unfortunately in our logs we have some warnings that the connection to a kafka node was lost either to auth error or firewall blocking.
Afaik during the init phase the whole memory of a started quarkus lambda is stored and when the lambda is reused reloaded into the memory to skip the init phase. That also means that pooled connections are "stored" but in reality are already closed.
Now I thought i simply need to close all open kafka connections before the snapshot is created. I did this with a org.crac.Resource and the beforeCheckpoint method. Now the warnings in the log are gone but it looks like no new connections are initiated and therefore all messages send via a channel fail. I also used KafkaProducer::flush but that didnt help.
Any ideas?
I found https://github.com/quarkusio/quarkus/issues/31401 which is the same issue but with database connections.
Expected behavior
No response
Actual behavior
No response
How to Reproduce?
No response
Output of
uname -a
orver
No response
Output of
java -version
No response
Quarkus version or git rev
No response
Build tool (ie. output of
mvnw --version
orgradlew --version
)No response
Additional information
No response