vmalloc opened this issue 2 years ago (status: Open)
FYI @Geal, I managed to reproduce this very easily with a single producer and a single consumer against a locally running Pulsar (I used https://github.com/vmalloc/pulsar-cli to do that):
```
[2022-02-15T10:28:06Z DEBUG pulsar::connection] Connecting to pulsar://127.0.0.1: 127.0.0.1:6650
[2022-02-15T10:28:06Z DEBUG pulsar::consumer] consumer engine stopped: Err(ServiceDiscovery(Connection(Io(Os { code: 61, kind: ConnectionRefused, message: "Connection refused" }))))
[2022-02-15T10:28:06Z ERROR pulsar::consumer] could not close consumer Some("pulsar-cli")(14550240535650196156) for topic test2: Disconnected
```
After that, the consumer stream continuously yields `None` as the message, even though the server recovers pretty quickly.
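For reference, the consumer side of my repro is roughly equivalent to the following sketch against pulsar-rs (pulsar-cli just wraps the library; the subscription name and payload type here are illustrative, not necessarily what the CLI uses):

```rust
use futures::TryStreamExt;
use pulsar::{Consumer, Pulsar, TokioExecutor};

// Rough equivalent of the consumer side of the repro; subscription name
// and payload type are placeholders.
async fn repro_consumer() -> Result<(), pulsar::Error> {
    let pulsar = Pulsar::builder("pulsar://127.0.0.1:6650", TokioExecutor)
        .build()
        .await?;

    let mut consumer: Consumer<Vec<u8>, _> = pulsar
        .consumer()
        .with_topic("test2")
        .with_subscription("test-sub")
        .build()
        .await?;

    // After the error above, the stream yields `None` here (and keeps doing
    // so on subsequent polls), so consumption simply stops even though the
    // broker is reachable again shortly after.
    while let Some(msg) = consumer.try_next().await? {
        consumer.ack(&msg).await?;
    }
    Ok(())
}
```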
@Geal I'm considering diving into this issue myself to help resolve it. Just making sure - will you be available for PR approval if the need arises?
@vmalloc Hi! I'll be available for PR approval if you can resolve this issue! Thanks
Somewhat related to #164, I'm experiencing a similar hang, but on the consumer side. Using the Pulsar consumer as a `Stream`, it seems that after certain occasions in which ZK nodes fail (in my case after experiencing errors), the Pulsar cluster recovers just fine, but the client becomes stuck, hanging the entire task that called `next()`. This is especially severe because running `try_next` in a `select!` loop, as I do, causes the entire select branch to hang, meaning the reactive loop stops responding to waking futures.
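To make the pattern concrete, here is a rough sketch of the kind of loop I mean (not my exact code; the `shutdown` channel and the payload type are just placeholders):

```rust
use futures::TryStreamExt;
use pulsar::{Consumer, TokioExecutor};

// Sketch only: `shutdown` stands in for the other futures my select! drives,
// and message handling / error handling are simplified.
async fn consume_loop(
    mut consumer: Consumer<Vec<u8>, TokioExecutor>,
    mut shutdown: tokio::sync::mpsc::Receiver<()>,
) -> Result<(), pulsar::Error> {
    loop {
        tokio::select! {
            // After the ZK failure, this is the future that never resolves again.
            maybe_msg = consumer.try_next() => {
                match maybe_msg? {
                    Some(msg) => {
                        // ack + process the message here (omitted in this sketch)
                        let _ = msg;
                    }
                    None => break, // stream ended
                }
            }
            _ = shutdown.recv() => break,
        }
    }
    Ok(())
}
```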
The logs don't say much, except for these lines that get repeated several times around the time of the hang:
Restarting the service altogether (forcing a complete reconnect) fixed it completely, so I guess there must be a bug in the reconnection process here.
EDIT: I'm using the latest version of the client library (4.1.1)