shayhatsor / zookeeper

Apache ZooKeeper .NET async Client
https://nuget.org/packages/ZooKeeperNetEx/
Apache License 2.0

ConnectionReset exception #45

Open HolyPrapor opened 3 years ago

HolyPrapor commented 3 years ago

This client was used for a long time on Windows without any issues. A couple of months ago we started using it on .NET Core and tested it on both Linux and Windows.

In our project we use ZooKeeperClient a lot to read nodes and set watchers.

The Windows version works flawlessly. However, the Linux version throws a "Connection reset by peer" exception. I investigated this problem and read the ZooKeeper logs, and found that ZooKeeper didn't reset its connection. I didn't capture any TCP dumps, but I'm pretty sure there are no TCP RST packets.

Upgrading to .NET 5 makes the situation even worse (ConnectionLossExceptions appear more often).

I decided to dig deeper into the ZooKeeperClient code. I found a check which causes a falsely detected connection loss.

Unfortunately, I was not able to determine what causes this effect or how to reproduce the problem. It looks like a problem with sockets on Linux.

Removing this check solves the problem.

Also, this client sends KeepAlive pings anyway, so if there IS a real connection loss, we will find out about it soon (either the next time we try to send something or on the next ping).

HolyPrapor commented 3 years ago

According to SO, the most reliable way to check whether a socket is connected is to call the socket's Poll method and check whether there are any bytes available to read. This PR resolves the issue with sockets.
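A minimal sketch of the Poll-plus-available-bytes check described above. This is not the code from the PR; the names `SocketProbe` and `IsLikelyConnected` are hypothetical and used only for illustration. It relies on `Socket.Poll` with `SelectMode.SelectRead` reporting true both when data is waiting and when the peer has closed the connection, so "readable but zero bytes available" is treated as a disconnect.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper, not part of ZooKeeperNetEx.
public static class SocketProbe
{
    // Returns false when the socket looks closed by the peer, true otherwise.
    public static bool IsLikelyConnected(Socket socket)
    {
        try
        {
            // Poll with a zero timeout so the call never blocks.
            // SelectRead is true if data is readable OR the connection
            // has been closed, reset, or terminated.
            bool readable = socket.Poll(0, SelectMode.SelectRead);

            // Readable with nothing to read means the peer closed the connection.
            bool nothingToRead = socket.Available == 0;

            return !(readable && nothingToRead);
        }
        catch (SocketException)
        {
            // Any socket error during the probe is treated as a disconnect.
            return false;
        }
        catch (ObjectDisposedException)
        {
            // The socket was already closed locally.
            return false;
        }
    }
}
```

A probe like this only reflects what the local TCP stack already knows, so it can still report a dead connection as alive until a send or a KeepAlive ping fails.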

MatsKarlsson commented 3 years ago

I've been having the same problem. org.apache.zookeeper.KeeperException.ConnectionLossException started to appear more frequently on .NET Core 3, but when I try to upgrade to .NET 5 I get it all the time.

Using macOS Big Sur.

kuskmen commented 3 years ago

I am afraid we have also started facing the same issue. It would be nice if the PR could be reviewed and released, provided it solves the issue.

douggish commented 2 years ago

We upgraded from .NET Core 3.1 to .NET 6 and now see this very frequently when running within a Linux Docker container.

madelson commented 1 year ago

Can the code be changed to follow the guidance from the MSFT docs for checking whether a socket is connected?

// .Connect throws an exception if unsuccessful
client.Connect(anEndPoint);

// This is how you can determine whether a socket is still connected.
bool blockingState = client.Blocking;
try
{
    byte [] tmp = new byte[1];

    client.Blocking = false;
    client.Send(tmp, 0, 0);
    Console.WriteLine("Connected!");
}
catch (SocketException e)
{
    // 10035 == WSAEWOULDBLOCK
    if (e.NativeErrorCode.Equals(10035))
    {
        Console.WriteLine("Still Connected, but the Send would block");
    }
    else
    {
        Console.WriteLine("Disconnected: error code {0}!", e.NativeErrorCode);
    }
}
finally
{
    client.Blocking = blockingState;
}

Console.WriteLine("Connected: {0}", client.Connected);