mfenniak / rethinkdb-net

A C# / .NET client driver for RethinkDB.
Other
247 stars 37 forks source link

Concurrent async calls timing out #216

Open andoband opened 9 years ago

andoband commented 9 years ago

I am running up against connection errors (timeouts mostly) when using rethinkdb-net in an ASP.Net MVC app asynchronously.

public async Task<ActionResult> IndexPost()
{
    using (var conn = await ConnectionFactory.GetAsync())
        await conn.RunAsync(tUsers.Get("some-user-id"));            
    return Json(new { success = true });
}

If I call this from a webpage one-at-a-time, everything works fine, but if I throw a few concurrent calls at it then rethinkdb requests start timing out. When the timeout happens the rethink connection that timed out will fail each time it's used from that point on (sync or async). This only happens when I have the rethinkdb-net connection pool turned on, so it must be something to do with reusing a connection (or the underlying networking).

I set up a test project pinging the MVC action above via 10 concurrent ajax requests using jquery (although jquery or the browser seemed to be sending only around 6 calls simultaneously).

I found a simple workaround was to request the connection synchronously instead of asynchronously, i.e. ConnectionFactory.Get() instead of ConnectionFactory.GetAsync(). I could then successfully call conn.RunAsync(query) on that connection from then on out.

I dug into the rethink-db code and found that putting a ConfigureAwait(false) on Line 200 of Connection.cs solved the problem.

await tcpClient.ConnectAsync(endpoint.Address, endpoint.Port).ConfigureAwait(false);

I see in a previous commit (https://github.com/mfenniak/rethinkdb-net/commit/d22f3c49f92e80e495062b824a640d081607f3b1), that you added ConfigureAwaits for all awaits in Connection.cs but then removed them.

Maybe this is one that could be put back in?

Note: I was testing against a RethinkDB instance I set up on Azure (not a local install). Same behavior happened when the website was deployed to Azure as well.

mfenniak commented 9 years ago

Hi @andoband. Sorry for the slow response, but I've been on vacation and had limited Internet access.

The ConfigureAwait fix is likely not the correct solution. It was investigated while looking into issue #130, but, a more sophisticated solution that properly used the .NET synchronization contexts rather than discarding them was eventually developed.

Can you please provide a the test project demonstrating the problem that you mention using? That would help looking into this issue.

andoband commented 9 years ago

No worries on the "slow" response...I appreciate the reply.

I've put up the test project here. It's a page with two buttons...one for executing sync code and one for async code. If you run the Sync tests first, then the connection pool fills up with healthy connections. So to reproduce what I'm seeing (hopefully) just push the async button. When pushed it will send 20 requests to the server and report back the results.

I saw the changes you made in response to issue #130 using TaskUtilities.ExecuteSynchronously(), but when I've bumped into these issues in the past, the ConfigureAwait() approach was the recommended solution. But I'm no expert.

andoband commented 9 years ago

Unfortunately, I started tunneling my connection to the rethink server over ssh and I saw the timeouts start happening again...even with the ConfigureAwait(false) added to the tcpClient.ConnectAsync() call. There is a lot of information on stackoverflow and on Stephen Cleary's blog about how those deadlocks happen when you are synchronously waiting (blocking) on the result of an async call (like you experienced with issue #130), so I thought I'd try to find if that was happening in my situation.

I looked a little harder at Connection.cs, and I didn't find any explicit task.Result or task.Wait() calls, but the call to ReadLoop is synchronous and looked suspicious because of that. ReadLoop eventually ends up awaiting on the current SynchronizationContext with the call to Protocol.ReadResponseFromStream(). Since the call to ReadLoop (in DoTryConnect()) is not being awaited, execution continues and the newly created connection is passed back to user code where I immediately run a query asynchronously which awaits results on the same SynchronizationContext. I'm guessing the query request is holding onto the SynchronizationContext and the Protocol.ReadResponseFromStream() call is getting blocked because of it (or vice-versa)...but I could be wrong.

Having found that out, I moved the ConfigureAwait(false) from the tcpClient.TryConnect() call to the Protocol.ReadResponseFromStream() call. That did the trick, so maybe I found the cause of the issue.

Another solution I tried and which worked was to queue the call to ReadLoop onto the ThreadPool using ThreadPool.QueueUserWorkItem(_ => ReadLoop()). I'm not as fond of this solution as much since it may start pressuring the ASP.Net ThreadPool a bit, especially since ReadLoop queues work on it as well.

Something else I tried was bypassing ReadLoop altogether and leaving all that thread management to Asp.Net and the .Net Framework. In this case, I commented out the call to ReadLoop and instead just await the response normally after sending the request. In other words, in InternalRunQuery() I got rid of the TaskCompletionSource code, the use of tokenResponse dictionary, the spin lock, and replaced the return await tcs.Task; with return await Protocol.ReadResponseFromStream(stream, Logger); Doing that slims down the code and it worked great...at least for standard calls from an ASP.Net app.

I suspected that monitoring data changes might need that ReadLoop, so I ran the unit tests and sure enough, I had problems with the RealtimePushTests. But then I tried running the example console app and that worked! So I dunno. I'm connecting over a tunnelled SSH connection to an Azure server, so that might be complicating things.

I hope this info helps.

mfenniak commented 9 years ago

Hm... interesting analysis. Great investigation work. :-)

I don't believe it's 100% accurate to say that "the call to ReadLoop is synchronous". As it is an "async" method, invoking ReadLoop() returns a Task object immediately while the function is executed on the current SynchronizationContext. This behavior is actually questionable, though; depending on the SynchronizationContext, this might not execute the way the code is written to expect. We know exactly how we want ReadLoop to run (in an unaffiliated thread, without any synchronization), so the implicit use of the SynchronizationContext here is probably the source of the problem.

Your experiment to remove ReadLoop entirely unfortunately violates the current design of the Connection class; it is currently thread-safe and can execute multiple queries simultaneously, routing the results to the correct task objects when they are available. It certainly would simplify the Connection class to alter that design, but, I don't think it's quite what I'd want to do.

I'm thinking the best approach here would be to take control over how the ReadLoop is executed. My first thought is to create one thread per Connection and, once connected, run ReadLoop in that thread synchronously; that should definitely "fix" the problem. But, that will create one thread per connection which certainly isn't required.

The second thought is to take an approach similar to RethinkDb.TaskUtilities; free ourselves from the current SynchronizationContext and then spin up ReadLoop, and then restore the synchronization context. This would let ReadLoop be free-threaded regardless of the synchronization context used for creating the connection, which should address the issue. But it worries me a bit because the library is now clearly designed to work around the SynchronizationContext, and doesn't honour it like I would expect a library to.

I still haven't tried to reproduce your test case, but I'll try to do that, and then see if the second approach fixes it. If so, I'm marginally comfortable with that solution.

mfenniak commented 9 years ago

So, I've had trouble reproducing your failure case. I got your code to run fine, but the async button just always worked. I tried to increase the number of requests the webpage made from 20 to 200 and then 2000, but, never ran into the issue.

I've gone ahead and theorized the fix I'd like to make in a branch called issue_216. Is it possible that you could test this and see if it resolves the problem?

andoband commented 9 years ago

I tried out the issue_216 branch and it prevents the errors that I'm seeing. Awesome!!

Have you tried reproducing the issue by accessing a remote (across internet) db? That's when I experienced it, and since I don't have any issues when I'm running against a local machine, I'm guessing it's a latency issue.

I would still recommend the .Net built-in ConfigureAwait(false) mechanism for solving this, but I'll be happy with any fix. :)

Regarding running parallel queries on a connection...I haven't reported this yet, but I can't do that via ASP.Net either. For example:

using (var conn = factory.GetAsync())
{
    var t1 = conn.RunAsync(tUsers.Get("user id 1"));
    var t2 = conn.RunAsync(tUsers.Get("user id 2"));
    await Task.WhenAll(t1, t2);
}

That code will cause the process to take up all the processor time and never complete nor timeout. I have to kill the process. I haven't investigated why that happens, but at least in the ASP.Net scenario, the connection can only be used for one userland operation at a time. ReadLoop() perhaps?

Regardless of these issues Mathieu, thank you sooo much for the work you've put into creating this driver! I wouldn't have given RethinkDB a chance without it.