Improve connection pool performance

dancannon commented 10 years ago

During my investigation of #125 I noticed that the performance of the connection pool is pretty bad method I use for closing connections does not work well with RethinkDB due to the reuse of connections for continue + end queries.

I have looked into removing the connection pool completely as most of the official drivers do not use connection pools however proper concurrency support is pretty important for a Go driver IMO.

mglukhovsky commented 10 years ago

@dancannon, great work on #125, I was just checking on the latest changes to the Go driver and found this related issue.

I wanted to point out that we're working on a connection pool implementation for the official RethinkDB drivers (see rethinkdb/rethinkdb#281), so you might want to look at that discussion (and chime in with your thoughts).

dancannon commented 10 years ago

Thanks @mglukhovsky ! I have been meaning to improve the performance for inserts for a long time and its great to finally get it sorted.

I have seen the thread but have not had time to properly read it yet. I will try to get round to it some time this week.

3d0c commented 10 years ago

Hi there. Not sure it's a right place and may be it's not a problem at all but my misunderstanding. But, i faced one more problem with connections. It looks like gorethink creates new connection for every query and very quickly it gets an too many open files error. MaxIdle, IdleTimeout, MaxActive don't affect.

The following example makes 2000 established connection in a minute.

package main

import (
    r "github.com/dancannon/gorethink"
    "log"
    "time"
)

var rs *r.Session

func main() {
    var err error

    rs, err = r.Connect(r.ConnectOpts{
        Address:     "127.0.0.1:28015",
        Database:    "imon",
        MaxIdle:     10,
        IdleTimeout: 20,
        MaxActive:   5,
    })

    if err != nil {
        log.Println(err)
        return
    }

    for {
        for i := 0; i < 50; i++ {
            _, err := r.Db("imon").Table("items").GetAllByIndex("url", "xxx").Run(rs)
            if err != nil {
                log.Println(err)
                return
            }
        }
        time.Sleep(time.Second * 1)
    }
}

dancannon commented 10 years ago

Hi,

I am aware of the issue and I am in the process of rewriting the connection pool code from scratch to fix these issues. I hope to have this finished ready for the next RethinkDB release.

That being said I will try to look into this issue this evening and see if I can come up with a solution.

dancannon commented 9 years ago

@3d0c apologies for this taking so long but I think I have fixed the issue that you were having. If you are still using the driver I would really appreciate it if you could try running your code with the develop branch and let me know if you still have this issue.

Please note that I have renamed the MaxActive field to MaxOpen.

kofalt commented 9 years ago

FWIW @dancannon I'm hoping to review the develop branch later tomorrow evening if that'll be of any help. Connection pooling strikes me as non-trivial so I figured an extra set of eyes would help.

Are the branch changes pretty self-explanatory? Mind giving a brief synopsis?

dancannon commented 9 years ago

Thanks @kofalt that would be great. Setting up a connection pool properly is definitely not easy to do, especially given how RethinkDB keeps connections open until all of the results have been fetched for a query.

I still haven't properly reviewed my changes to the connection pool as I am now working on improving performance of the driver elsewhere in preparation for the next release so please comment on anything you think might cause any issues or that could be improved.

I made a number of attempts at setting up a connection pool but in the end I decided to look at how the database/sql package uses a connection pool, I also looked at how @neumino processes responses from the database, this allowed me to fix (I hope) the issues with connections being closed at the incorrect times or not being closed at all.

The connection pool now has a MaxIdle field and a MaxOpen field which define the number of connections that can be used and when new connections can be created. When a connection is requested:

if too many connections are open wait for a connection to be released
if there are idle connections and numOpen < maxOpen then return an idle connection
otherwise create a new connection

The way that data is sent and received has also been changed. During the creation of the connection a goroutine is started which continuously reads from the connection and processes the response. When a new query is sent to the database a queryRequest is stored in a map which contains a channel used to return the response from the reading goroutine.

If you have any further questions let me know or we can talk these changes through if that would help.

kofalt commented 9 years ago

Apologies for the delay! I've been sick & miserable all week. After comparing the latest release with develop, my comments are:

I was taken off guard by the unnamed sync.Mutex in the Connection struct, how does that behave? (I haven't used that syntax feature in Go before.) I assume this is allowing you to just say c.Lock() later? link
The updates to c.outstanding and c.requests are not, together, atomic, does it matter that your CAS + map retrieval are not in the same mutex? link

Higher level comments:

Architecturally, I understand that you're storing a map of queries on c.requests to your internal queryRequest struct, which are then looked up by the identifying int64 key. To me this seems like a lot of shared access & mutex fiddling, all to shield end-users from the queryRequest. Why not have unexported fields on the Query and hold state that way? That would perhaps save a lot of request / response lookup matching. Maybe I'm missing something here.
It was difficult to compare the changes since there is (at the moment, necessarily) a lot of concurrency management intertwined with RDB operations. To me, this seems to imply an impedance mismatch - for example, maybe the Cursor struct shouldn't be multi-goroutine safe (I assume that's the purpose of the mutex on that struct & every member function), and less of the code base should be aware of the existence of a connection pool.
I'm a little worried about a pool implementation in a specific database driver. I assume from these changes that faith/pool wasn't getting the job done; is that an irreconcilable problem? I'd be sad if there were arbitrarily many pool implementations, each package-specific (though I understand doing so for pragmatic purposes). This may be the cause for the probable race condition I found below.

In the "not the way I would have done it, but not necessarily wrong" category, I think I would have implemented connection -> query dispatch very differently. If there's going to be an RDB specific pool, are you sure you're using channels to their maximum effectiveness? Us gophers celebrate these perhaps a bit too much, but in this case they might help simplify some of the more fiddly bits.

Spitballing here: how about a single goroutine that's left running to handle the entire pool mess? You could have a few different channel types, each holding a response channel if necessary, and basically have a single thread fulfilling async requests to provide & return connections. Mutex-ish logic is then much more implicit & easier to manage.


// data types as example only, no mutex shielding required as one goroutine has access
free  := []Connection
inUse := []Connection
...

timer := time.NewTimer(idleTimeout)

for { select {

    case newC := <- getConnectionChan:
        con = GetOrCreateConnection()
        newC.response <- con

    case oldC := <- returnConnectionChan:
        ReturnToFreePool(oldC)

    case <- timer.C:
        ScanForAndCloseIdleConnections()

    case data := <- stopChan:
        // Someone called connection.Close(), time to clean up
}}

There are some obvious problems with this suedocode, additional cleverness required - opening a connection (or waiting on a connection to be returned) blocks other requests for connections, idle timeouts wouldn't be exact (waah), and any goroutine panics would need to return their connection! But if this works, then pretty much all your logic could be ignorant of concurrency issues - everything would operate on an open channel and return it when it's done. Basically, hiding an event loop from the rest of your code.

I have not digested the full code base, so some of my remarks may be off-base. Feel free to ignore :)

I ran a quick performance test of a web application using the driver. This is sloppy and unscientific, with each request making two read queries (the same two each time, one a trivial join and another a lookup by ID). I used the popular boom load generator, an ApacheBench replacement in go. Both tests ran 10,000 requests with 100 at a time, on an older quad-core 3.5 Ghz i7 with SSD.

On e590ff2, tagged release v0.5.0, using connection options I've used before:

r.ConnectOpts{
    Address: "localhost:28015",
    Database: "test",

    // Totally arbitrary connection pool options.
    // By default there is only a single connection available.
    MaxActive: 100,
    MaxIdle: 10,
    IdleTimeout: time.Second * 10,
}

Load test results:

$ boom -n 10000 -c 100 -m GET http://127.0.0.1:8080/query

Summary:
  Total:    5.1286 secs.
  Slowest:  0.1569 secs.
  Fastest:  0.0087 secs.
  Average:  0.0510 secs.
  Requests/sec: 1949.8347
  Total Data Received:  1420000 bytes.
  Response Size per Request:    142 bytes.

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.009 [1] |
  0.024 [603]   |∎∎∎∎∎∎∎∎
  0.038 [2357]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.053 [2828]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.068 [2326]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.083 [1189]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.098 [445]   |∎∎∎∎∎∎
  0.112 [176]   |∎∎
  0.127 [59]    |
  0.142 [10]    |
  0.157 [6] |

Latency distribution:
  10% in 0.0267 secs.
  25% in 0.0359 secs.
  50% in 0.0491 secs.
  75% in 0.0629 secs.
  90% in 0.0775 secs.
  95% in 0.0878 secs.
  99% in 0.1096 secs.

Meanwhile the rethinkDB instance (a local 1.15.0 install with default options) had no problem with the ~12k reads/second load: rethink-1

On 843187d, current tip of develop, same ConnectOps with MaxActive renamed to MaxOpen:

$ boom -n 10000 -c 100 -m GET http://127.0.0.1:8080/query

Summary:
  Total:    3.7708 secs.
  Slowest:  0.1107 secs.
  Fastest:  0.0062 secs.
  Average:  0.0375 secs.
  Requests/sec: 2651.9716
  Total Data Received:  1419208 bytes.
  Response Size per Request:    141 bytes.

Status code distribution:
  [500] 18 responses
  [200] 9982 responses

Response time histogram:
  0.006 [1] |
  0.017 [397]   |∎∎∎∎∎
  0.027 [2466]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.038 [2845]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.048 [2100]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.058 [1183]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.069 [576]   |∎∎∎∎∎∎∎∎
  0.079 [266]   |∎∎∎
  0.090 [114]   |∎
  0.100 [44]    |
  0.111 [8] |

Latency distribution:
  10% in 0.0203 secs.
  25% in 0.0258 secs.
  50% in 0.0349 secs.
  75% in 0.0461 secs.
  90% in 0.0585 secs.
  95% in 0.0673 secs.
  99% in 0.0841 secs.

On the RDB side, no problems here: rethink-2

Notice by the status code responses of the second test that there were some 500 failures returned. It looks like the current tip of develop has some race conditions, as the error was always: gorethink: read tcp 127.0.0.1:28015: use of closed network connection. Unfortunately the stack trace is no good, it looks like it comes from the stdlib. I imagine this should (non-deterministically) be reproducible by other heavily-concurrent tests.

This seems to indicate a need for a CI-triggered concurrency "hit it with a hammer" test; at some future date I could definitely submit a PR that does something like the above.

I re-ran this load test on v0.5.0 and develop several times, and couldn't get v0.5.0 to crash while develop dependably did, so I'm pretty sure it's the connection changes.

TL;DR - race condition likely found, might be a good idea to isolate concurrency code from query code.

dancannon commented 9 years ago

Thanks for reading through the code @kofalt, I have had a quick read through your comments and you have raised a couple of really great points and thanks for pointing out the race condition. I definitely think the driver need better concurrent tests.

Unfortunately I have been quite busy with work this week but ill try to have a more thorough read through your comments soon and answer some of the questions you had.

kofalt commented 9 years ago

No problem - this stuff is complex and I 100% appreciate how much effort you've put into this project!

From my review I immediately had the idea of implementing that event-loop-manager pattern as a generic library (that could handle any resource; users would have to cast back from interface{} because lol-golang) and possibly file a PR that tries to make gorethink use it.

An ambitious idea, and one I am not at all sure I have time for, unfortunately ;/

If you have a moment, could you summarize what motivated the removal of fatih/pool? I haven't been through that project's code, but it might help me understand the problem better.

dancannon commented 9 years ago

My decision to move away from using [fatih/pool] was due to the difficult managing the connection throughout a query. This became more of an issue with the 2.0 release as it further simplified the API of the connection pool.

I think it is also worth mentioning that the connection pool that I now intend to use is a modified version of the connection pool used inside database/sql which should hopefully minimise the amount of maintenance required.

dancannon commented 9 years ago

Hey @kofalt, this time its my time to apologise for the delay, unfortunately its for the same reason as you. Anyway here is my more indepth answer to your questions:

You are completely correct about being able to call c.Lock(), this functionality is called "embedding" and you can read more about it here
Good point regarding the atomic operations, in this case I dont think that it would cause any problems due to how the two values are used however I will have a closer look into whats going on myself.
I think the answer to your next three points can be answered at the same time (if I have misunderstood your points please let me know)

One of the main reasons for me deciding to redo the connection pool was that I wanted to tidy up how results were prefetched, this relatively small requirement meant that the driver had to be able to asynchronously process query responses. This is much easier for the official drivers which for the most part are all not safe for concurrent use or send queries synchronously. If you have a look at rethinkdbdash then you will notice that there is also a lot of code for managing the responses. That being said I think you might be correct regarding the cursor code. Removing the locks and concurrency management code could simplify the driver greatly and you probably shouldn’t be doing multiple actions on the cursor concurrently anyway.

While I think your point about not having a new pool implementation makes sense and this is why I initially attempted to use the faith/pool package I think that the fact that trying to get the pool to do things it was not initially designed to do means in the long run it makes more sense just to implement a pool internally. Also due to the fact that gorethinks pool is now heavily inspired by the database/sql package (which is created by people far cleverer than me) I don’t think it will cause any significant issues. This is also my reasoning behind not using a channel based solution (This was also a solution I initially investigated but after reading some posts on the mailing list which suggested against it I decided to go with the current solution, unfortunately I cannot remember what posts I read).
Finally thank you very much for creating this benchmark it is a great idea to test the driver this way, would you be ok with sharing the full code you used? I fully agree with your idea of coming up with a test than can be run on Wrecker and this is something I will try to get sorted before the I release the next version. Until then I will look into replicating the race condition locally and see if I can fix the issue.

I would also like to say just how much I appreciate you taking the time to look at my changes and provide such an in-depth review, it really does make a huge difference.

kofalt commented 9 years ago

you probably shouldn’t be doing multiple actions on the cursor concurrently anyway

+100. Maybe I'm being uncreative, but I cannot imagine concurrently pulling results out of a cursor (in a random order no less). And if someone wants to do that, make them mutex it out :)

Sounds like we're in agreement about those parts. Easy targets for simplification.

trying to get the pool to do things it was not initially designed to do

This is the part I'm least clear on; can you help me understand what goRT needed that it didn't have?

heavily inspired by the database/sql package

Entirely reasonable. I didn't know that in my initial review, and that sounds like a smart idea.

I'm much more comfortable knowing that there was a strong source to draw from. I get nightmares when I imagine folks dreaming up a new mutex-using library in the dead of the night :P

This is also my reasoning behind not using a channel based solution

At this point I'd only attempt a channel-based version if this current branch becomes insanely irritating. Goroutines and channels would be using the same locking concepts under the hood, after all.

The main advantage to be gained there is developer cognitive load, not necessarily better performance.

would you be ok with sharing the full code you used?

Unfortunately the benchmark I ran was on a larger application with a billion libraries and a bunch of crazy stuff that would both be weird to publish, and also distract from the core issue of benchmarking + sniffing for race conditions.

What I can do is cobble together a small repository & script that features a minimal webserver using the same principles which can be hit with the same boom command. Unfortunately I can't promise an ETA on that right now... at some future point we can maybe massage it into a PR.

I'll update the thread when I have something to try out!

dancannon commented 9 years ago

The change that made me move away from fatih/pool was when the Put function was replaced with Close this meant that the only way to return a connection to the pool was to call "close" the connection and this gave me no way to explicitly close specific connections.

Also after reading the comment thread @mglukhovsky mentioned above I realised that I wanted full control over what was happening internally. For example I soon want to add the ability to connect to a RethinkDB cluster and let the driver automatically handle which hosts should be connected to. My thinking was that trying to get the fatih/pool library to handle this use case would be more trouble that its worth.

Dont worry about providing the benchmark code, I was just hoping you had a simple script that you were running. I will get something working and add it to the test suite.

kofalt commented 9 years ago

Sounds good. Let me know if I can be of assistance.

dancannon commented 9 years ago

Hi @kofalt,

I have spent the last couple of days having another look at the connection pool code and I have made quite a few significant changes. The key change I made was simplifying the logic in the Connection type and removing much of the concurrency as this was causing many problems due to the complexity involved, perhaps I will revisit that idea in the future but for now I am quite happy with the result.

Also included some changes to how results are iterated and decoded which I hope should slightly improve performance.

During testing I created a set of simple test scripts for testing the driver(https://gist.github.com/dancannon/245ee2b3f7018ca80981) but if you could run your load test script with the feature/performance_improvements branch that would be greatly appreciated.

Thanks, Daniel

kofalt commented 9 years ago

That's great to hear!

I hope to have feedback for you within a few days.

kofalt commented 9 years ago

First I re-ran the test on e590ff2, tagged release v0.5.0, and confirmed that I've got roughly the same performance with old code, so my test setup looks good.

When I went to go test the new code, I got a lot of errors:

PANIC at GET /query
gorethink: Table `test.tablename` does not exist. in: 
r.Table("tablename").Get(...)

It looks like the new branch breaks the Database option in ConnectOpts. I temporarily patched my code to add a Db("dbname") term as a workaround for now. Are there unit tests covering that option?

Meanwhile, on cda3566 in the feature/performance_improvements branch, with my connection settings from above renamed to MaxOpen and Timeout as per the new options stuct:

$ boom -n 10000 -c 100 -m GET http://127.0.0.1:8080/query

Summary:
  Total:    5.6916 secs.
  Slowest:  0.1591 secs.
  Fastest:  0.0113 secs.
  Average:  0.0567 secs.
  Requests/sec: 1756.9889
  Total Data Received:  1490000 bytes.
  Response Size per Request:    149 bytes.

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.011 [1] |
  0.026 [323]   |∎∎∎∎
  0.041 [2129]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.056 [2998]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.070 [2155]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.085 [1386]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.100 [648]   |∎∎∎∎∎∎∎∎
  0.115 [265]   |∎∎∎
  0.130 [57]    |
  0.144 [18]    |
  0.159 [20]    |

Latency distribution:
  10% in 0.0326 secs.
  25% in 0.0411 secs.
  50% in 0.0532 secs.
  75% in 0.0695 secs.
  90% in 0.0854 secs.
  95% in 0.0960 secs.
  99% in 0.1143 secs.

That was with MaxOpen: 100, MaxIdle: 10, I tried again reducing MaxOpen to 10 to be under the boom concurrency settings of 100 simultaneous requests. After lowering MaxOpen as described, I also ran my load test several times to see if I could reproduce the race condition I found earlier.

Charting a quick comparison:

Percentile	v0.5.0 (max 100 open)	cda3566 (max 100)	cda3566 (max 10)
10%	0.0267	0.0326	0.0326
25%	0.0359	0.0411	0.0412
50%	0.0491	0.0532	0.0539
75%	0.0629	0.0695	0.0701
90%	0.0775	0.0854	0.0871
95%	0.0878	0.0960	0.0990
99%	0.1096	0.1143	0.1299

That last column I think is the most relevant one; your new code shows approximately the same latency when juggling only ten open connections as it does with 100 open. Note that I haven't observed a case where this is significantly faster; if there is one it's beyond my rather-trivial test cases.

The comparison diff since we last spoke is pretty large; I'll try to review this soon.

In summary: performance about the same as v0.5.0, race condition no longer found, small bug with ConnectOpts found, code review pending (aiming for today or tomorrow).

I should also be able to get back to you with any future performance tests without delay.

kofalt commented 9 years ago

Also keep in mind that my tests are obscured by HTTP connections and all the insanity that implies; if you have raw-query tests that show a performance increase beyond v0.5.0 I would probably trust those more. My test is probably most useful for connection pool sanity-checking.

dancannon commented 9 years ago

Thanks for running your benchmark, its a shame the performance has not really improved but at least it looks like the race conditions have been fixed.

Thanks for finding the issue with the database argument, ill create a test and fix the issue. Unfortunately I will not be able to sort that out today as the power has gone out. Ill get to it as soon.

dancannon commented 9 years ago

Another update, I have fixed the issue with the connection arguments not properly being passed with the query.

Regarding the performance, with these latest changes I realised that fixing some of the issues and creating a more stable API are probably more important so while the performance has degraded slightly I am happy with the results. Especially since the simpler connection/cursor will hopefully make it easier to make changes in the future.

That being said before releasing these changes I will do some checking to make sure my recent changes haven't introduced any new issues which might have caused the decreased performance.

Does that seem reasonable?

kofalt commented 9 years ago

Entirely! What I have glanced at in my github change feed has looked reasonable so far.

I now guilt-riddenly realize I didn't follow up with that second code review I promised; is all the latest stuff still in the feature/performance_improvements branch?

Out of curiosity, what were the main performance bottlenecks you were originally hoping to attack? As I mentioned my read-heavy tests & app usage in general have continued to be reasonable with both the stable & feature code.

I'll try really hard to get you feedback within 48 hours :P

dancannon commented 9 years ago

Don't worry about it, the driver is just a side project for me so I completely understand not having the time to review my changes. I greatly appreciate the work you have done so far, it has helped a huge amount.

Originally I had hoped to:

Improve the decoding performance (database map results > structs) using the techniques I learned when working on the encoding code (user types/structs > database ready maps)
Fix the stability issues with the pools and connections. There were a couple of reports of connections being incorrectly closed. This was mainly due to race conditions with the prefetching and concurrent operations inside the connections.
Reduce the load on the garbage collector, I looked into this issue but have decided to leave this item until the next version.

The latest changes are now in feature/performance_improvements I will try to avoid changing the branches again :). If you could focus any code review on the stability of the code rather than the performance for now as this issue has been open for quite some time and I would rather get this sorted ready for the next RethinkDB version.

kofalt commented 9 years ago

I last compared 843187d, so when comparing to feature/performance_improvements I used the latest commit e49e541; comparison link here.

Comments:

I support refactoring mutex & pool logic out of the Connection struct. The code looks a lot cleaner as a result.
Probably at some future date, it would be helpful if you added a few short explanatory comments on unexported and internal functions like processAtomResponse, it would help folks like myself understand & contribute :)
The unexported queue struct appears to be an array-backed FIFO queue implementation with custom growth logic (oh, generics, how we yearn for thee). If I recall correctly, it can be easier to let golang slices handle this, as demonstrated in this gist comment.
There is a decent amount of decode refactoring that I'm largely taking on faith. Nothing jumped out at me during a quick skim.
I can see the complexity problems you mentioned earlier with decoding dependencies. I'm not 100% certain I followed the logic of Pool correctly, if that worries you we should probably hop in IRC sometime to discuss it in depth.
The new TestConcurrentSelectManyWorkers is aces, and could hopefully find race conditions such as the one I previously found with performance testing. Notably these suites of tests will probably need to be ran a lot to find something, so something like while true; do go test; done might be relevant...

This snippet will not work:

func (c *Cursor) loadNext(dest interface{}) (bool, error) {
    c.Lock()

    // ... logic that depends on c ...

    c.Unlock()
    err := c.fetchMore()  // fetchMore calls c.Lock()

Logic earlier in loadNext depends on state in the Cursor, such as c.buffer.Len(), but unlocking & relocking is a race condition by virtue of no longer having loadNext be a single atomic operation.

Golang sadly lacks reentrant locks, and Go community advice is generally condescending, wrong, and insane on this topic (an unfortunate pattern with some missing features). Neglecting to call c.Unlock() isn't an option either, resulting in a trivial deadlock.

Seeing as it's entirely reasonable to have struct helper functions that require exclusive access to itself, there's a few options for fixing this up, such as changing functions like fetchMore to not (un)lock, then make 500% sure you're only calling it from places you already hold the mutex. This appears to be the methodology you've adopted in the Pool struct with functions such as maybeOpenNewConnections.

Related: my below question asking if Cursor still needs a mutex for its operation.

Questions:

What is the purpose of c.conn.SetDeadline(time.Time{})? Is that effectively setting deadline to zero? net/http ref
I notice you still have a map[int64]*Cursor on the Connection struct, which is not protected by a mutex. The behaviour of a map when acted on concurrently in undefined, so I assume you are removing the assumption that a Connection can be accessed by more than one goroutine at a time (if so, I support this). Maybe it'd be a good idea to note this in the documentation?
If the above understanding is correct, is there still a need for a mutex on the Cursor struct? Myself, I would advocate someone wrap reading from a cursor with a goroutine + channels if they really needed concurrent access; I can't see the use case. Thoughts?
Um... could you help me understand why byte equivalence a la foldFunc is needed for caching fields? Bit startling haha.

Hitting it with my simple load tests, I couldn't get it to fall over or error out like last time. As usual, I kept my application the same, while renaming MaxActive to MaxOpen and IdleTimeout to Timeout (keeping them at 100 and 10 seconds, respectively).

The graphs are unsurprising enough to omit, but here's the new percentiles (on the left). Each of these used that version's equivalent of 10-100 connections with 10 second timeouts. The older max-10 column has been removed.

Percentile	v0.5.0	cda3566	e49e541 (latest)
10%	0.0267	0.0326	0.0287
25%	0.0359	0.0411	0.0352
50%	0.0491	0.0532	0.0448
75%	0.0629	0.0695	0.0562
90%	0.0775	0.0854	0.0694
95%	0.0878	0.0960	0.0791
99%	0.1096	0.1143	0.0953

Again, this is only benchmarking simple reads, so I wouldn't put too much stock in it. If anything, it confirms the performance floor has not unduly rising (if anything, it's a hair faster).

Overall, this looks good. The changes we've discussed previously upthread and in IRC seem to have worked out well.

TLDR: looks good, holds up under load, some race condition logic found in Cursor.

dancannon commented 9 years ago

Hey @kofalt, congratulations on the new job and thanks again for having another look at my changes!

First I'll try to cover some of your comments:

Good point, I have been intending to properly commenting the drivers code and ideally ensuring that it passes govet, golint, goimports. As part of this I hope to improve some of the internal documentation.
I agree the linked solution is simpler however I am going to ignore this for now since I do not wish to many any further changes to this branch that are not required and I think any code that reduces any GC load is a good thing. Hopefully I can revisit this in the future.
I am pretty confident with the changes to decoding since they the changes are similar to the changes I recently made to the encoding process. I found a couple of issues over the weekend regarding geospatial types that I need to fix but I think I will work on that separately.
Im not too worried with this right now, it looks like connections are much more stable than before so I think there are some benefits of this new approach.
Thanks for the idea!

Now for some answers to your questions:

This is just to reset the connection after use, and to ensure that the connection will not time out when it is idling in the pool
Correct the idea is that a connection can only be used by one goroutine at a given time. If you wish to use a single session concurrently then the pool can be used. I think this makes the most sense. Ill add some documentation to Connection before merging.
I suppose you are right, now that I have removed the asynchronous prefetching there is no need for locking the cursor. I will make sure this is correct and then remove the locks, I will also comment the Cursor to make sure the fact its not thread safe is clear.
Now this one is basically just because I cheated and copied a lot of the encoding code from encoding/json and this was the method they used for the field caching.

Once I have fixed these issues and assuming there are no further issues I will try to release the next version this weekend since the performance is looked pretty good (and hopefully quite stable) . That way things can hopefully get tested a bit before the next database release which should happen soon I believe.

kofalt commented 9 years ago

Awesome. Glad to hear it. I think it's definitely converging on something solid :)

kofalt commented 9 years ago

It looks like 1.16 has landed with a changefeed roughly as long as I am tall. If you need any help on this or any other issue, just tag me @kofalt in a thread somewhere and I'll swing by :sunglasses:

dancannon commented 9 years ago

Thanks but I think I have actually got the driver mostly updated. I just need to test things and get the release packaged up. 1.16 took much less time to update that I thought it would!

I hope to release the next version in the next couple of days.

rethinkdb / rethinkdb-go

Improve connection pool performance #130