mediocregopher / radix.v2

Redis client for Go
http://godoc.org/github.com/mediocregopher/radix.v2
MIT License

Cluster hangs for a long time because everything must go through cluster.spin #51

Closed zhaoyuxi closed 6 years ago

zhaoyuxi commented 7 years ago

My service opens more than 100 connections to each Redis cluster node. Creating one pool with authentication can take more than a minute, and it holds cluster.spin for that whole time. Since my service talks to more than 10 Redis servers, a 'Cmd' call can be blocked for a very long time, which is unacceptable for my service. If the network or the Redis servers have a failure, it becomes much worse.

A single 'spin' loop serialized over a channel is simple, but its performance is poor: everything funnels through it, so it is the bottleneck, and the 'Pool' implementation is built on the same pattern.
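For context, the single-goroutine dispatch pattern being described looks roughly like this (a simplified sketch with illustrative names, not the actual radix.v2 internals); any slow callback, such as building a large authenticated pool, stalls every other caller queued on the channel:

```go
package main

import "fmt"

type cluster struct {
	callCh chan func()
}

// spin serves every request one at a time. If one callback is slow
// (e.g. creating 100 authenticated connections), all later Cmd calls
// queue up behind it on callCh.
func (c *cluster) spin() {
	for f := range c.callCh {
		f()
	}
}

// Cmd funnels all work through callCh, so it inherits spin's
// serialization and can block for as long as any earlier callback runs.
func (c *cluster) Cmd(cmd string) {
	done := make(chan struct{})
	c.callCh <- func() {
		fmt.Println("running", cmd) // placeholder for the real work
		close(done)
	}
	<-done
}

func main() {
	c := &cluster{callCh: make(chan func())}
	go c.spin()
	c.Cmd("GET foo")
}
```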

My current solution:

  1. Modify the 'Pool' implementation to use a list guarded by sync.Mutex, so connections can be taken in either first-in-first-out or last-in-first-out order. 'Ping' walks the pool in FIFO order so that every connection gets checked, while 'Cmd' takes connections in LIFO order, so commands always run on the most recently active connections. With the current FIFO behavior, 'Cmd' can receive a connection that has sat idle for a very long time and is therefore far more likely to be dead. (A sketch of such a pool follows the heartbeat code below.)
  2. Give 'Pool' an 'initial size', an 'idle size', and a 'limit size', so the initial size can be kept very small to reduce pool creation time.
  3. My heartbeat implementation looks like this:

```go
// heartBeat keeps the connections active. All pools share the same goroutine.
func (c *Cluster) heartBeat() {
	for {
		select {
		case <-c.pingTicker.C: // one-second timer
			allConnNum := c.getConnNum()
			pools := c.getAllPools()
			for _, pool := range pools {
				if pool.Avail() < c.o.IdleSize {
					// lazily create a connection to refill the pool
					con, err := pool.CreateNewClient()
					if err != nil {
						// creation failed; retry on the next tick
					}
					if con != nil {
						pool.Put(con)
					}
				}
				// calculate how many connections to check in each interval
				heartbeatNum := c.getHeartBeatConnNum(time.Second, allConnNum, pool)
				for i := 0; i < heartbeatNum; i++ {
					// in FIFO order, take a connection that has been idle
					// for more than 10 seconds, then ping it
					idleCon, err := pool.GetIdleConn(10 * time.Second)
					if err != nil {
						// pool error; skip this round
					} else if idleCon == nil {
						// no connection has been idle long enough
					} else {
						idleCon.Cmd("PING")
						pool.Put(idleCon)
					}
				}
			}
		case <-c.stopCh:
			c.pingTicker.Stop()
			return
		}
	}
}
```
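A minimal sketch of the pool from item 1, assuming a mutex-guarded slice ordered oldest-to-newest and a hypothetical lastUsed field for idle tracking (illustrative names, not the actual radix.v2 Pool):

```go
package pool

import (
	"sync"
	"time"
)

// Client is a hypothetical stand-in for a pooled connection.
type Client struct {
	lastUsed time.Time // when the connection was last returned to the pool
	// ... the real connection would live here
}

type Pool struct {
	mu    sync.Mutex
	conns []*Client // conns[0] is the oldest (head), conns[len-1] the newest (tail)
}

// Get pops from the tail (LIFO), so Cmd always reuses the most
// recently active connection, which is least likely to be stale.
func (p *Pool) Get() *Client {
	p.mu.Lock()
	defer p.mu.Unlock()
	n := len(p.conns)
	if n == 0 {
		return nil
	}
	c := p.conns[n-1]
	p.conns = p.conns[:n-1]
	return c
}

// GetIdle pops from the head (FIFO), but only if that connection has
// been idle longer than d; the heartbeat loop uses this to PING
// connections before they go stale.
func (p *Pool) GetIdle(d time.Duration) *Client {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.conns) == 0 {
		return nil
	}
	c := p.conns[0]
	if time.Since(c.lastUsed) < d {
		return nil // the oldest connection is still fresh; nothing to check
	}
	p.conns = p.conns[1:]
	return c
}

// Put appends to the tail, marking the connection as just used.
func (p *Pool) Put(c *Client) {
	p.mu.Lock()
	defer p.mu.Unlock()
	c.lastUsed = time.Now()
	p.conns = append(p.conns, c)
}
```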

mediocregopher commented 7 years ago

Hey @zhaoyuxi, I know I said I wasn't going to address these problems until v3, but a related issue popped up at work that was probably part of the problem you were having in your two issues. I've worked out a small fix in #53 that effectively makes the Get and Put calls in cluster not block the callCh, only retrieving the pool does.
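The shape of that change might look something like the sketch below, with hypothetical stand-in types: only the pool lookup still round-trips through callCh, while Get and Put run on the caller's goroutine (an illustration of the description above, not the actual #53 diff):

```go
package clustersketch

// Conn and Pool are hypothetical stand-ins for the real radix.v2 types.
type Conn interface {
	Cmd(cmd string, args ...interface{}) (interface{}, error)
}

type Pool interface {
	Get() (Conn, error)
	Put(Conn)
}

type Cluster struct {
	callCh chan func()
	pools  map[string]Pool // owned by the spin goroutine
}

// getPool is the only step that still round-trips through the spin
// goroutine via callCh; the map lookup is cheap, so callCh is held
// only briefly.
func (c *Cluster) getPool(key string) Pool {
	ch := make(chan Pool, 1)
	c.callCh <- func() { ch <- c.pools[key] }
	return <-ch
}

// Cmd no longer blocks callCh while getting or putting a connection:
// a slow pool only delays this caller, not the whole cluster.
func (c *Cluster) Cmd(key, cmd string, args ...interface{}) (interface{}, error) {
	pool := c.getPool(key)
	conn, err := pool.Get()
	if err != nil {
		return nil, err
	}
	defer pool.Put(conn)
	return conn.Cmd(cmd, args...)
}
```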

There's still a potential problem if the reset call takes forever, but this should at least help with normal failover cases, which is what we were having problems with. Let me know if you have any questions!