microsoftarchive / redis

Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes
http://redis.io

Redis begins to return timeouts after being under load for 3 to 4 hours #296

Closed MomentumPete closed 9 years ago

MomentumPete commented 9 years ago

We just went live with Redis on Windows using the StackExchange.Redis client and experienced a problem. On 3 occasions yesterday, within 3 to 4 hours Redis began to return timeouts to all clients. This seemed to cause our application servers to block, with extreme delays in processing all other requests sharing the application pools. The performance for the 4 good hours was excellent, and included some of our peak loads (1:00 to 4:00). The failures (8:30 AM, 11:45 AM, and 4:45 PM) were resolved by flushing the db and restarting Redis. After the third failure we disabled replication to eliminate it as a cause, but overnight the load was not significant enough to reproduce the issue.

We are storing very transient data that is refreshed by a client application every 5 minutes. We have 10 applications that load the data on a schedule, and approximately 1,000 to 1,500 clients that access the server. The solution is a multi-tier .NET C# application: the web site is hosted on 3 web servers, with a load balancer in front of our 2 application servers, which host the application layer (which connects to Redis). The application servers are 4-core, 16 GB RAM machines. A minimal sketch of this client pattern follows.
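
A minimal sketch of the client pattern described above, assuming StackExchange.Redis: one shared ConnectionMultiplexer per process, explicit timeouts, and a TTL slightly longer than the 5-minute refresh so stale entries expire on their own. The host name, timeout values, key, and TTL are illustrative assumptions, not the actual production settings.

```csharp
// Minimal sketch, assuming StackExchange.Redis. Host name, timeouts, key
// and TTL are placeholders, not the actual production settings.
using System;
using StackExchange.Redis;

public static class RedisCache
{
    // StackExchange.Redis is built around one shared multiplexer per process.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(
                "redis-host:6379,connectTimeout=5000,syncTimeout=2000,abortConnect=false"));

    // Called by the loader applications on their 5-minute schedule.
    public static void RefreshEntry(string key, string payload)
    {
        IDatabase db = Connection.Value.GetDatabase();
        // Expire a bit after the next scheduled refresh so readers never see a gap.
        db.StringSet(key, payload, TimeSpan.FromMinutes(6));
    }

    // Called by the ~1,000 to 1,500 reading clients.
    public static string ReadEntry(string key)
    {
        return Connection.Value.GetDatabase().StringGet(key);
    }
}
```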

Our current configuration, post-issue, is a single Redis server with replication off. Below is some additional information on our configuration and the current state of the server.

Please let me know if you have any guidance on this issue; we are currently resorting to constant monitoring and restarting Redis when we see the issue start to occur.

Thanks

Pete

Config File

Redis configuration file example

Note on units: when memory size is needed, it is possible to specify

it in the usual form of 1k 5GB 4M and so forth:

1k => 1000 bytes

1kb => 1024 bytes

1m => 1000000 bytes

1mb => 1024*1024 bytes

1g => 1000000000 bytes

1gb => 1024*1024*1024 bytes

units are case insensitive so 1GB 1Gb 1gB are all the same.

By default Redis does not run as a daemon. Use 'yes' if you need it.

Note that Redis will write a pid file in /var/run/redis.pid when daemonized.

daemonize no

When running daemonized, Redis writes a pid file in /var/run/redis.pid by

default. You can specify a custom pid file location here.

pidfile /var/run/redis.pid

Accept connections on the specified port, default is 6379.

If port 0 is specified Redis will not listen on a TCP socket.

port 6379

If you want you can bind a single interface, if the bind option is not

specified all the interfaces will listen for incoming connections.

bind 127.0.0.1

Specify the path for the unix socket that will be used to listen for

incoming connections. There is no default, so Redis will not listen

on a unix socket when not specified.

unixsocket /tmp/redis.sock

unixsocketperm 755

Close the connection after a client is idle for N seconds (0 to disable)

timeout 0.5

Set server verbosity to 'debug'

it can be one of:

debug (a lot of information, useful for development/testing)

verbose (many rarely useful info, but not a mess like the debug level)

notice (moderately verbose, what you want in production probably)

warning (only very important / critical messages are logged)

loglevel warning

Specify the log file name. Also 'stdout' can be used to force

Redis to log on the standard output. Note that if you use standard

output for logging but daemonize, logs will be sent to /dev/null

logfile "e:/log/redis.log"

To enable logging to the system logger, just set 'syslog-enabled' to yes,

and optionally update the other syslog parameters to suit your needs.

syslog-enabled no

Specify the syslog identity.

syslog-ident redis

Specify the syslog facility. Must be USER or between LOCAL0-LOCAL7.

syslog-facility local0

Set the number of databases. The default database is DB 0, you can select

a different one on a per-connection basis using SELECT <dbid> where

dbid is a number between 0 and 'databases'-1

databases 16

SNAPSHOTTING

Save the DB on disk:

save <seconds> <changes>

Will save the DB if both the given number of seconds and the given

number of write operations against the DB occurred.

In the example below the behaviour will be to save:

after 900 sec (15 min) if at least 1 key changed

after 300 sec (5 min) if at least 10 keys changed

after 60 sec if at least 10000 keys changed

Note: you can disable saving altogether by commenting out all the "save" lines.

save 900 1

save 300 10

save 60 10000

Compress string objects using LZF when dumping .rdb databases?

By default that's set to 'yes' as it's almost always a win.

If you want to save some CPU in the saving child set it to 'no' but

the dataset will likely be bigger if you have compressible values or keys.

rdbcompression yes

The filename where to dump the DB

dbfilename dump.rdb

The working directory.

The DB will be written inside this directory, with the filename specified

above using the 'dbfilename' configuration directive.

Also the Append Only File will be created inside this directory.

Note that you must specify a directory here, not a file name.

dir "C:/Program Files/Redis/data"

REPLICATION

Master-Slave replication. Use slaveof to make a Redis instance a copy of

another Redis server. Note that the configuration is local to the slave

so for example it is possible to configure the slave to save the DB with a

different interval, or to listen to another port, and so on.

slaveof <masterip> <masterport>

If the master is password protected (using the "requirepass" configuration

directive below) it is possible to tell the slave to authenticate before

starting the replication synchronization process, otherwise the master will

refuse the slave request.

masterauth <master-password>

When a slave loses the connection with the master, or when the replication

is still in progress, the slave can act in two different ways:

1) if slave-serve-stale-data is set to 'yes' (the default) the slave will

still reply to client requests, possibly with out-of-date data, or the

data set may just be empty if this is the first synchronization.

2) if slave-serve-stale-data is set to 'no' the slave will reply with

an error "SYNC with master in progress" to all kinds of commands

except INFO and SLAVEOF.

slave-serve-stale-data yes

Slaves send PINGs to the server at a predefined interval. It's possible to change

this interval with the repl_ping_slave_period option. The default value is 10

seconds.

repl-ping-slave-period 10

The following option sets a timeout for both Bulk transfer I/O timeout and

master data or ping response timeout. The default value is 60 seconds.

It is important to make sure that this value is greater than the value

specified for repl-ping-slave-period otherwise a timeout will be detected

every time there is low traffic between the master and the slave.

repl-timeout 60

SECURITY

Require clients to issue AUTH before processing any other

commands. This might be useful in environments in which you do not trust

others with access to the host running redis-server.

This should stay commented out for backward compatibility and because most

people do not need auth (e.g. they run their own servers).

Warning: since Redis is pretty fast, an outside user can try up to

150k passwords per second against a good box. This means that you should

use a very strong password, otherwise it will be very easy to break.

requirepass foobared

Command renaming.

It is possible to change the name of dangerous commands in a shared

environment. For instance the CONFIG command may be renamed into something

hard to guess so that it will still be available for internal-use

tools but not available for general clients.

Example:

rename-command CONFIG b840fc02d524045429941cc15f59e41cb7be6c52

It is also possible to completely kill a command by renaming it to

an empty string:

rename-command CONFIG ""

LIMITS

Set the max number of connected clients at the same time. By default there

is no limit, and it's up to the number of file descriptors the Redis process

is able to open. The special value '0' means no limits.

Once the limit is reached Redis will close all new connections, sending

an error 'max number of clients reached'.

maxclients 128

maxclients 5000

Don't use more memory than the specified amount of bytes.

When the memory limit is reached Redis will try to remove keys with an

EXPIRE set. It will try to start freeing keys that are going to expire

in little time and preserve keys with a longer time to live.

Redis will also try to remove objects from free lists if possible.

If all this fails, Redis will start to reply with errors to commands

that will use more memory, like SET, LPUSH, and so on, and will continue

to reply to most read-only commands like GET.

WARNING: maxmemory can be a good idea mainly if you want to use Redis as a

'state' server or cache, not as a real DB. When Redis is used as a real

database the memory usage will grow over the weeks, it will be obvious if

it is going to use too much memory in the long run, and you'll have the time

to upgrade. With maxmemory after the limit is reached you'll start to get

errors for write operations, and this may even lead to DB inconsistency.

maxmemory

maxmemory 10gb

MAXMEMORY POLICY: how will Redis select what to remove when maxmemory

is reached? You can select among five behaviors:

volatile-lru -> remove the key with an expire set using an LRU algorithm

allkeys-lru -> remove any key according to the LRU algorithm

volatile-random -> remove a random key with an expire set

allkeys-random -> remove a random key, any key

volatile-ttl -> remove the key with the nearest expire time (minor TTL)

noeviction -> don't expire at all, just return an error on write operations

Note: with any of these policies, Redis will return an error on write

operations when there are no suitable keys for eviction.

At the date of writing these commands are: set setnx setex append

incr decr rpush lpush rpushx lpushx linsert lset rpoplpush sadd

sinter sinterstore sunion sunionstore sdiff sdiffstore zadd zincrby

zunionstore zinterstore hset hsetnx hmset hincrby incrby decrby

getset mset msetnx exec sort

The default is:

maxmemory-policy volatile-lru

maxmemory-policy allkeys-lru

LRU and minimal TTL algorithms are not precise algorithms but approximated

algorithms (in order to save memory), so you can also select the sample

size to check. For instance, by default Redis will check three keys and

pick the one that was used least recently; you can change the sample size

using the following configuration directive.

maxmemory-samples 3

APPEND ONLY MODE

By default Redis asynchronously dumps the dataset on disk. If you can live

with the idea that the latest records will be lost if something like a crash

happens, this is the preferred way to run Redis. If instead you care a lot

about your data and don't want a single record to be lost, you should

enable the append only mode: when this mode is enabled Redis will append

every write operation received in the file appendonly.aof. This file will

be read on startup in order to rebuild the full dataset in memory.

Note that you can have both the async dumps and the append only file if you

like (you have to comment the "save" statements above to disable the dumps).

Still if append only mode is enabled Redis will load the data from the

log file at startup ignoring the dump.rdb file.

IMPORTANT: Check BGREWRITEAOF to learn how to rewrite the append

log file in background when it gets too big.

appendonly no

The name of the append only file (default: "appendonly.aof")

appendfilename appendonly.aof

The fsync() call tells the operating system to actually write data on disk

instead of waiting for more data in the output buffer. Some OSes will really flush

data to disk; others will just try to do it ASAP.

Redis supports three different modes:

no: don't fsync, just let the OS flush the data when it wants. Faster.

always: fsync after every write to the append only log. Slow, safest.

everysec: fsync only if one second passed since the last fsync. Compromise.

The default is "everysec" that's usually the right compromise between

speed and data safety. It's up to you to understand if you can relax this to

"no" that will will let the operating system flush the output buffer when

it wants, for better performances (but if you can live with the idea of

some data loss consider the default persistence mode that's snapshotting),

or on the contrary, use "always" that's very slow but a bit safer than

everysec.

If unsure, use "everysec".

appendfsync always

appendfsync everysec

appendfsync no

When the AOF fsync policy is set to always or everysec, and a background

saving process (a background save or AOF log background rewriting) is

performing a lot of I/O against the disk, in some Linux configurations

Redis may block too long on the fsync() call. Note that there is no fix for

this currently, as even performing fsync in a different thread will block

our synchronous write(2) call.

In order to mitigate this problem it's possible to use the following option

that will prevent fsync() from being called in the main process while a

BGSAVE or BGREWRITEAOF is in progress.

This means that while another child is saving, the durability of Redis is

the same as "appendfsync none", which in practical terms means that it is

possible to lose up to 30 seconds of log in the worst scenario (with the

default Linux settings).

If you have latency problems turn this to "yes". Otherwise leave it as

"no" that is the safest pick from the point of view of durability.

no-appendfsync-on-rewrite no

Automatic rewrite of the append only file.

Redis is able to automatically rewrite the log file implicitly calling

BGREWRITEAOF when the AOF log size grows by the specified percentage.

This is how it works: Redis remembers the size of the AOF file after the

latest rewrite (or if no rewrite happened since the restart, the size of

the AOF at startup is used).

This base size is compared to the current size. If the current size is

bigger than the specified percentage, the rewrite is triggered. Also

you need to specify a minimal size for the AOF file to be rewritten, this

is useful to avoid rewriting the AOF file even if the percentage increase

is reached but it is still pretty small.

Specify a percentage of zero in order to disable the automatic AOF

rewrite feature.

auto-aof-rewrite-percentage 100

auto-aof-rewrite-min-size 64mb

SLOW LOG

The Redis Slow Log is a system to log queries that exceeded a specified

execution time. The execution time does not include the I/O operations

like talking with the client, sending the reply and so forth,

but just the time needed to actually execute the command (this is the only

stage of command execution where the thread is blocked and can not serve

other requests in the meantime).

You can configure the slow log with two parameters: one tells Redis

what is the execution time, in microseconds, to exceed in order for the

command to get logged, and the other parameter is the length of the

slow log. When a new command is logged the oldest one is removed from the

queue of logged commands.

The following time is expressed in microseconds, so 1000000 is equivalent

to one second. Note that a negative number disables the slow log, while

a value of zero forces the logging of every command.

slowlog-log-slower-than 10000

There is no limit to this length. Just be aware that it will consume memory.

You can reclaim memory used by the slow log with SLOWLOG RESET.

slowlog-max-len 1024

VIRTUAL MEMORY

WARNING! Virtual Memory is deprecated in Redis 2.4

The use of Virtual Memory is strongly discouraged.

WARNING! Virtual Memory is deprecated in Redis 2.4

The use of Virtual Memory is strongly discouraged.

Virtual Memory allows Redis to work with datasets bigger than the actual

amount of RAM needed to hold the whole dataset in memory.

In order to do so, frequently used keys are kept in memory while the other keys

are swapped into a swap file, similarly to what operating systems do

with memory pages.

To enable VM just set 'vm-enabled' to yes, and set the following three

VM parameters according to your needs.

vm-enabled no

vm-enabled yes

This is the path of the Redis swap file. As you can guess, swap files

can't be shared by different Redis instances, so make sure to use a swap

file for every redis process you are running. Redis will complain if the

swap file is already in use.

The best kind of storage for the Redis swap file (that's accessed at random)

is a Solid State Disk (SSD).

* WARNING * if you are using a shared hosting the default of putting

the swap file under /tmp is not secure. Create a dir with access granted

only to Redis user and configure Redis to create the swap file there.

vm-swap-file "C:/Program Files/Redis/data/redis.swap"

vm-max-memory configures the VM to use at max the specified amount of

RAM. Everything that does not fit will be swapped on disk if possible, that

is, if there is still enough contiguous space in the swap file.

With vm-max-memory 0 the system will swap everything it can. Not a good

default, just specify the max amount of RAM you can in bytes, but it's

better to leave some margin. For instance specify an amount of RAM

that's more or less between 60 and 80% of your free RAM.

vm-max-memory 0

The Redis swap file is split into pages. An object can be saved using multiple

contiguous pages, but pages can't be shared between different objects.

So if your page is too big, small objects swapped out on disk will waste

a lot of space. If your page is too small, there is less space in the swap

file (assuming you configured the same number of total swap file pages).

If you use a lot of small objects, use a page size of 64 or 32 bytes.

If you use a lot of big objects, use a bigger page size.

If unsure, use the default :)

vm-page-size 32

Number of total memory pages in the swap file.

Given that the page table (a bitmap of free/used pages) is taken in memory,

every 8 pages on disk will consume 1 byte of RAM.

The total swap size is vm-page-size * vm-pages

With the default of 32-byte memory pages and 134217728 pages Redis will

use a 4 GB swap file, that will use 16 MB of RAM for the page table.

It's better to use the smallest acceptable value for your application,

but the default is large in order to work in most conditions.

vm-pages 134217728

Max number of VM I/O threads running at the same time.

These threads are used to read/write data from/to the swap file; since they

also encode and decode objects from disk to memory or the reverse, a bigger

number of threads can help with big objects even if they can't help with

I/O itself, as the physical device may not be able to cope with many

read/write operations at the same time.

The special value of 0 turns off threaded I/O and enables the blocking

Virtual Memory implementation.

vm-max-threads 4

ADVANCED CONFIG

Hashes are encoded in a special way (much more memory efficient) when they

have at max a given number of elements, and the biggest element does not

exceed a given threshold. You can configure these limits with the following

configuration directives.

hash-max-zipmap-entries 512

hash-max-zipmap-value 64

Similarly to hashes, small lists are also encoded in a special way in order

to save a lot of space. The special representation is only used when

you are under the following limits:

list-max-ziplist-entries 512

list-max-ziplist-value 64

Sets have a special encoding in just one case: when a set is composed

of just strings that happen to be integers in radix 10 in the range

of 64 bit signed integers.

The following configuration setting sets the limit on the size of the

set in order to use this special memory saving encoding.

set-max-intset-entries 512

Similarly to hashes and lists, sorted sets are also specially encoded in

order to save a lot of space. This encoding is only used when the length and

elements of a sorted set are below the following limits:

zset-max-ziplist-entries 128

zset-max-ziplist-value 64

Active rehashing uses 1 millisecond every 100 milliseconds of CPU time in

order to help rehashing the main Redis hash table (the one mapping top-level

keys to values). The hash table implementation redis uses (see dict.c)

performs lazy rehashing: the more operations you run against a hash table

that is rehashing, the more rehashing "steps" are performed, so if the

server is idle the rehashing never completes and some more memory is used

by the hash table.

The default is to use this millisecond 10 times every second in order to

actively rehash the main dictionaries, freeing memory when possible.

If unsure:

use "activerehashing no" if you have hard latency requirements and it is

not a good thing in your environment that Redis can reply from time to time

to queries with a 2 millisecond delay.

use "activerehashing yes" if you don't have such hard requirements but

want to free memory asap when possible.

activerehashing yes

INCLUDES

Include one or more other config files here. This is useful if you

have a standard template that goes to all Redis servers but also need

to customize a few per-server settings. Include files can include

other files, so use this wisely.

include /path/to/local.conf

include /path/to/other.conf

INFO -- Before

redis 127.0.0.1:6379> info
redis_version:2.4.6
redis_git_sha1:26cdd13a
redis_git_dirty:0
arch_bits:64
multiplexing_api:winsock2
gcc_version:4.6.1
process_id:7548
uptime_in_seconds:50684
uptime_in_days:0
lru_clock:1383512
used_cpu_sys:249.94
used_cpu_user:367.84
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
connected_clients:21
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:38039304
used_memory_human:36.28M
used_memory_rss:38039304
used_memory_peak:84201768
used_memory_peak_human:80.30M
mem_fragmentation_ratio:1.00
mem_allocator:libc
loading:0
aof_enabled:0
changes_since_last_save:-1041
bgsave_in_progress:0
last_save_time:1439898399
bgrewriteaof_in_progress:0
total_connections_received:428313
total_commands_processed:3706181
expired_keys:0
evicted_keys:0
keyspace_hits:1444130
keyspace_misses:495934
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:0
vm_enabled:0
role:master
db0:keys=3800,expires=0

SLOWLOG

redis 127.0.0.1:6379> slowlog get 9
1) 1) (integer) 4125
   2) (integer) 1439899490
   3) (integer) 15621
   4) 1) "INFO"
2) 1) (integer) 4124
   2) (integer) 1439899486
   3) (integer) 15612
   4) 1) "INFO"
3) 1) (integer) 4123
   2) (integer) 1439899480
   3) (integer) 15624
   4) 1) "INFO"
4) 1) (integer) 4122
   2) (integer) 1439899479
   3) (integer) 15630
   4) 1) "INFO"
5) 1) (integer) 4121
   2) (integer) 1439899479
   3) (integer) 15623
   4) 1) "INFO"
6) 1) (integer) 4120
   2) (integer) 1439899473
   3) (integer) 15622
   4) 1) "INFO"
7) 1) (integer) 4119
   2) (integer) 1439899473
   3) (integer) 15624
   4) 1) "CONFIG"
      2) "GET"
      3) "slave-read-only"
8) 1) (integer) 4118
   2) (integer) 1439899472
   3) (integer) 15622
   4) 1) "CONFIG"
      2) "GET"
      3) "timeout"
9) 1) (integer) 4117
   2) (integer) 1439899463
   3) (integer) 15620
   4) 1) "INFO"

Result from Slow Log 10

[entries 1-5 truncated; the dump resumes mid-way through a large opaque value, elided here]

6) 1) (integer) 4090
   2) (integer) 1439899318
   3) (integer) 15625
   4) 1) "CONFIG"
      2) "GET"
      3) "timeout"
7) 1) (integer) 4089
   2) (integer) 1439899316
   3) (integer) 15626
   4) 1) "CONFIG"
      2) "GET"
      3) "databases"
8) 1) (integer) 4088
   2) (integer) 1439899308
   3) (integer) 15624
   4) 1) "INFO"
9) 1) (integer) 4087
   2) (integer) 1439899305
   3) (integer) 15278
   4) 1) "INFO"
10) 1) (integer) 4086
    2) (integer) 1439899299
    3) (integer) 15624
    4) 1) "CONFIG"
       2) "GET"
       3) "databases"
(0.97s)
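
For anyone doing the same kind of constant monitoring from .NET, a hypothetical sketch that pulls the slow log through StackExchange.Redis, whose IServer exposes SlowlogGet; the host name and entry count are placeholders.

```csharp
// Hypothetical monitoring sketch; host name and count are placeholders.
using System;
using StackExchange.Redis;

public static class SlowlogMonitor
{
    public static void DumpSlowCommands(ConnectionMultiplexer mux)
    {
        IServer server = mux.GetServer("redis-host", 6379);
        foreach (CommandTrace trace in server.SlowlogGet(10))
        {
            // Duration is the pure execution time, i.e. what the
            // slowlog-log-slower-than threshold (in microseconds) measures.
            Console.WriteLine($"{trace.Time:u}  {trace.Duration.TotalMilliseconds:F1} ms  " +
                              string.Join(" ", trace.Arguments));
        }
    }
}
```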

enricogior commented 9 years ago

Hi @MomentumPete, the issue you are reporting is very likely the same as https://github.com/MSOpenTech/redis/issues/281. We have a fix for that problem; it will be in the next release along with other fixes. I guess you need the fix as soon as possible, in which case I can provide a private build so you don't have to wait for the next release (we don't have a date yet). Let me know. Thank you.

MomentumPete commented 9 years ago

Thanks for the quick reply and a couple of questions.

Thanks

Pete

enricogior commented 9 years ago

@MomentumPete no, removing the disk writing doesn't mitigate the problem.

To get the private build there are two ways:

MomentumPete commented 9 years ago

Thanks. I have emailed for the build. In the meantime, should we just schedule a regular restart of Redis every few hours?

Thanks

Pete

enricogior commented 9 years ago

@MomentumPete, a regular restart of Redis helps, but the error may occur at any time. It would be better to use a sentinel and force a restart only when needed.
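
A hypothetical watchdog in that spirit: ping on a short interval and restart only after repeated failures. The Windows service name "redis", the failure threshold, and the polling interval are all assumptions.

```csharp
// Hypothetical watchdog sketch; service name, threshold and interval are assumptions.
using System;
using System.ServiceProcess;
using System.Threading;
using StackExchange.Redis;

public static class RedisWatchdog
{
    public static void Run(ConnectionMultiplexer mux)
    {
        int failures = 0;
        while (true)
        {
            try
            {
                mux.GetDatabase().Ping();   // throws on timeout or connection loss
                failures = 0;
            }
            catch (Exception)               // count timeouts and connection errors alike
            {
                if (++failures >= 3)        // tolerate isolated blips
                {
                    using (var svc = new ServiceController("redis"))
                    {
                        svc.Stop();
                        svc.WaitForStatus(ServiceControllerStatus.Stopped, TimeSpan.FromSeconds(30));
                        svc.Start();
                    }
                    failures = 0;
                }
            }
            Thread.Sleep(TimeSpan.FromSeconds(10));
        }
    }
}
```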

MomentumPete commented 9 years ago

Thanks. One more question: would a topology that separates writes and reads, with all my clients writing to one instance and replication pushing the data to a second instance where the reads occur, help? Sorry, just trying to get a setup that keeps things going until we can get, validate, test, and deploy the new build.

Pete

enricogior commented 9 years ago

@MomentumPete, it may help, because it will reduce the probability of running into the conditions that cause the bug.
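
A sketch of what that split could look like from the client side, assuming StackExchange.Redis: hand the multiplexer both endpoints, let writes route to the master as usual, and steer reads to the slave with CommandFlags.PreferSlave (reads fall back to the master if the slave is down). Host names are placeholders.

```csharp
// Sketch of a read/write split, assuming StackExchange.Redis; hosts are placeholders.
using StackExchange.Redis;

public static class SplitReadsWrites
{
    private static readonly ConnectionMultiplexer Mux =
        ConnectionMultiplexer.Connect("redis-master:6379,redis-slave:6379,abortConnect=false");

    public static void Write(string key, string value)
    {
        Mux.GetDatabase().StringSet(key, value);   // writes always go to the master
    }

    public static string Read(string key)
    {
        // PreferSlave routes reads to the slave when it is reachable.
        return Mux.GetDatabase().StringGet(key, CommandFlags.PreferSlave);
    }
}
```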

MomentumPete commented 9 years ago

Thanks. So what are the conditions that cause the bug to occur ?

enricogior commented 9 years ago

@MomentumPete, the bug is caused by a random connection error: if it occurs at a specific moment while the connection is being accepted, it blocks all following connections. Reducing the number of reads may reduce the probability of that random error occurring.
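
One way to make that failure mode visible from the client side, as a hedged sketch: StackExchange.Redis raises ConnectionFailed and ConnectionRestored events, so logging them makes a server that has stopped accepting connections show up immediately rather than as slow page loads.

```csharp
// Diagnostic sketch: log connection-state events from StackExchange.Redis.
using System;
using StackExchange.Redis;

public static class ConnectionDiagnostics
{
    public static ConnectionMultiplexer ConnectWithLogging(string configuration)
    {
        ConnectionMultiplexer mux = ConnectionMultiplexer.Connect(configuration);
        mux.ConnectionFailed += (sender, e) =>
            Console.WriteLine($"Connection failed: {e.EndPoint} {e.FailureType} {e.Exception?.Message}");
        mux.ConnectionRestored += (sender, e) =>
            Console.WriteLine($"Connection restored: {e.EndPoint}");
        return mux;
    }
}
```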

enricogior commented 9 years ago

Closing this since it turned out to be a client issue.