openvstorage / framework

The Framework is a set of components and tools which brings the user an interface (GUI / API) to setup, extend and manage an Open vStorage platform.

Ovs-watcher-framework kills itself due to `Raft Internal Error : etcdserver: request timed out` #763

Closed JeffreyDevloo closed 8 years ago

JeffreyDevloo commented 8 years ago

We saw some problems with etcd in ovs-watcher-framework:

2016-08-02 09:44:41 95300 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-02 09:44:41 95300 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 1 - DEBUG - [framework] Checking master services
2016-08-02 13:01:39 27900 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 4 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
2016-08-02 13:08:05 74100 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 7 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
2016-08-02 13:35:11 83300 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 10 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-02 13:35:27 83600 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 12 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-02 13:35:43 83900 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 14 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-02 13:35:59 84200 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 16 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-02 13:36:15 84500 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 18 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-02 13:36:16 84600 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 19 - DEBUG - [framework]   Volatile store not working correctly
2016-08-02 13:36:16 84600 +0200 - cmp04 - 30044/139653356341056 - extensions/watcher - 20 - DEBUG - [framework] One of the master services is unavailable
kinvaris commented 8 years ago

During configure_disk we saw some problems and started investigating. Apparently the ovs-watcher-framework is having etcd problems again, and it shuts the whole thing down.

2016-08-03 14:59:57 15400 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 93 - DEBUG - Task accepted: ovs.storagerouter.configure_disk[01dbe351-aa53-414f-91ff-c0b2025853cd] pid:11743
2016-08-03 14:59:57 23700 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 16 - INFO - Ensure single CHAINED mode - ID 1470229197_BUKf5OFmk6 - New task ovs.storagerouter.configure_disk with params {'partition_guid': None, 'roles': [], 'disk_guid': u'572f35ee-6448-4553-8024-e5e04f7090bc', 'storagerouter_guid': 'f562bee9-bf82-4e0d-abfb-5aa4ae132dbb', 'offset': 0, 'size': 100030242816} scheduled for execution
2016-08-03 14:59:57 30400 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 17 - INFO - Ensure single CHAINED mode - ID 1470229197_Atoyargo38 - New task ovs.disk.sync_with_reality with params {'storagerouter_guid': 'f562bee9-bf82-4e0d-abfb-5aa4ae132dbb'} scheduled for execution
2016-08-03 15:00:05 83000 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 18 - INFO - Ensure single CHAINED mode - ID 1470229197_Atoyargo38 - Task ovs.disk.sync_with_reality finished successfully
2016-08-03 15:00:05 91800 +0200 - cmp02 - 11743/140682450937664 - lib/storagerouter - 19 - DEBUG - Creating new partition - Offset: 0 bytes - Size: 100030242816 bytes - Roles: []
2016-08-03 15:00:06 37100 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.strategy - 94 - INFO - Received task: ovs.storagerouter.ping[2295218a-6b98-4e47-a4e1-45beb34f71a0]
2016-08-03 15:00:06 37100 +0200 - cmp02 - 11727/140682450937664 - celery/celery.pool - 95 - DEBUG - TaskPool: Apply <function _fast_trace_task at 0x7ff329fe1f50> (args:('ovs.storagerouter.ping', '2295218a-6b98-4e47-a4e1-45beb34f71a0', ('f562bee9-bf82-4e0d-abfb-5aa4ae132dbb', 1470229205), {}, {'utc': True, u'is_eager': False, 'chord': None, u'group': None, 'args': ('f562bee9-bf82-4e0d-abfb-5aa4ae132dbb', 1470229205), 'retries': 0, u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'sr.4V7P4H6F4ELMai9B', u'exchange': u'generic'}, 'expires': None, u'hostname': 'celery@cmp02', 'task': 'ovs.storagerouter.ping', 'callbacks': None, u'correlation_id': u'2295218a-6b98-4e47-a4e1-45beb34f71a0', 'errbacks': None, 'timelimit': (None, None), 'taskset': None, 'kwargs': {}, 'eta': None, u'reply_to': u'eb12a33b-c2e3-30f3-a3ca-7167294c81bd', 'id': '2295218a-6b98-4e47-a4e1-45beb34f71a0', u'headers': {}}) kwargs:{})
2016-08-03 15:00:06 37200 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 96 - DEBUG - Task accepted: ovs.storagerouter.ping[2295218a-6b98-4e47-a4e1-45beb34f71a0] pid:11744
2016-08-03 15:00:06 49200 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 97 - INFO - Task ovs.storagerouter.ping[2295218a-6b98-4e47-a4e1-45beb34f71a0] succeeded in 0.120071053039s: None
2016-08-03 15:00:09 04000 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 20 - INFO - Ensure single CHAINED mode - ID 1470229209_AaXclsunxu - New task ovs.disk.sync_with_reality with params {'storagerouter_guid': 'f562bee9-bf82-4e0d-abfb-5aa4ae132dbb'} scheduled for execution
2016-08-03 15:00:17 63500 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 21 - INFO - Ensure single CHAINED mode - ID 1470229209_AaXclsunxu - Task ovs.disk.sync_with_reality finished successfully
2016-08-03 15:00:17 67900 +0200 - cmp02 - 11743/140682450937664 - lib/storagerouter - 22 - DEBUG - Partition created
2016-08-03 15:00:17 68000 +0200 - cmp02 - 11743/140682450937664 - lib/storagerouter - 23 - DEBUG - Creating filesystem

worker: Warm shutdown (MainProcess)

...

[2016-08-03 15:01:43,618: ERROR/MainProcess] Process 'Worker-2' pid:11744 exited with 'signal 15 (SIGTERM)'
[2016-08-03 15:01:44,640: ERROR/MainProcess] Process 'Worker-1' pid:11743 exited with 'signal 15 (SIGTERM)'
2016-08-03 15:01:44 65400 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 122 - ERROR - Task ovs.storagerouter.configure_disk[01dbe351-aa53-414f-91ff-c0b2025853cd] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).

/var/log/syslog

Aug  3 15:01:43 cmp02 kernel: [3025534.677653] init: ovs-watcher-framework main process (11671) terminated with status 1
Aug  3 15:01:43 cmp02 kernel: [3025534.677664] init: ovs-watcher-framework main process ended, respawning
Aug  3 15:01:43 cmp02 kernel: [3025534.690557] init: ovs-volumerouter-consumer main process (11673) killed by TERM signal

/var/log/upstart/ovs-watcher-framework.log

2016-08-03 14:48:10 35800 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 9 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:48:26 36100 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 11 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
2016-08-03 14:48:42 36400 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 13 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:48:58 36700 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 15 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:49:14 37100 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 17 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:49:15 37200 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 18 - DEBUG - [framework]   Volatile store not working correctly
2016-08-03 14:49:15 37200 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 19 - DEBUG - [framework] One of the master services is unavailable
2016-08-03 14:49:16 69800 +0200 - cmp02 - 11572/140034382976832 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 14:49:16 69800 +0200 - cmp02 - 11572/140034382976832 - extensions/watcher - 1 - DEBUG - [framework] Waiting for master services
2016-08-03 14:49:17 87100 +0200 - cmp02 - 11572/140034382976832 - extensions/watcher - 2 - DEBUG - [framework] Master services available
2016-08-03 14:49:18 29100 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 14:49:18 29200 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 1 - DEBUG - [framework] Checking master services
2016-08-03 15:00:36 51100 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 3 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:00:52 51400 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 5 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:08 51800 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 7 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:24 52100 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 9 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:40 52300 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 11 - DEBUG - [framework]   Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:41 52500 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 12 - DEBUG - [framework]   Volatile store not working correctly
2016-08-03 15:01:41 52500 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 13 - DEBUG - [framework] One of the master services is unavailable
2016-08-03 15:01:45 00700 +0200 - cmp02 - 14818/140250522412864 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 15:01:45 00700 +0200 - cmp02 - 14818/140250522412864 - extensions/watcher - 1 - DEBUG - [framework] Waiting for master services
2016-08-03 15:01:46 29000 +0200 - cmp02 - 14818/140250522412864 - extensions/watcher - 2 - DEBUG - [framework] Master services available
2016-08-03 15:01:46 58800 +0200 - cmp02 - 14836/140100687357760 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 15:01:46 58800 +0200 - cmp02 - 14836/140100687357760 - extensions/watcher - 1 - DEBUG - [framework] Checking master services
khenderick commented 8 years ago

This is not an issue with the framework or with ovs-watcher-framework. The watcher is designed to monitor the services we depend on; if it detects an issue with one of them, it shuts down the framework so that tasks don't start failing at random places. As soon as the services are running again, the watcher automatically restarts the framework.
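The behaviour described above can be sketched as a simple supervision loop. This is a hypothetical illustration, not the actual OVS watcher code; `check_store`, `stop_framework`, and the retry/interval values are stand-ins:

```python
import time

def watcher_loop(check_store, stop_framework, retries=5, interval=15):
    """Hypothetical sketch of the watcher behaviour: probe a dependency
    (e.g. a small etcd read/write test), and after `retries` consecutive
    failures shut the framework down instead of letting every task fail
    at random places. The init system respawns the watcher, which then
    waits for the master services to come back before restarting."""
    failures = 0
    while True:
        if check_store():
            failures = 0          # dependency healthy, reset the counter
        else:
            failures += 1
            if failures >= retries:
                stop_framework()  # deliberate shutdown, as seen in the logs
                return
        time.sleep(interval)
```

This matches the log pattern above: several "Error during volatile store test" lines roughly 16 seconds apart, then "Volatile store not working correctly" and a shutdown.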

khenderick commented 8 years ago

As discussed with @pploegaert, increasing the timeouts of etcd itself might make it behave better. Feel free to update this ticket with a more reasonable timeout (one that gets through the tests), then we can make that the default value.
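For reference, etcd's Raft timing is controlled by the `-heartbeat-interval` and `-election-timeout` flags (both in milliseconds), and etcd's tuning guidance is to keep the election timeout around 10x the heartbeat interval. A sketch with illustrative values only (node name and numbers are placeholders, not the values this ticket settled on):

```shell
# Illustrative etcd invocation: raise Raft timing so slow disks or a
# congested network don't trigger spurious leader elections.
# Guidance: election-timeout should be roughly 10x heartbeat-interval.
exec etcd -name my-node \
  -heartbeat-interval=500 \
  -election-timeout=5000
```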

pploegaert commented 8 years ago

openvstorage-2.7.2-rev.3835.9f486e5

JeffreyDevloo commented 8 years ago

Reproduction of the issue

Could reproduce the issue on 2.7.2-rev.3833.69ff307-1 with `exec etcd -name SHxPMGoM0OZ5swLd -heartbeat-interval=1000 -election-timeout=5000` and saw `Raft Internal Error` on multiple nodes.

Validated possible fix

Could not reproduce the issue on 2.7.2-rev.3867.ec9d46d-1 with `exec etcd -name 8IABN2BZLuJPFj06 -snapshot-count=2500 -heartbeat-interval=2500 -election-timeout=25000` (this is now the default).
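A quick sanity check on the two flag sets: the reproduction flags used a 5:1 ratio of election timeout to heartbeat interval, while the validated defaults restore the roughly 10:1 ratio etcd's tuning guidance recommends (on top of raising the absolute values). A small check, using the numbers from this ticket:

```python
# (heartbeat-interval ms, election-timeout ms) pairs from this ticket
old = (1000, 5000)    # flags that reproduced the Raft Internal Error
new = (2500, 25000)   # flags that passed the tests (now the default)

def ratio(heartbeat_ms, election_ms):
    """Election timeout as a multiple of the heartbeat interval."""
    return election_ms / heartbeat_ms

print(ratio(*old))  # 5.0  -- below the recommended ~10x
print(ratio(*new))  # 10.0 -- matches the recommendation
```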

Steps:

Setups

Hyperconverged setups 2.7.2-rev.3833.69ff307-1:

2.7.2-rev.3867.ec9d46d-1: