During configure_disk we saw some problems and started investigating. Apparently the ovs-watcher-framework was having etcd problems again and shut the whole thing down.
2016-08-03 14:59:57 15400 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 93 - DEBUG - Task accepted: ovs.storagerouter.configure_disk[01dbe351-aa53-414f-91ff-c0b2025853cd] pid:11743
2016-08-03 14:59:57 23700 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 16 - INFO - Ensure single CHAINED mode - ID 1470229197_BUKf5OFmk6 - New task ovs.storagerouter.configure_disk with params {'partition_guid': None, 'roles': [], 'disk_guid': u'572f35ee-6448-4553-8024-e5e04f7090bc', 'storagerouter_guid': 'f562bee9-bf82-4e0d-abfb-5aa4ae132dbb', 'offset': 0, 'size': 100030242816} scheduled for execution
2016-08-03 14:59:57 30400 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 17 - INFO - Ensure single CHAINED mode - ID 1470229197_Atoyargo38 - New task ovs.disk.sync_with_reality with params {'storagerouter_guid': 'f562bee9-bf82-4e0d-abfb-5aa4ae132dbb'} scheduled for execution
2016-08-03 15:00:05 83000 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 18 - INFO - Ensure single CHAINED mode - ID 1470229197_Atoyargo38 - Task ovs.disk.sync_with_reality finished successfully
2016-08-03 15:00:05 91800 +0200 - cmp02 - 11743/140682450937664 - lib/storagerouter - 19 - DEBUG - Creating new partition - Offset: 0 bytes - Size: 100030242816 bytes - Roles: []
2016-08-03 15:00:06 37100 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.strategy - 94 - INFO - Received task: ovs.storagerouter.ping[2295218a-6b98-4e47-a4e1-45beb34f71a0]
2016-08-03 15:00:06 37100 +0200 - cmp02 - 11727/140682450937664 - celery/celery.pool - 95 - DEBUG - TaskPool: Apply <function _fast_trace_task at 0x7ff329fe1f50> (args:('ovs.storagerouter.ping', '2295218a-6b98-4e47-a4e1-45beb34f71a0', ('f562bee9-bf82-4e0d-abfb-5aa4ae132dbb', 1470229205), {}, {'utc': True, u'is_eager': False, 'chord': None, u'group': None, 'args': ('f562bee9-bf82-4e0d-abfb-5aa4ae132dbb', 1470229205), 'retries': 0, u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'sr.4V7P4H6F4ELMai9B', u'exchange': u'generic'}, 'expires': None, u'hostname': 'celery@cmp02', 'task': 'ovs.storagerouter.ping', 'callbacks': None, u'correlation_id': u'2295218a-6b98-4e47-a4e1-45beb34f71a0', 'errbacks': None, 'timelimit': (None, None), 'taskset': None, 'kwargs': {}, 'eta': None, u'reply_to': u'eb12a33b-c2e3-30f3-a3ca-7167294c81bd', 'id': '2295218a-6b98-4e47-a4e1-45beb34f71a0', u'headers': {}}) kwargs:{})
2016-08-03 15:00:06 37200 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 96 - DEBUG - Task accepted: ovs.storagerouter.ping[2295218a-6b98-4e47-a4e1-45beb34f71a0] pid:11744
2016-08-03 15:00:06 49200 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 97 - INFO - Task ovs.storagerouter.ping[2295218a-6b98-4e47-a4e1-45beb34f71a0] succeeded in 0.120071053039s: None
2016-08-03 15:00:09 04000 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 20 - INFO - Ensure single CHAINED mode - ID 1470229209_AaXclsunxu - New task ovs.disk.sync_with_reality with params {'storagerouter_guid': 'f562bee9-bf82-4e0d-abfb-5aa4ae132dbb'} scheduled for execution
2016-08-03 15:00:17 63500 +0200 - cmp02 - 11743/140682450937664 - lib/scheduled tasks - 21 - INFO - Ensure single CHAINED mode - ID 1470229209_AaXclsunxu - Task ovs.disk.sync_with_reality finished successfully
2016-08-03 15:00:17 67900 +0200 - cmp02 - 11743/140682450937664 - lib/storagerouter - 22 - DEBUG - Partition created
2016-08-03 15:00:17 68000 +0200 - cmp02 - 11743/140682450937664 - lib/storagerouter - 23 - DEBUG - Creating filesystem
worker: Warm shutdown (MainProcess)
...
[2016-08-03 15:01:43,618: ERROR/MainProcess] Process 'Worker-2' pid:11744 exited with 'signal 15 (SIGTERM)'
[2016-08-03 15:01:44,640: ERROR/MainProcess] Process 'Worker-1' pid:11743 exited with 'signal 15 (SIGTERM)'
2016-08-03 15:01:44 65400 +0200 - cmp02 - 11727/140682450937664 - celery/celery.worker.job - 122 - ERROR - Task ovs.storagerouter.configure_disk[01dbe351-aa53-414f-91ff-c0b2025853cd] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/billiard/pool.py", line 1175, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
/var/log/syslog
Aug 3 15:01:43 cmp02 kernel: [3025534.677653] init: ovs-watcher-framework main process (11671) terminated with status 1
Aug 3 15:01:43 cmp02 kernel: [3025534.677664] init: ovs-watcher-framework main process ended, respawning
Aug 3 15:01:43 cmp02 kernel: [3025534.690557] init: ovs-volumerouter-consumer main process (11673) killed by TERM signal
/var/log/upstart/ovs-watcher-framework.log
2016-08-03 14:48:10 35800 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 9 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:48:26 36100 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 11 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
2016-08-03 14:48:42 36400 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 13 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:48:58 36700 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 15 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:49:14 37100 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 17 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 14:49:15 37200 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 18 - DEBUG - [framework] Volatile store not working correctly
2016-08-03 14:49:15 37200 +0200 - cmp02 - 22759/139997566920512 - extensions/watcher - 19 - DEBUG - [framework] One of the master services is unavailable
2016-08-03 14:49:16 69800 +0200 - cmp02 - 11572/140034382976832 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 14:49:16 69800 +0200 - cmp02 - 11572/140034382976832 - extensions/watcher - 1 - DEBUG - [framework] Waiting for master services
2016-08-03 14:49:17 87100 +0200 - cmp02 - 11572/140034382976832 - extensions/watcher - 2 - DEBUG - [framework] Master services available
2016-08-03 14:49:18 29100 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 14:49:18 29200 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 1 - DEBUG - [framework] Checking master services
2016-08-03 15:00:36 51100 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 3 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:00:52 51400 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 5 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:08 51800 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 7 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:24 52100 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 9 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:40 52300 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 11 - DEBUG - [framework] Error during volatile store test: Raft Internal Error : etcdserver: request timed out
2016-08-03 15:01:41 52500 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 12 - DEBUG - [framework] Volatile store not working correctly
2016-08-03 15:01:41 52500 +0200 - cmp02 - 11682/140360658032448 - extensions/watcher - 13 - DEBUG - [framework] One of the master services is unavailable
2016-08-03 15:01:45 00700 +0200 - cmp02 - 14818/140250522412864 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 15:01:45 00700 +0200 - cmp02 - 14818/140250522412864 - extensions/watcher - 1 - DEBUG - [framework] Waiting for master services
2016-08-03 15:01:46 29000 +0200 - cmp02 - 14818/140250522412864 - extensions/watcher - 2 - DEBUG - [framework] Master services available
2016-08-03 15:01:46 58800 +0200 - cmp02 - 14836/140100687357760 - extensions/watcher - 0 - DEBUG - [framework] Starting service
2016-08-03 15:01:46 58800 +0200 - cmp02 - 14836/140100687357760 - extensions/watcher - 1 - DEBUG - [framework] Checking master services
This is not an issue with the framework, nor with ovs-watcher-framework. The watcher is designed to monitor the services we depend on; when it detects an issue with one of them, it shuts down the framework so that tasks don't start failing at random places. As soon as the services are running again, the watcher automatically starts the framework again.
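For context, the watcher behaves roughly like a probe-and-guard loop: it periodically round-trips a test key through the volatile store (etcd), and after a run of consecutive failures it stops the framework services and lets init respawn it, after which it waits for the master services before starting the framework again (which matches the logs above). Below is a minimal Python sketch of that idea, using a hypothetical key-value client and service names; it is not the actual ovs-watcher-framework code.

import subprocess
import time
import uuid


def volatile_store_test(client):
    # Round-trip a throw-away key to prove the volatile store (etcd) answers.
    key = 'ovs-watcher-test-{0}'.format(uuid.uuid4())
    client.set(key, 'ok')  # hypothetical key-value client API
    assert client.get(key) == 'ok'
    client.delete(key)


def wait_for_master_services(client, interval=1):
    # Block until the volatile store answers again.
    while True:
        try:
            volatile_store_test(client)
            return
        except Exception:
            time.sleep(interval)


def guard_framework(client, services=('ovs-workers',), max_failures=5, interval=16):
    # Start the framework once the store is reachable, then keep probing it.
    wait_for_master_services(client)
    for service in services:
        subprocess.call(['service', service, 'start'])
    failures = 0
    while failures < max_failures:
        time.sleep(interval)
        try:
            volatile_store_test(client)
            failures = 0
        except Exception as ex:
            failures += 1
            print('[framework] Error during volatile store test: {0}'.format(ex))
    # Too many consecutive failures: stop the framework and exit non-zero so
    # init respawns the watcher, which waits for the store before restarting.
    print('[framework] Volatile store not working correctly')
    for service in services:
        subprocess.call(['service', service, 'stop'])
    raise SystemExit(1)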
As discussed with @pploegaert, increasing the timeouts of etcd itself might make it less fragile. Feel free to update this ticket with a more reasonable timeout (one that manages to get through the tests); we can then make that the default value.
openvstorage-2.7.2-rev.3835.9f486e5
Could reproduce the issue on 2.7.2-rev.3833.69ff307-1 with: exec etcd -name SHxPMGoM0OZ5swLd -heartbeat-interval=1000 -election-timeout=5000 and saw Raft Internal Error on multiple nodes.
Could not reproduce the issue on 2.7.2-rev.3867.ec9d46d-1 with: exec etcd -name 8IABN2BZLuJPFj06 -snapshot-count=2500 -heartbeat-interval=2500 -election-timeout=25000 (this is now the default)
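For reference, etcd's general tuning guidance (taken here as an assumption, not a hard rule) is to keep the election timeout at roughly 5-10x the heartbeat interval; the failing setup used a 5x ratio (1000/5000 ms), while the new defaults keep a 10x ratio (2500/25000 ms) plus an explicit -snapshot-count. A small Python sanity check along those lines:

def check_etcd_timeouts(heartbeat_ms, election_ms, min_ratio=10.0):
    # Warn when the election timeout is not comfortably above the heartbeat
    # interval (general etcd tuning guidance, used here as an assumed heuristic).
    ratio = float(election_ms) / heartbeat_ms
    if ratio < min_ratio:
        return 'ratio {0:.1f}x is below {1:.0f}x - risk of spurious leader elections'.format(ratio, min_ratio)
    return 'ratio {0:.1f}x looks reasonable'.format(ratio)


print(check_etcd_timeouts(1000, 5000))    # flags from rev.3833 that still reproduced the issue
print(check_etcd_timeouts(2500, 25000))   # new defaults from rev.3867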
Hyperconverged setups 2.7.2-rev.3833.69ff307-1:
2.7.2-rev.3867.ec9d46d-1:
We saw some problems with ETCD in ovs-watcher-framework