openvstorage / arakoon

The consistent distributed key-value store in Open vStorage.
http://arakoon.org/
Apache License 2.0

Arakoon services in start/fail/start loop and dumping many debug logs to /opt/OpenvStorage #194

Open jake9050 opened 7 years ago

jake9050 commented 7 years ago

After updating pocops to the latest Fargo release, the arakoon services get stuck in a loop where they constantly restart. The ovs home folder gets populated with files called console:.debug.TIMESTAMP.xxxxxx that contain messages like these:

1502284724: main debug: 7679026 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679025") -> new_i:7679026
1502284724: main debug: 7679027 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679026") -> new_i:7679027
1502284724: main debug: 7679028 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679027") -> new_i:7679028
1502284724: main debug: after_block
1502284724: main debug: _fold_blocks
1502284724: main info: Completed replay of 1535.tlx, took 0.502017 seconds, 1 to go
1502284724: main info: Replaying tlog file: 1536.tlog [7679030,...] (2/2)
1502284724: tlog_map debug: fold_read extension=.tlog => index':Some {filename="/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog";mapping=}
1502284724: main debug: U.fold 7679029 Some ("7679092") ~index:Some {filename="/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog";mapping=}
1502284724: main debug: maybe_fast_forward 7679029 with Some {filename="/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog";mapping=}
1502284724: main debug: 7679029 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679028") -> new_i:7679029
1502284724: main debug: 7679030 => skip
1502284724: main debug: 7679029 => store
1502284724: tlog_map debug: filename:/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog(Failure "update 7679029, store @ 7679029 don't fit")
1502284724: main fatal: going down(Failure "update 7679029, store @ 7679029 don't fit")
1502284724: main fatal: after pick
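To triage many of these debug files at once, the failure text in the excerpt above can be pulled out with a small script. This is a minimal sketch, not part of Arakoon: the function name `find_replay_failures` is hypothetical, and the regular expression assumes the failure message always has the exact wording shown in this excerpt.

```python
import re

# Matches the failure text exactly as it appears in the excerpt above.
# Assumption: other Arakoon versions may word this message differently.
FAILURE_RE = re.compile(r'Failure "update (\d+), store @ (\d+) don\'t fit"')


def find_replay_failures(log_text):
    """Return the sorted, de-duplicated (update_i, store_i) pairs found in log_text."""
    return sorted({(int(u), int(s)) for u, s in FAILURE_RE.findall(log_text)})


# Two lines copied from the excerpt above.
sample = (
    "1502284724: main debug: 7679029 => store\n"
    '1502284724: main fatal: going down(Failure "update 7679029, '
    "store @ 7679029 don't fit\")\n"
)

failures = find_replay_failures(sample)
print(failures)
```

In the excerpt, the update index and the store index in the failure are both 7679029, and the log shows `7679029 => store` twice. Read at face value, that suggests the same update was offered to the store a second time during replay of 1536.tlog, but that reading is an inference from the log, not a confirmed diagnosis.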

This eventually fills the disk, which causes further problems.
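To see how much space the debug dumps are taking before the disk fills up, something like the following could be used. It is a rough sketch under assumptions: the helper name `debug_dump_usage` is hypothetical, the glob pattern is derived from the `console:.debug.TIMESTAMP.xxxxxx` naming described above, and `/opt/OpenvStorage` is taken from the issue title as the ovs home folder.

```python
import glob
import os


def debug_dump_usage(directory, pattern="console:.debug.*"):
    """Return (file_count, total_bytes) for debug dump files in `directory`.

    The default pattern is an assumption based on the file names
    reported in this issue.
    """
    # glob.escape protects against glob metacharacters in the directory name.
    paths = glob.glob(os.path.join(glob.escape(directory), pattern))
    files = [p for p in paths if os.path.isfile(p)]
    return len(files), sum(os.path.getsize(p) for p in files)


if __name__ == "__main__":
    # Assumed location of the ovs home folder, taken from the issue title.
    count, total = debug_dump_usage("/opt/OpenvStorage")
    print(f"{count} debug files, {total / 1e6:.1f} MB")
```

The script only reports usage; it deliberately does not delete anything, since the dumps may still be needed to diagnose the restart loop.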

System info

OS: Ubuntu 16.04.3 LTS

OVS components

`ii alba 1.3.14 amd64 the ALternative BAckend
ii arakoon 1.9.17 amd64 Simple consistent distributed key/value store
ii openvstorage 2.8.2-1 amd64 openvStorage
ii openvstorage-backend 1.8.1-1 amd64 openvStorage Backend plugin
ii openvstorage-backend-core 1.8.1-1 amd64 openvStorage Backend plugin core
ii openvstorage-backend-webapps 1.8.1-1 amd64 openvStorage Backend plugin Web Applications
ii openvstorage-core 2.8.2-1 amd64 openvStorage core
ii openvstorage-hc 1.8.1-1 amd64 openvStorage Backend plugin HyperConverged
ii openvstorage-health-check 3.2.0-fargo.3-1 amd64 Open vStorage HealthCheck
ii openvstorage-sdm 1.7.1-1 amd64 Open vStorage Backend ASD Manager
ii openvstorage-webapps 2.8.2-1 amd64 openvStorage Web Applications`

wimpers commented 7 years ago

@jake9050 any idea why Arakoon was stuck in a start/fail/start loop?

wimpers commented 7 years ago

@jtorreke any idea why Arakoon acted up?

jtorreke commented 7 years ago

It was lagging too far behind and could no longer catch up from the other cluster members. The solution was to throw out the local data and start from a fresh copy.

wimpers commented 7 years ago

@jtorreke was the root cause of not being able to catch up that the messages were too big?