openvstorage / arakoon

The consistent distributed key-value store in Open vStorage.
http://arakoon.org/
Apache License 2.0

Arakoon dies when --copy-db-to-head is run while --optimize-db is already running #203

Open jtorreke opened 7 years ago

jtorreke commented 7 years ago

The scenario: run --optimize-db on a running member. While it is still executing, launch --copy-db-to-head on the same instance. Both commands bail out and the arakoon member crashes.

Output of the --optimize-db command:

root@NY1SRV0008:/mnt/ssd1/arakoon/ny1-hddbackend04-nsm_04# arakoon --optimize-db ny1-hddbackend04-nsm_04 172.17.16.11 26450
Uncaught exception:

  End_of_file

Raised at file "src/core/lwt.ml", line 805, characters 16-23
Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Called from file "src/main/arakoon.ml" (inlined), line 517, characters 21-136
Called from file "src/main/arakoon.ml", line 612, characters 7-23
Called from file "src/main/arakoon.ml", line 626, characters 9-16
root@NY1SRV0008:/mnt/ssd1/arakoon/ny1-hddbackend04-nsm_04#

Arakoon's log file:

2017-08-31 10:43:39 632236 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078398 - info - copy_db_to_head tlogs_to_keep:10
2017-08-31 10:43:39 632248 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078399 - info - quiesce_db: Pushing quiesce request
2017-08-31 10:43:39 632255 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078400 - info - quiesce_db: waiting for quiesce request to be completed
2017-08-31 10:43:39 632306 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078401 - fatal - Exception in fsm thread: (Failure "Store already quiesced. Blocking second attempt")
2017-08-31 10:43:39 632353 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078402 - fatal - going down: (Failure "Store already quiesced. Blocking second attempt")
2017-08-31 10:43:39 632360 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078403 - fatal - after pick
2017-08-31 10:43:39 632425 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078404 - info - going to drop outgoing connection as well: Lwt.Canceled
2017-08-31 10:43:39 632449 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078405 - info - going to drop outgoing connection as well: Lwt.Canceled
2017-08-31 10:43:39 632462 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078406 - info - going to drop outgoing connection as well: Lwt.Canceled
2017-08-31 10:43:39 632469 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078407 - info - waiting for 3 client_threads
2017-08-31 10:43:39 632521 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078408 - info - waiting for 2 client_threads
2017-08-31 10:43:39 632612 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078409 - warning - exception while closing, too little too late: Unix.Unix_error(Unix.EBADF, "check_descriptor", "")
2017-08-31 10:43:39 632641 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078410 - warning - exception while closing, too little too late: Unix.Unix_error(Unix.EBADF, "check_descriptor", "")
2017-08-31 10:43:39 632652 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078411 - warning - exception while closing, too little too late: Unix.Unix_error(Unix.EBADF, "check_descriptor", "")
2017-08-31 10:43:39 632672 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078412 - info - messaging_172.17.16.11_62: closing
2017-08-31 10:43:39 632686 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078413 - info - messaging_172.17.16.11_50: closing
2017-08-31 10:43:39 632696 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078414 - info - messaging_172.17.16.11_57: closing
2017-08-31 10:43:39 632745 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078415 - info - Exception in client thread messaging_172.17.16.11_62: Lwt.Canceled
2017-08-31 10:43:39 632755 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078416 - info - Exception in client thread messaging_172.17.16.11_50: Lwt.Canceled
2017-08-31 10:43:39 632762 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078417 - info - Exception in client thread messaging_172.17.16.11_57: Lwt.Canceled
2017-08-31 10:43:39 632775 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078418 - info - waiting for 1 client_threads
2017-08-31 10:43:39 632796 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078419 - info - shutting down server on port 26451
2017-08-31 10:43:41 070204 -0400 - NY1SRV0008 - 7929/0000 - arakoon - 2078420 - info - Crash log dumped
ovs-arakoon-ny1-hddbackend04-nsm_04.service: Main process exited, code=exited, status=1/FAILURE

wimpers commented 6 years ago

Set to Roadmap, as this should only occur when the commands are run manually. We already protect against this internally. Marking it "Won't fix" might be a bit too drastic in this case.
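
For anyone who does run these commands by hand, one possible operator-side guard is to serialize them behind an exclusive lock file, so that a second maintenance command refuses to start instead of reaching a member that is already quiesced. The sketch below is only an illustration of that idea and is not the internal protection mentioned above: `maintenance_lock`, `run_maintenance`, `MaintenanceBusy`, and the lock file path are hypothetical names, and it assumes a Unix host where `fcntl.flock` is available.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


class MaintenanceBusy(RuntimeError):
    """Another maintenance command already holds the lock (hypothetical helper)."""


@contextmanager
def maintenance_lock(path):
    # One lock file per member; the path is an assumption, not an Arakoon convention.
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting for the holder.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise MaintenanceBusy(path) from None
        yield
    finally:
        # Closing the descriptor also releases the lock if we held it.
        os.close(fd)


def run_maintenance(lock_path, argv):
    """Run one maintenance command while holding the lock; return its exit code."""
    with maintenance_lock(lock_path):
        return subprocess.call(argv)
```

A call could look like `run_maintenance(lock_path, ["arakoon", "--optimize-db", cluster_id, ip, port])`, using the argument form shown in the report. While that call is running, a second `run_maintenance` with the same `lock_path` raises `MaintenanceBusy` and never contacts the member. This only covers commands launched through the wrapper on the same host.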