portworx / px-dev

PX-Developer is scale-out storage for containers. Run Cassandra, Jenkins, or any application in Docker, with enterprise storage functionality on commodity servers

recover node after reboot #26

Closed maikotz closed 7 years ago

maikotz commented 7 years ago

Hi there,

After a reboot, my px-dev host can't start the storage and is stuck in "initializing". Is there any way to force quorum and start this host?

I'm running px version: pxctl version 1.1.6-cb1bbeb

In the px-dev container I get some warnings about locks, and after a while the container restarts:

time="2017-04-26T20:46:11Z" level=warning msg="Lock pwx/5ac2ed6f-7e4e-4e1d-8e8c-3a6df1fb61a7/storage/locks/1056637145994285341.lock locked for 285 seconds, tag: {NodeID:94e7b4e6-6bf8-41a5-bfa0-45bfe6afdd22,FuncID:postVolumeUsage}"

time="2017-04-26T20:46:25Z" level=warning msg="Failed to acquire kvdb lock: Key already exists" Error="Key already exists" Function=volumeStateHandler Notification=&{281474976710698 751556 {1056637145994285341 0 0 2 [0xc8209b6c80] false false map[]}}
time="2017-04-26T20:46:25Z" level=warning msg="VolumeState: {281474976710698 751556 {1056637145994285341 0 0 2 [0xc8209b6c80] false false map[]}}" Error="Failed to acquire kvdb lock: Key already exists" Function=processNotification Notification=&{64 0001-01-01 00:00:00 +0000 UTC 0 {281474976710698 751556 {1056637145994285341 0 0 2 [0xc8209b6c80] false false map[]}}}
time="2017-04-26T20:46:25Z" level=warning Driver=kernel Function=VolumeNotifyFail OpId=281474976710698 Status=-1
time="2017-04-26 20:46:25Z" level=INFO msg="void ReplicationSet::block_state_notify_failed(uint64_t) token: 281474976710698"
time="2017-04-26 20:46:25Z" level=INFO msg="update_cdb: dev: 1056637145994285341 rset: 0 node[ 0 ] curr[ 0 ] next[ 0 ] new_rset [ empty ] remove [ empty ] pool_ids [ 0 ]  new_pool_ids [ empty ]"
time="2017-04-26T20:46:25Z" level=info msg="volumeStateHandler update" AbortOnError=false BackgroundProcessing=false Driver=pxd Error=<nil> Format=FS_TYPE_EXT4 Function=d.volumePut ID=657389255672230477 State=VOLUME_STATE_ATTACHED Version=751393
time="2017-04-26T20:46:25Z" level=info msg="Action: 2 data <nil>" AttachedOn=94e7b4e6-6bf8-41a5-bfa0-45bfe6afdd22 Driver=kernel Error=<nil> Function=VolumeStateChange ID=657389255672230477 State=VOLUME_STATE_ATTACHED Version=751675
time="2017-04-26T20:46:25Z" level=error msg="Failed to acquire kvdb lock Key already exists" Driver=pxd Error=<nil> Function=refreshAttachInfo
time="2017-04-26T20:46:25Z" level=error msg="Unable to start node.  Error while loading volume pxd because of: Key already exists"
time="2017-04-26T20:46:25Z" level=warning msg="Failed to initialize Join PX Storage Service: Key already exists"
time="2017-04-26T20:46:25Z" level=error msg="Failed to join cluster. Key already exists"
time="2017-04-26T20:46:25Z" level=error msg="Could not start cluster manager because of: Key already exists"
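For anyone triaging similar logs, here is a minimal sketch that pulls the lock key, the time it has been held, and the holding node out of the "locked for N seconds" warnings. It assumes the exact message format shown in the excerpt above; `stale_locks` is a hypothetical helper name, not part of pxctl.

```shell
# Hypothetical helper: reads PX log lines on stdin and prints
# "<lock key> <seconds held> <NodeID>" for each "locked for N seconds" warning.
# Assumes the message format shown in the log excerpt above.
stale_locks() {
  sed -n 's/.*msg="Lock \([^ ]*\) locked for \([0-9][0-9]*\) seconds, tag: {NodeID:\([^,]*\),.*/\1 \2 \3/p'
}
```

It could be fed from the container logs, for example `docker logs <px-container> 2>&1 | stale_locks`, where `<px-container>` is a placeholder for whatever the PX container is called on the host.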

pxctl status says:

$ sudo /opt/pwx/bin/pxctl status
Status: PX is initializing...
Node ID: 94e7b4e6-6bf8-41a5-bfa0-45bfe6afdd22
    IP: 10.10.23.92
    Local Storage Pool: 1 pool
    POOL  IO_PRIORITY  SIZE     USED     STATUS  ZONE     REGION
    0     HIGH         750 GiB  9.0 GiB  Online  default  default
    Local Storage Devices: 1 device
    Device  Path        Media Type               Size     Last-Scan
    0:1     /dev/loop0  STORAGE_MEDIUM_MAGNETIC  750 GiB  26 Apr 17 20:36 UTC
    total   -                                    750 GiB
Cluster Summary
    Cluster ID: 5ac2ed6f-7e4e-4e1d-8e8c-3a6df1fb61a7
    IP           ID                                    Used  Capacity  Status
    10.10.23.92  94e7b4e6-6bf8-41a5-bfa0-45bfe6afdd22  0 B   0 B       Initializing (This node)
Global Storage Pool
    Total Used      :  0 B
    Total Capacity  :  0 B

cluster list:

$ sudo /opt/pwx/bin/pxctl c l
Cluster ID: 5ac2ed6f-7e4e-4e1d-8e8c-3a6df1fb61a7
Status: Not in Quorum

Nodes in the cluster:
ID                                    DATA IP      CPU    MEM TOTAL  MEM FREE  CONTAINERS  VERSION        STATUS
94e7b4e6-6bf8-41a5-bfa0-45bfe6afdd22  10.10.23.92  0.425  68 GB      65 GB     N/A         1.1.6-cb1bbeb  Initializing

Is this because it is not a 3-node cluster and therefore no quorum can be reached, or is it just a side effect of pwx being unable to initialize the storage because of the .lock files?
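As a point of reference for the quorum question: under a plain majority rule, a cluster of N nodes needs floor(N/2) + 1 members to agree. Whether PX computes quorum this way is an assumption here and is not confirmed anywhere in this thread. The sketch below is only the generic arithmetic, and `quorum_size` is a hypothetical name.

```shell
# Majority quorum size for a cluster of N nodes: floor(N/2) + 1.
# Generic majority arithmetic, not a statement about PX internals.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

quorum_size 1   # prints 1
quorum_size 3   # prints 2
```

By this arithmetic a single-node cluster would need only itself, which would point to the lock errors rather than the node count, but that reading depends on the assumption above.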

maikotz commented 7 years ago

Just some quick feedback on this one.

Inspecting the pwx container didn't really help: I could not find out where the .lock files were located, so I could not delete them and try to initialise the node manually. Even deleting the BTRFS subvolume didn't help, so I looked into just recovering the data from the test containers.

This was surprisingly easy, as everything is based on BTRFS. I mounted my device (/dev/loop here) and found all the created volumes there, each of which I could manually mount as ext4 to save the data. The data didn't really matter to me, but this might be helpful for someone else in need.

sudo mount /data/storage.bin /mnt
cd /mnt   # the loop below globs the volume directories relative to the mount point
mkdir -p /tmp/mnts
for i in [0-9]*; do mkdir -p "/tmp/mnts/${i}"; sudo mount "${i}/pxdev" "/tmp/mnts/${i}"; done
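
If you want to check what the loop above would do before mounting anything, the same idea can be wrapped in a function with a dry-run switch. This is only a sketch: `mount_px_volumes` and `DRY_RUN` are hypothetical names, the layout (numeric volume directories that each contain a `pxdev` image) is taken from the commands above, and mounting read-only is a precaution added by this sketch rather than something the original commands did.

```shell
# Hypothetical helper: mount every <id>/pxdev image found under SRC at DEST/<id>.
# With DRY_RUN=1 it only prints the mount commands instead of running them.
# Assumes numeric volume directories under SRC, each holding a "pxdev" image.
mount_px_volumes() {
  src=$1
  dest=$2
  for dir in "$src"/[0-9]*; do
    [ -e "$dir/pxdev" ] || continue   # skip directories without a pxdev image
    id=$(basename "$dir")
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "mount -o ro $dir/pxdev $dest/$id"
    else
      mkdir -p "$dest/$id"
      sudo mount -o ro "$dir/pxdev" "$dest/$id"
    fi
  done
}
```

For example, `DRY_RUN=1 mount_px_volumes /mnt /tmp/mnts` lists the mounts without touching anything, and dropping `DRY_RUN=1` performs them.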

venkatpx commented 7 years ago

@maikotz Sorry for the delay in getting to this. Do you still have the logs for the px container in this case?