mattshma / bigdata

hadoop,hbase,storm,spark,etc..

Initialization failed for Block pool <registering> (Datanode Uuid unassigned) #24

Closed: mattshma closed this issue 8 years ago

mattshma commented 8 years ago

Some time ago, because the rebalancer was running too slowly, I added a few virtual machine nodes, NFS-mounted data directories from the datanodes with low disk utilization onto those VMs, and ran the rebalancer again; a rough sketch of that setup follows. Once disk utilization was sufficient, I unmounted the NFS shares and restarted those datanodes.
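The temporary setup was roughly as follows (the hostname, export path, and balancer threshold here are illustrative assumptions, not the actual values used):

```bash
# On a temporary VM datanode: serve a lightly used disk of a physical
# datanode over NFS so the balancer writes blocks onto that disk.
mount -t nfs dn-host:/hadoop/dfs1 /hadoop/dfs1

# Run the balancer, then tear the setup down once utilization evens out.
hdfs balancer -threshold 10
umount /hadoop/dfs1
```

After the unmount, restarting these datanodes failed with the following error: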

```
2016-05-06 11:23:54,378 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to NameNodeSlave/192.168.1.13:8022. Exiting.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /hadoop/dfs1/dn is in an inconsistent state: Root /hadoop/dfs1/dn: DatanodeUuid=c03adb14-ea74-4bab-9f6f-028daed124ed, does not match 840f4958-d18a-4b9d-a451-c768595f8304 from other StorageDirectory.
```

The error indicates a DatanodeUuid mismatch, which the VERSION files confirm:

```
[root@jh02-024 ~]# cat /hadoop/dfs1/dn/current/VERSION
#Wed Mar 30 13:48:07 CST 2016
storageID=DS-900ea2fd-487b-4110-a561-6c4664c2c359
clusterID=cluster24
cTime=0
datanodeUuid=c03adb14-ea74-4bab-9f6f-028daed124ed
storageType=DATA_NODE
layoutVersion=-55
[root@jh02-024 ~]# cat /hadoop/dfs/dn/current/VERSION
#Fri May 06 11:18:07 CST 2016
storageID=DS-6103842c-dda8-484d-9ec8-ce1657dd688f
clusterID=cluster24
cTime=0
datanodeUuid=840f4958-d18a-4b9d-a451-c768595f8304
storageType=DATA_NODE
layoutVersion=-55
```

HDFS-5233 gives the definitions of storageID and datanodeUuid:

> StorageID currently identifies both a Datanode and a storage attached to the Datanode. Once we start tracking multiple storages per datanode we would like to deprecate StorageID in favor of a DatanodeUuid for the datanode and a StorageUuid per storage.

In other words, each datanode has a single unique datanodeUuid, and each storage (data directory) on the datanode has its own unique storageID.
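On a healthy node, then, every data directory's VERSION file should carry the same datanodeUuid but a distinct storageID. A quick consistency check (a sketch; the glob assumes this node's /hadoopN/dfs* layout):

```bash
# Every datanodeUuid line should show the same value;
# every storageID line should be unique.
grep -H -e datanodeUuid -e storageID /hadoop*/dfs*/dn/current/VERSION
```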

Further explanation from the code:

> Datanode UUID that this storage is currently attached to. This is the same as the legacy StorageID for datanodes that were upgraded from a pre-UUID version. For compatibility with prior versions of Datanodes we cannot make this field a UUID.

> The registering datanode is a replacement node for the existing data storage, which from now on will be served by a new node. If this message repeats, both nodes might have same storageID by (insanely rare) random chance. User needs to restart one of the nodes with its data cleared (or user can just remove the StorageID value in "VERSION" file under the data directory of the datanode, but this might not work if the VERSION file format has changed.)

So the fix here is to correct the wrong datanodeUuid in the VERSION file and then restart cloudera-scm-agent; a sketch of the edit follows the note below.

Note: do not delete the VERSION file itself, or the files that sit alongside it will be removed on restart. Also, if you change the datanodeUuid but delete the storageID line, startup fails because the VERSION file format is then invalid.
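A minimal sketch of that edit, assuming /hadoop/dfs/dn holds the datanodeUuid the NameNode expects (back up VERSION first; and note the follow-up comment below retracts this whole approach):

```bash
# Assumption: /hadoop/dfs/dn carries the correct datanodeUuid.
good=$(awk -F= '/^datanodeUuid/{print $2}' /hadoop/dfs/dn/current/VERSION)
# Rewrite the mismatched uuid in the other storage directory's VERSION.
sed -i "s/^datanodeUuid=.*/datanodeUuid=$good/" /hadoop/dfs1/dn/current/VERSION
# Restart the agent so it brings the datanode back up.
service cloudera-scm-agent restart
```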

mattshma commented 8 years ago

The method above is wrong.

Take disk A, mounted at /hadoop, with two directories /hadoop/dfs and /hadoop/dfs1: /hadoop/dfs is used by the local datanode, while /hadoop/dfs1 was the directory served over NFS. After unmounting the NFS share, the right move is not to simply change the datanodeUuid under /hadoop/dfs1 and let one DataNode use two directories on the same disk. A DN should use one directory per disk, so the data in the two directories has to be merged; see #4 for the reason.

The proper approach is to move the data within each disk, then restart the datanode service.

The intra-disk migration script:

```bash
#!/bin/bash
# Merge block data from the NFS-served directory (dfs1) back into the
# datanode's own directory (dfs) on each disk: /hadoop, /hadoop1 .. /hadoop11.
for k in {0..11}
do
  if [ $k -eq 0 ]; then
    k=""    # disk 0 is mounted at /hadoop; the others at /hadoop1../hadoop11
  fi
  rbwdir=dn/current/BP-1471860497-10.2.72.29-1421306158975/current/rbw
  ddir=dn/current/BP-1471860497-10.2.72.29-1421306158975/current/finalized
  for i in {0..64}
  do
    echo "/hadoop$k: finalized/subdir$i"
    dndir=${ddir}/subdir$i
    for j in {0..64}
    do
      dnsdir=${dndir}/subdir$j
      # Second-level subdir: merge into an existing target, otherwise
      # move the whole subdirectory across.
      if [ -d /hadoop$k/dfs1/$dnsdir ] && [ "$(ls -A /hadoop$k/dfs1/$dnsdir)" ]; then
        if [ -d /hadoop$k/dfs/$dnsdir ]; then
          mv /hadoop$k/dfs1/$dnsdir/blk_* /hadoop$k/dfs/$dnsdir
        else
          mv /hadoop$k/dfs1/$dnsdir /hadoop$k/dfs/$dndir
        fi
      fi
    done
    # Blocks sitting directly in the first-level subdir.
    if [ -d /hadoop$k/dfs1/$dndir ] && [ "$(ls -A /hadoop$k/dfs1/$dndir)" ]; then
      mv /hadoop$k/dfs1/$dndir/blk_* /hadoop$k/dfs/$dndir
    fi
  done
  # Blocks directly under finalized/.
  if [ -d /hadoop$k/dfs1/$ddir ] && [ "$(ls -A /hadoop$k/dfs1/$ddir)" ]; then
    mv /hadoop$k/dfs1/$ddir/blk_* /hadoop$k/dfs/$ddir
  fi
  # Replicas being written (rbw), moved from the dfs1 source as well.
  if [ -d /hadoop$k/dfs1/$rbwdir ] && [ "$(ls -A /hadoop$k/dfs1/$rbwdir)" ]; then
    mv /hadoop$k/dfs1/$rbwdir/* /hadoop$k/dfs/$rbwdir
  fi
done
```
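Before restarting the datanode, a quick sanity check along these lines (a sketch, using the same path layout as above) confirms nothing was left behind:

```bash
# A non-zero count means some blocks were not migrated out of dfs1.
find /hadoop*/dfs1 -name 'blk_*' | wc -l
```

If the count is zero, it should be safe to restart the datanode service.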