mattshma closed this issue 8 years ago
Deleting /hbase in ZooKeeper failed with: Node does not exist: /hbase/replication/rs/. Deleted the child znodes under /hbase/replication/rs/ one by one, then deleting /hbase succeeded.
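The znode cleanup above can be scripted with zkCli.sh. This is only a sketch: the quorum host is one of the hostnames from the log, the output parsing of zkCli.sh is fragile, and `rmr` is the ZooKeeper 3.4.x recursive delete (newer releases call it `deleteall`).

```shell
# Quorum member taken from the log; adjust to your ensemble.
ZK=NameNodeMaster:2181

# Older ZooKeeper cannot delete a non-empty znode, so remove the
# per-regionserver children of /hbase/replication/rs first.
for rs in $(echo "ls /hbase/replication/rs" | zkCli.sh -server "$ZK" 2>/dev/null | tail -1 | tr -d '[],'); do
  echo "delete /hbase/replication/rs/$rs" | zkCli.sh -server "$ZK"
done

# With the children gone, /hbase itself can be removed recursively.
echo "rmr /hbase" | zkCli.sh -server "$ZK"
```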
Restarted the cluster. The region count could be seen slowly climbing back up over time. Problem solved.
Then another problem appeared:
16/04/13 17:13:25 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 00bd9257d29291cac5913653076aab5f, NAME => 'imsdk_private_message,135-2171310277-4627248500237336577-000000000054,1459855573110.00bd9257d29291cac5913653076aab5f.', STARTKEY => '135-2171310277-4627248500237336577-000000000054', ENDKEY => '135-2171310292-4627389251458105345-000000000927'}
16/04/13 17:13:26 INFO client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/04/13 17:13:26 INFO client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152b5392f430db7
16/04/13 17:13:26 INFO zookeeper.ZooKeeper: Session: 0x152b5392f430db7 closed
16/04/13 17:13:26 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Region {ENCODED => 00bd9257d29291cac5913653076aab5f, NAME => 'imsdk_private_message,135-2171310277-4627248500237336577-000000000054,1459855573110.00bd9257d29291cac5913653076aab5f.', STARTKEY => '135-2171310277-4627248500237336577-000000000054', ENDKEY => '135-2171310292-4627389251458105345-000000000927'} failed to move out of transition within timeout 120000ms
at org.apache.hadoop.hbase.util.HBaseFsckRepair.waitUntilAssigned(HBaseFsckRepair.java:139)
at org.apache.hadoop.hbase.util.HBaseFsck.tryAssignmentRepair(HBaseFsck.java:1732)
at org.apache.hadoop.hbase.util.HBaseFsck.checkRegionConsistency(HBaseFsck.java:1873)
at org.apache.hadoop.hbase.util.HBaseFsck.checkAndFixConsistency(HBaseFsck.java:1559)
at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:465)
at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:484)
at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:4032)
at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:3841)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3829)
and
16/04/13 17:22:37 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 52de7a36375d495194ce4f38494dba3d, NAME => 'snsgz_log,2033310844_13038_16993216120709551398,1456143017426.52de7a36375d495194ce4f38494dba3d.', STARTKEY => '2033310844_13038_16993216120709551398', ENDKEY => '2033310844_15900_16993135666709551602'}
ERROR: Region { meta => snsgz_log,2072310357_12105_16988486054709551539,1459421803327.53c2a5d20732a6b33f486e448179b9d2., hdfs => hdfs://hadoop/hbase/data/default/snsgz_log/53c2a5d20732a6b33f486e448179b9d2, deployed => } not deployed on any region server.
Trying to fix unassigned region...
16/04/13 17:22:38 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181 sessionTimeout=60000 watcher=catalogtracker-on-hconnection-0x50bba944, quorum=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181, baseZNode=/hbase
16/04/13 17:22:39 INFO zookeeper.RecoverableZooKeeper: Process identifier=catalogtracker-on-hconnection-0x50bba944 connecting to ZooKeeper ensemble=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: Opening socket connection to server DataNode0003/10.2.72.26:2181. Will not attempt to authenticate using SASL (unknown error)
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: Socket connection established to DataNode0003/10.2.72.26:2181, initiating session
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: Session establishment complete on server DataNode0003/10.2.72.26:2181, sessionid = 0x452b53930550f8e, negotiated timeout = 60000
16/04/13 17:22:39 INFO zookeeper.ZooKeeper: Session: 0x452b53930550f8e closed
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: EventThread shut down
16/04/13 17:22:39 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 53c2a5d20732a6b33f486e448179b9d2, NAME => 'snsgz_log,2072310357_12105_16988486054709551539,1459421803327.53c2a5d20732a6b33f486e448179b9d2.', STARTKEY => '2072310357_12105_16988486054709551539', ENDKEY => '2072310358_11095_16988216153709551609'}
ERROR: Region { meta => snsgz_log,2033310800_24795_1699380242,1454163437210.541bcb6ae3c62dd5d2390d6a17284de9., hdfs => hdfs://hadoop/hbase/data/default/snsgz_log/541bcb6ae3c62dd5d2390d6a17284de9, deployed => } not deployed on any region server.
Trying to fix unassigned region...
After manually assigning the problematic regions and running hbase hbck -repair again, the issue was resolved.
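The manual fix just described, sketched as shell commands. The encoded region names are examples taken from the hbck output above; substitute the regions hbck reports as stuck.

```shell
# Assign the stuck regions by their encoded names (examples from the
# log above), then let hbck repair the remaining inconsistencies.
hbase shell <<'EOF'
assign '00bd9257d29291cac5913653076aab5f'
assign '52de7a36375d495194ce4f38494dba3d'
EOF

hbase hbck -repair
```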
For some of the regions stuck in the "Region still in transition, waiting for it to become assigned" state, the assign command had no effect; investigation showed their HFiles no longer existed. On hmaster:60010/table.jsp?name=MYTABLE, the regions whose HFiles were lost showed as not deployed.
Looking at those regions' rows in META, only the info:regioninfo column was present and info:server was missing. Since the data was gone anyway, I deleted the region rows from META:
# hbase shell
> get 'hbase:meta', 'REGION_NAME'
> delete 'hbase:meta', 'REGION_NAME', 'info:regioninfo'
UPDATE: hbck also reads the .regioninfo file on HDFS, so even after the row is deleted from hbase:meta, the information in .regioninfo survives and the region shows up again during repair.
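hbck rebuilds region state from the .regioninfo file stored in each region directory under hbase.rootdir, which is why the region keeps coming back. A sketch for inspecting that file (the table and encoded region name below are examples from the log); moving the directory aside is destructive, so only do it once the data is confirmed lost and a backup copy exists.

```shell
TABLE=snsgz_log                            # example table from the log
REGION=541bcb6ae3c62dd5d2390d6a17284de9    # example encoded region name

# The serialized region info that hbck reads lives here:
hdfs dfs -ls /hbase/data/default/$TABLE/$REGION/.regioninfo

# If the region's data is confirmed gone, sideline the whole region
# directory so repair stops resurrecting it (keep a copy first):
hdfs dfs -mv /hbase/data/default/$TABLE/$REGION /hbase/.sideline-$REGION
```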
Continuing from the above: after running repair a few more times, it reported
ERROR: Region { meta => null, hdfs => hdfs://hadoop/hbase/data/default/imsdk_group_message/1118895dec49d8aa957eeb819e12a047, deployed => } on HDFS, but not listed in hbase:meta or deployed on any region server
ERROR: Region { meta => imsdk_group_message,135-2171310278-3-4627406830249771009-000000015814,1461725715049.3bedd318a310246654a783d0058d60bf., hdfs => null, deployed => } found in META, but not in HDFS or deployed on any region server.
ERROR: Found lingering reference file hdfs://hadoop/hbase/data/default/snsgz_log/e7e05884594e8a6b821aeee748d21513/info/a5f57cbff1724924b65b9c7bb3878027.f5e336212ec01cce70623ef0b640aa69
ERROR: Found lingering reference file hdfs://hadoop/hbase/data/default/snsgz_log/e7e05884594e8a6b821aeee748d21513/info/0bb353c43a5f4c66b324030e9b2f2f67.f5e336212ec01cce70623ef0b640aa69
ERROR: Found lingering reference file hdfs://hadoop/hbase/data/default/imsdk_group_message/1118895dec49d8aa957eeb819e12a047/content/6dcf3bbdee744cc0bd65dab3bd836ff5.3f24581d80828508aaf60f3dc84014bf
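A lingering reference file is a leftover pointer from a region split: its name is the referenced HFile's name, a dot, and the encoded name of the pre-split parent region. Splitting the name (shown here on one of the file names from the errors above) identifies which parent region the dangling reference points at; hbck versions that support it can also sideline these files with -fixReferenceFiles.

```shell
# Reference file name format: <hfile-name>.<parent-region-encoded-name>
ref="a5f57cbff1724924b65b9c7bb3878027.f5e336212ec01cce70623ef0b640aa69"

hfile="${ref%%.*}"    # name of the HFile in the parent region
parent="${ref##*.}"   # encoded name of the pre-split parent region

echo "hfile=$hfile"
echo "parent=$parent"
```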
For the regions whose META rows had only info:regioninfo and no info:server, no amount of hbck repair helped. As a last resort I deleted /hbase from ZooKeeper again and restarted the cluster. The first repair afterwards still showed the problem; after several more repair runs it was resolved.
One question remains: why was the region missing from HDFS? Were the files actually lost, or did a metadata update fail during a split?
Checked the thrift2 log, as follows:
When running hbase hbck, the error was: