mattshma / bigdata

hadoop,hbase,storm,spark,etc..

hbase inconsistency #13

Closed mattshma closed 8 years ago

mattshma commented 8 years ago

The thrift2 log shows the following:

16/04/02 15:39:35 INFO client.AsyncProcess: #24, table=snsgz_log, attempt=10/35 failed 123 ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region snsgz_log,2021310323_19873_169978699469399,1451498328433.2bbccd941ba82143c8af9f0f53874ca5. is not online on 10-2-96-38.dn-hadoop-platform.dh.idccom,60.020,1459582403603
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2673)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4107)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3341)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29503)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:745)
 on 10-2-96-38.dn-hadoop-platform.dh.idc.com,60020,1457939257213, tracking started Sat Apr 02 15:39:06 CST 2016, retrying after 10065 ms, replay 123 ops.
16/04/02 15:39:35 INFO client.AsyncProcess: #19, waiting for some tasks to finish. Expected max=0, tasksSent=44, tasksDone=35, currentTasksDone=35, retries=42 hasError=false, tableName=snsgz_log

Running hbase hbck reports the following errors:

ERROR: hbase:meta is not found on any region.
ERROR: hbase:meta table is not consistent. Run HBCK with proper fix options to fix hbase:meta inconsistency. Exiting...
16/04/02 16:47:49 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
...
16/04/02 16:39:32 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.2.96.44:49503, remote=/10.2.96.4:50010, for file /hbase/data/default/snsgz_log/f2fac5f7a10d33cbfc3db8783ed0d9bc/.regioninfo, for pool BP-1471860497-10.2.72.29-1421306158975 block 1236581040_162846066
        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.hbase.HRegionInfo.parseFrom(HRegionInfo.java:1090)
        at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.loadRegionInfoFileContent(HRegionFileSystem.java:714)
        at org.apache.hadoop.hbase.util.HBaseFsck.loadHdfsRegioninfo(HBaseFsck.java:875)
        at org.apache.hadoop.hbase.util.HBaseFsck.access$2300(HBaseFsck.java:169)
        at org.apache.hadoop.hbase.util.HBaseFsck$WorkItemHdfsRegionInfo.call(HBaseFsck.java:3501)
        at org.apache.hadoop.hbase.util.HBaseFsck$WorkItemHdfsRegionInfo.call(HBaseFsck.java:3485)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/04/02 16:39:32 WARN hdfs.DFSClient: Failed to connect to /10.2.96.4:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/10.2.96.44:49503, remote=/10.2.96.4:50010, for file /hbase/data/default/snsgz_log/f2fac5f7a10d33cbfc3db8783ed0d9bc/.regioninfo, for pool BP-1471860497-10.2.72.29-1421306158975 block 1236581040_162846066
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.2.96.44:49503, remote=/10.2.96.4:50010, for file /hbase/data/default/snsgz_log/f2fac5f7a10d33cbfc3db8783ed0d9bc/.regioninfo, for pool BP-1471860497-10.2.72.29-1421306158975 block 1236581040_162846066
        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.hbase.HRegionInfo.parseFrom(HRegionInfo.java:1090)
        at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.loadRegionInfoFileContent(HRegionFileSystem.java:714)
        at org.apache.hadoop.hbase.util.HBaseFsck.loadHdfsRegioninfo(HBaseFsck.java:875)
        at org.apache.hadoop.hbase.util.HBaseFsck.access$2300(HBaseFsck.java:169)
        at org.apache.hadoop.hbase.util.HBaseFsck$WorkItemHdfsRegionInfo.call(HBaseFsck.java:3501)
        at org.apache.hadoop.hbase.util.HBaseFsck$WorkItemHdfsRegionInfo.call(HBaseFsck.java:3485)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/04/02 16:39:32 INFO hdfs.DFSClient: Successfully connected to /10.2.96.36:50010 for BP-1471860497-10.2.72.29-1421306158975:blk_1236581040_162846066
...
ERROR: Region { meta => snsgz_log,2033310748_12157_16997280660709551244,1452043911093.0a5284370188266a81ef8810e6499810., hdfs => hdfs://hadoop/hbase/data/default/snsgz_log/0a5284370188266a81ef8810e6499810, deployed =>  } not deployed on any region server.
Trying to fix unassigned region...
16/04/02 16:34:39 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181 sessionTimeout=60000 watcher=catalogtracker-on-hconnection-0xacda305, quorum=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181, baseZNode=/hbase
16/04/02 16:34:39 INFO zookeeper.RecoverableZooKeeper: Process identifier=catalogtracker-on-hconnection-0xacda305 connecting to ZooKeeper ensemble=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181
16/04/02 16:34:39 INFO zookeeper.ClientCnxn: Opening socket connection to server DataNode0003/10.2.72.26:2181. Will not attempt to authenticate using SASL (unknown error)
16/04/02 16:34:39 INFO zookeeper.ClientCnxn: Socket connection established to DataNode0003/10.2.72.26:2181, initiating session
16/04/02 16:34:39 INFO zookeeper.ClientCnxn: Session establishment complete on server DataNode0003/10.2.72.26:2181, sessionid = 0x452b5393054553c, negotiated timeout = 60000
16/04/02 16:34:39 INFO zookeeper.ZooKeeper: Session: 0x452b5393054553c closed
16/04/02 16:34:39 INFO zookeeper.ClientCnxn: EventThread shut down
ERROR: Region { meta => snsgz_log,2033310632_13958_1700048359518092,1450102397273.0a61448b0cf1be70793d9a1549fc8bb3., hdfs => hdfs://hadoop/hbase/data/default/snsgz_log/0a61448b0cf1be70793d9a1549fc8bb3, deployed =>  } not deployed on any region server.
Trying to fix unassigned region...

...
INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => c83b038828ffe4400e4c48dcddd79bcb, NAME => 'snsgz_log,2021310212_20522_1700335458701255,1450197643183.c83b038828ffe4400e4c48dcddd79bcb.', STARTKEY => '2021310212_20522_1700335458701255', ENDKEY => '2021310212_21937_170060207778164'}
...
16/04/02 16:33:01 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0a669e2c97b251fa5927c2af3db342b1, NAME => 'snsgz_log,2033310272_11167_170095583277971,1442621808680.0a669e2c97b251fa5927c2af3db342b1.', STARTKEY => '2033310272_11167_170095583277971', ENDKEY => '2033310272_12912_170151266520407'}
16/04/02 16:33:02 INFO client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/04/02 16:33:02 INFO client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x552b539305b4db1
16/04/02 16:33:02 INFO zookeeper.ZooKeeper: Session: 0x552b539305b4db1 closed
16/04/02 16:33:02 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Region {ENCODED => 0a669e2c97b251fa5927c2af3db342b1, NAME => 'snsgz_log,2033310272_11167_170095583277971,1442621808680.0a669e2c97b251fa5927c2af3db342b1.', STARTKEY => '2033310272_11167_170095583277971', ENDKEY => '2033310272_12912_170151266520407'} failed to move out of transition within timeout 120000ms
        at org.apache.hadoop.hbase.util.HBaseFsckRepair.waitUntilAssigned(HBaseFsckRepair.java:139)
        at org.apache.hadoop.hbase.util.HBaseFsck.tryAssignmentRepair(HBaseFsck.java:1732)
        at org.apache.hadoop.hbase.util.HBaseFsck.checkRegionConsistency(HBaseFsck.java:1873)
        at org.apache.hadoop.hbase.util.HBaseFsck.checkAndFixConsistency(HBaseFsck.java:1559)
        at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:465)
        at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:484)
        at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:4032)
        at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:3841)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3829)
mattshma commented 8 years ago

Deleting /hbase in ZooKeeper failed with: Node does not exist: /hbase/replication/rs/. I deleted the child nodes under /hbase/replication/rs/ one by one, after which deleting /hbase succeeded.
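The one-by-one deletion described above amounts to a children-first (post-order) walk of the znode tree. Here is a minimal Python sketch of that idea. It assumes a client object that exposes `get_children(path)` and `delete(path)`, which is the shape of the kazoo `KazooClient` API; kazoo itself and the function name `delete_znode_tree` are assumptions for illustration and do not appear in the original thread.

```python
def delete_znode_tree(zk, path):
    """Delete `path` and everything below it, children first.

    `zk` is any object with get_children(path) and delete(path),
    for example a connected kazoo KazooClient (an assumption; the
    original thread used the ZooKeeper command line).
    Returns the deleted paths in the order they were removed.
    """
    deleted = []
    for child in zk.get_children(path):
        child_path = path.rstrip("/") + "/" + child
        # Recurse first so a node is only deleted once it is empty.
        deleted.extend(delete_znode_tree(zk, child_path))
    zk.delete(path)
    deleted.append(path)
    return deleted
```

With kazoo this would be called as `delete_znode_tree(zk, "/hbase")` on a started client. kazoo's `delete` also accepts `recursive=True`, which may be simpler when it works; the explicit walk is mainly useful for seeing which child fails.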

I restarted the cluster. The number of online regions gradually increased over time, and the problem was resolved.

mattshma commented 8 years ago

The problem occurred again:

16/04/13 17:13:25 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 00bd9257d29291cac5913653076aab5f, NAME => 'imsdk_private_message,135-2171310277-4627248500237336577-000000000054,1459855573110.00bd9257d29291cac5913653076aab5f.', STARTKEY => '135-2171310277-4627248500237336577-000000000054', ENDKEY => '135-2171310292-4627389251458105345-000000000927'}
16/04/13 17:13:26 INFO client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/04/13 17:13:26 INFO client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152b5392f430db7
16/04/13 17:13:26 INFO zookeeper.ZooKeeper: Session: 0x152b5392f430db7 closed
16/04/13 17:13:26 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Region {ENCODED => 00bd9257d29291cac5913653076aab5f, NAME => 'imsdk_private_message,135-2171310277-4627248500237336577-000000000054,1459855573110.00bd9257d29291cac5913653076aab5f.', STARTKEY => '135-2171310277-4627248500237336577-000000000054', ENDKEY => '135-2171310292-4627389251458105345-000000000927'} failed to move out of transition within timeout 120000ms
    at org.apache.hadoop.hbase.util.HBaseFsckRepair.waitUntilAssigned(HBaseFsckRepair.java:139)
    at org.apache.hadoop.hbase.util.HBaseFsck.tryAssignmentRepair(HBaseFsck.java:1732)
    at org.apache.hadoop.hbase.util.HBaseFsck.checkRegionConsistency(HBaseFsck.java:1873)
    at org.apache.hadoop.hbase.util.HBaseFsck.checkAndFixConsistency(HBaseFsck.java:1559)
    at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:465)
    at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:484)
    at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:4032)
    at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:3841)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3829)

16/04/13 17:22:37 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 52de7a36375d495194ce4f38494dba3d, NAME => 'snsgz_log,2033310844_13038_16993216120709551398,1456143017426.52de7a36375d495194ce4f38494dba3d.', STARTKEY => '2033310844_13038_16993216120709551398', ENDKEY => '2033310844_15900_16993135666709551602'}
ERROR: Region { meta => snsgz_log,2072310357_12105_16988486054709551539,1459421803327.53c2a5d20732a6b33f486e448179b9d2., hdfs => hdfs://hadoop/hbase/data/default/snsgz_log/53c2a5d20732a6b33f486e448179b9d2, deployed =>  } not deployed on any region server.
Trying to fix unassigned region...
16/04/13 17:22:38 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181 sessionTimeout=60000 watcher=catalogtracker-on-hconnection-0x50bba944, quorum=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181, baseZNode=/hbase
16/04/13 17:22:39 INFO zookeeper.RecoverableZooKeeper: Process identifier=catalogtracker-on-hconnection-0x50bba944 connecting to ZooKeeper ensemble=DataNode0003:2181,DataNode0002:2181,NameNodeMaster:2181,DataNode0006:2181,DataNode0001:2181
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: Opening socket connection to server DataNode0003/10.2.72.26:2181. Will not attempt to authenticate using SASL (unknown error)
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: Socket connection established to DataNode0003/10.2.72.26:2181, initiating session
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: Session establishment complete on server DataNode0003/10.2.72.26:2181, sessionid = 0x452b53930550f8e, negotiated timeout = 60000
16/04/13 17:22:39 INFO zookeeper.ZooKeeper: Session: 0x452b53930550f8e closed
16/04/13 17:22:39 INFO zookeeper.ClientCnxn: EventThread shut down
16/04/13 17:22:39 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 53c2a5d20732a6b33f486e448179b9d2, NAME => 'snsgz_log,2072310357_12105_16988486054709551539,1459421803327.53c2a5d20732a6b33f486e448179b9d2.', STARTKEY => '2072310357_12105_16988486054709551539', ENDKEY => '2072310358_11095_16988216153709551609'}
ERROR: Region { meta => snsgz_log,2033310800_24795_1699380242,1454163437210.541bcb6ae3c62dd5d2390d6a17284de9., hdfs => hdfs://hadoop/hbase/data/default/snsgz_log/541bcb6ae3c62dd5d2390d6a17284de9, deployed =>  } not deployed on any region server.
Trying to fix unassigned region...

After manually assigning the problem regions and running hbase hbck -repair again, the problem was resolved.
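When many regions are affected, collecting their names by hand from the hbck output is tedious. The following Python sketch pulls the region names out of the "not deployed on any region server" lines and prints one hbase shell `assign` command per region. The function names are hypothetical, and the regular expression assumes the line format shown in the logs above, so it may need adjusting for other HBase versions.

```python
import re

# Matches hbck lines such as:
# ERROR: Region { meta => <region name>, hdfs => ..., deployed =>  } not deployed on any region server.
NOT_DEPLOYED = re.compile(
    r"ERROR: Region \{ meta => (?P<name>.+?), hdfs => .*\} "
    r"not deployed on any region server"
)

def unassigned_regions(hbck_output):
    """Return region names reported as not deployed, without duplicates."""
    names = []
    for line in hbck_output.splitlines():
        match = NOT_DEPLOYED.search(line)
        if match and match.group("name") not in names:
            names.append(match.group("name"))
    return names

def assign_commands(names):
    """Build one hbase shell `assign` command per region name."""
    return ["assign '%s'" % name for name in names]

if __name__ == "__main__":
    import sys
    # Usage (assumed): hbase hbck 2>&1 | python this_script.py
    for command in assign_commands(unassigned_regions(sys.stdin.read())):
        print(command)
```

The printed commands can be reviewed and then pasted into hbase shell before running hbase hbck -repair again.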

References:

mattshma commented 8 years ago

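For some regions stuck in the "Region still in transition, waiting for it to become assigned" state, the assign command had no effect. Investigation showed that their HFiles could no longer be found. On hmaster:60010/table.jsp?name=MYTABLE, the regions of that table whose HFiles are missing are shown as not deployed.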

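Looking at such a region's row in hbase:meta, only info:regioninfo is present; info:server is missing. Since the data can no longer be found anyway, I deleted the region's entry from the meta table: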

# hbase shell
> get 'hbase:meta', 'REGION_NAME'
> delete 'hbase:meta', 'REGION_NAME', 'info:regioninfo'

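UPDATE: The region information is also read from the .regioninfo file on HDFS, so even after the entry is deleted from hbase:meta, the .regioninfo file still exists and the region shows up again during a repair.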

mattshma commented 8 years ago

Following up on the above: after running repair a few times, these errors were reported:

ERROR: Region { meta => null, hdfs => hdfs://hadoop/hbase/data/default/imsdk_group_message/1118895dec49d8aa957eeb819e12a047, deployed =>  } on HDFS, but not listed in hbase:meta or deployed on any region server
ERROR: Region { meta => imsdk_group_message,135-2171310278-3-4627406830249771009-000000015814,1461725715049.3bedd318a310246654a783d0058d60bf., hdfs => null, deployed =>  } found in META, but not in HDFS or deployed on any region server.
ERROR: Found lingering reference file hdfs://hadoop/hbase/data/default/snsgz_log/e7e05884594e8a6b821aeee748d21513/info/a5f57cbff1724924b65b9c7bb3878027.f5e336212ec01cce70623ef0b640aa69
ERROR: Found lingering reference file hdfs://hadoop/hbase/data/default/snsgz_log/e7e05884594e8a6b821aeee748d21513/info/0bb353c43a5f4c66b324030e9b2f2f67.f5e336212ec01cce70623ef0b640aa69
ERROR: Found lingering reference file hdfs://hadoop/hbase/data/default/imsdk_group_message/1118895dec49d8aa957eeb819e12a047/content/6dcf3bbdee744cc0bd65dab3bd836ff5.3f24581d80828508aaf60f3dc84014bf

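For regions that have only info:regioninfo and no info:server, no hbck repair had any effect. As a last resort I deleted /hbase in ZooKeeper and restarted the cluster. After another repair the problem was still there; after several more repair runs it was resolved.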

One open question remains: why is the region not on HDFS? Is it because files were lost, or because a metadata update failed during a split?
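One way to approach that question is to count how many hbck errors of each kind appear, since the errors quoted in this comment fall into a few distinct classes. The Python sketch below tallies them by matching the message text shown above. The category names and function names are hypothetical, and the matching relies on the exact wording in these logs, so treat it as a starting point rather than a complete classifier.

```python
from collections import Counter

# (substring of the hbck message, hypothetical category label)
HBCK_ERROR_KINDS = [
    ("on HDFS, but not listed in hbase:meta", "on_hdfs_not_in_meta"),
    ("found in META, but not in HDFS", "in_meta_not_on_hdfs"),
    ("Found lingering reference file", "lingering_reference_file"),
    ("not deployed on any region server", "not_deployed"),
]

def classify(line):
    """Return a category for an hbck ERROR line, or None for other lines."""
    if not line.startswith("ERROR:"):
        return None
    for marker, kind in HBCK_ERROR_KINDS:
        if marker in line:
            return kind
    return "other"

def summarize(hbck_output):
    """Count hbck ERROR lines per category."""
    kinds = (classify(line.strip()) for line in hbck_output.splitlines())
    return Counter(kind for kind in kinds if kind is not None)
```

Reference files are created when a region splits, so a high count of lingering reference files next to regions missing from HDFS or from meta would be consistent with an interrupted split, although it does not rule out plain file loss.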

mattshma commented 8 years ago

#23 and #41