mattshma / bigdata

hadoop,hbase,storm,spark,etc..

because current leaseholder is trying to recreate file #53

Open mattshma opened 8 years ago

mattshma commented 8 years ago

The Flume log contains the following entries:

2016-07-24 23:31:19,287 WARN com.xx.flume.sink.hdfssink.HDFSEventSink: HDFS IO error
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /user/hive/warehouse/gamelog_raw.db/log_bhrookie/game_id=181/ds=20160724/log5_sdk_data for DFSClient_NONMAPREDUCE_166890633_37 for client 10.6.25.147 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3076)

....

2016-07-25 05:25:31,411 WARN com.xx.flume.sink.hdfssink.HDFSEventSink: HDFS IO error
org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1103594103_30617033 of src=/user/hive/warehouse/gamelog_raw.db/log_paygift/game_id=139/ds=20160725/log5_sdk_data is not sufficiently replicated yet.

....

2016-07-25 07:58:12,427 INFO com.xx.flume.sink.filesink.AbstractHDFSWriter: FileSystem's output stream doesn't support getNumCurrentReplicas; --HDFS-826 not available; fsOut=java.io.BufferedOutputStream; err=java.lang.NoSuchMethodException: java.io.BufferedOutputStream.getNumCurrentReplicas()

......

2016-07-25 07:58:54,954 WARN org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer: Got an IOException during write!
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
    at org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.write(AbstractNonblockingServer.java:414)
    at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleWrite(AbstractNonblockingServer.java:221)
    at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:210)
    at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:158)

mattshma commented 7 years ago

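Found that the following sequence triggers this problem: a machine hosting a NodeManager (nm) and a DataNode (dn) needs to be rebooted because of bad disk blocks. For the reboot, the nm and dn are both decommissioned first while some jobs are still running; if the machine is then forcibly shut down, this problem occurs.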

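The correct procedure for rebooting the machine should be: decommission the nm first, wait for the running jobs to finish, then decommission the dn, and only then reboot the machine. Whether the problem can still occur even when following these steps remains to be verified.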
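
The ordering above can be sketched as a small Python helper. This is only an illustration of the sequence, not a tested operational script: `NodeOps`, `wait_until` and `reboot_worker` are hypothetical names, and the four callbacks stand in for whatever the cluster actually uses (for example, adding the host to the exclude files and running `yarn rmadmin -refreshNodes` / `hdfs dfsadmin -refreshNodes`, which is an assumption about the setup).

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeOps:
    """Hypothetical hooks; wire them to your own cluster tooling."""
    decommission_nm: Callable[[str], None]  # e.g. exclude host + `yarn rmadmin -refreshNodes` (assumption)
    jobs_running: Callable[[str], bool]     # True while jobs are still running on the host
    decommission_dn: Callable[[str], None]  # e.g. exclude host + `hdfs dfsadmin -refreshNodes` (assumption)
    reboot: Callable[[str], None]


def wait_until(done: Callable[[], bool], sleep: Callable[[], None], max_polls: int) -> None:
    """Poll `done` until it returns True, calling `sleep` between polls."""
    for _ in range(max_polls):
        if done():
            return
        sleep()
    raise TimeoutError("condition not met after %d polls" % max_polls)


def reboot_worker(
    host: str,
    ops: NodeOps,
    sleep: Callable[[], None] = lambda: time.sleep(30),
    max_polls: int = 1000,
) -> List[str]:
    """Run the steps in the order described above and return a step log."""
    log: List[str] = []

    # 1. Decommission the nm first so no new work lands on the host.
    ops.decommission_nm(host)
    log.append("decommission nm")

    # 2. Wait for jobs already running on the host to finish.
    #    If they never finish, raise instead of forcing a shutdown.
    wait_until(lambda: not ops.jobs_running(host), sleep, max_polls)
    log.append("jobs finished")

    # 3. Only now decommission the dn.
    ops.decommission_dn(host)
    log.append("decommission dn")

    # 4. Reboot the machine last.
    ops.reboot(host)
    log.append("reboot")
    return log
```

The point of the sketch is that the dn is never touched while jobs may still hold open files on HDFS; if the jobs do not drain within `max_polls`, the helper raises `TimeoutError` rather than continuing to the dn decommission and reboot.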