linkedin / dynamometer

A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
BSD 2-Clause "Simplified" License

Zombie simulated datanode #81

Open fengnanli opened 5 years ago

fengnanli commented 5 years ago

After running start-dynamometer-cluster.sh and replaying the production audit log for some time, some simulated DataNodes (containers) lost their connection to the RM. When the YARN application was killed, those containers kept running and continued sending their block reports to the NameNode. Because the DataNodes' block state had advanced during the replay while the NameNode was restarted from a fresh fsimage, the errors below show up on the WebHDFS page after the NameNode starts up.

Safe mode is ON. The reported blocks 1526116 needs additional 395902425 blocks to reach the threshold 0.9990 of total blocks 397826363. The number of live datanodes 3 has reached the minimum number 0. Name node detected blocks with generation stamps in future. This means that Name node metadata is inconsistent. This can happen if Name node metadata files have been manually replaced. Exiting safe mode will cause loss of 7141 byte(s). Please restart name node with right metadata or use "hdfs dfsadmin -safemode forceExit" if you are certain that the NameNode was started with the correct FsImage and edit logs. If you encountered this during a rollback, it is safe to exit with -safemode forceExit.

Checking the Datanode tab of the WebHDFS page, a handful of these zombie datanodes show up in the list.
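Until the root cause is fixed, one workaround is to manually hunt down the orphaned DataNode JVMs on the NodeManager hosts after killing the YARN application. A minimal sketch, assuming the simulated DataNodes run the standard `org.apache.hadoop.hdfs.server.datanode.DataNode` main class (adjust the pattern to match how your containers actually launch the process):

```shell
# Hedged sketch: extract PIDs of leftover simulated-DataNode JVMs from
# ps-style output. The class-name pattern is an assumption; verify it
# against a live container's command line before killing anything.
find_zombie_datanodes() {
  # Reads `ps -ef`-style lines on stdin; prints the PID (2nd column) of
  # any line that looks like a Dynamometer-launched DataNode process.
  awk '/org\.apache\.hadoop\.hdfs\.server\.datanode\.DataNode/ {print $2}'
}

# Example against captured output (in practice: ps -ef | find_zombie_datanodes)
sample='yarn  4242     1  1 10:00 ?  00:01:00 java -Xmx2g org.apache.hadoop.hdfs.server.datanode.DataNode
yarn  4250  4242  0 10:00 ?  00:00:01 bash -c sleep 60'
printf '%s\n' "$sample" | find_zombie_datanodes
```

The printed PIDs could then be fed to `xargs -r kill`, ideally run across all NodeManager hosts before restarting the NameNode so no stale block reports arrive.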

xkrogen commented 5 years ago

Thanks for reporting this @fengnanli ! I think I asked before but don't remember your answer: was this running within a secure environment using the LinuxContainerExecutor / cgroups? I believe that is what prevents such things from occurring in our environment.
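For reference, the NodeManager can be pointed at the LinuxContainerExecutor via yarn-site.xml; it kills a container's full process tree on application termination, which is what should prevent these zombies. A minimal sketch only (the executor additionally requires the setuid `container-executor` binary and its `container-executor.cfg` to be installed and configured on each host):

```xml
<!-- Sketch: enable LinuxContainerExecutor with cgroups-based resource
     handling so orphaned container processes are reaped. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
```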