sequenceiq / docker-ambari

Docker image with Ambari

HDFS block corruption when running a new container #60

Closed oguennec closed 9 years ago

oguennec commented 9 years ago

I consistently get HDFS / HBase block corruption when I run a new container from an image of a healthy single-node HDP cluster.

Steps followed:

Example of corruption:

-bash-4.1# HADOOP_USER_NAME=hdfs hdfs fsck /
Connecting to namenode via http://og.mycorp.com:50070
FSCK started by hdfs (auth:SIMPLE) from /172.17.0.2 for path / at Mon Jun 08 11:33:06 EDT 2015
.
/app-logs/ambari-qa/logs/application_1433328045348_0001/og.mycorp.com_45454: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741858
/app-logs/ambari-qa/logs/application_1433328045348_0001/og.mycorp.com_45454: MISSING 1 blocks of total size 7080 B..
/app-logs/ambari-qa/logs/application_1433502507205_0001/og.mycorp.com_45454: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741976_1154. Target Replicas is 3 but found 1 replica(s).
.
/app-logs/ambari-qa/logs/application_1433502507205_0002/og.mycorp.com_45454: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741988_1166. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/default/ambarismoketest/.tabledesc/.tableinfo.0000000001: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741841_1017. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/default/ambarismoketest/ac82f75a8636f78f9629dd4b480106d2/.regioninfo: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741842
/apps/hbase/data/data/default/ambarismoketest/ac82f75a8636f78f9629dd4b480106d2/.regioninfo: MISSING 1 blocks of total size 50 B..
/apps/hbase/data/data/default/ambarismoketest/ac82f75a8636f78f9629dd4b480106d2/family/0ade395e2a9b49b8a6ce711d482788d8: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741863_1039. Target Replicas is 3 but found 1 replica(s).
..
/apps/hbase/data/data/hbase/meta/.tabledesc/.tableinfo.0000000001: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741828
/apps/hbase/data/data/hbase/meta/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 372 B..
/apps/hbase/data/data/hbase/meta/1588230740/.regioninfo: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741827_1003. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/hbase/meta/1588230740/info/8420cae8bce94280995695060a910546: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073742145_1325. Target Replicas is 3 but found 1 replica(s).
..
/apps/hbase/data/data/hbase/namespace/.tabledesc/.tableinfo.0000000001: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741834
/apps/hbase/data/data/hbase/namespace/.tabledesc/.tableinfo.0000000001: MISSING 1 blocks of total size 286 B..
/apps/hbase/data/data/hbase/namespace/14115c2297e3486d8f3f4ebf785fd11d/.regioninfo: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741835_1011. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/data/hbase/namespace/14115c2297e3486d8f3f4ebf785fd11d/info/418efc3186ad4896978913edf793cec4: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741861_1037. Target Replicas is 3 but found 1 replica(s).
..
/apps/hbase/data/hbase.id: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741826
/apps/hbase/data/hbase.id: MISSING 1 blocks of total size 42 B..
/apps/hbase/data/hbase.version: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741825_1001. Target Replicas is 3 but found 1 replica(s).
.
/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773424585: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073742140
/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773424585: MISSING 1 blocks of total size 655 B..
/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773750783.meta: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073742144
/apps/hbase/data/oldWALs/og.mycorp.com%2C60020%2C1433773404945.1433773750783.meta: MISSING 1 blocks of total size 541 B..
/hdp/apps/2.2.4.2-2/hive/hive.tar.gz: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741989
/hdp/apps/2.2.4.2-2/hive/hive.tar.gz: MISSING 1 blocks of total size 83000677 B..
/hdp/apps/2.2.4.2-2/mapreduce/hadoop-streaming.jar: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741991
/hdp/apps/2.2.4.2-2/mapreduce/hadoop-streaming.jar: MISSING 1 blocks of total size 104996 B..
/hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741829_1005. Target Replicas is 3 but found 1 replica(s).
/hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741830
/hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz: MISSING 1 blocks of total size 58479639 B..
/hdp/apps/2.2.4.2-2/pig/pig.tar.gz: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741990_1168. Target Replicas is 3 but found 1 replica(s).
.
/hdp/apps/2.2.4.2-2/tez/tez.tar.gz: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741960_1138. Target Replicas is 3 but found 1 replica(s).
.
/mr-history/done/2015/06/03/000000/job_1433328045348_0001-1433328283077-ambari%2Dqa-word+count-1433328323621-1-1-SUCCEEDED-default-1433328302419.jhist: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741856
/mr-history/done/2015/06/03/000000/job_1433328045348_0001-1433328283077-ambari%2Dqa-word+count-1433328323621-1-1-SUCCEEDED-default-1433328302419.jhist: MISSING 1 blocks of total size 33669 B..
/mr-history/done/2015/06/03/000000/job_1433328045348_0001_conf.xml: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741857_1033. Target Replicas is 3 but found 1 replica(s).
.
/mr-history/done/2015/06/05/000000/job_1433502507205_0001-1433503933474-ambari%2Dqa-PigLatin%3ApigSmoke.sh-1433503964156-1-0-SUCCEEDED-default-1433503952122.jhist: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741974_1152. Target Replicas is 3 but found 1 replica(s).
.
/mr-history/done/2015/06/05/000000/job_1433502507205_0001_conf.xml: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741975
/mr-history/done/2015/06/05/000000/job_1433502507205_0001_conf.xml: MISSING 1 blocks of total size 227572 B..
/tmp/id11ac4100_date410315: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741840
/tmp/id11ac4100_date410315: MISSING 1 blocks of total size 1393 B..
/user/ambari-qa/mapredsmokeinput: Under replicated BP-108620518-172.17.0.65-1433327686475:blk_1073741847_1023. Target Replicas is 3 but found 1 replica(s).
..
/user/ambari-qa/mapredsmokeoutput/part-r-00000: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741854
/user/ambari-qa/mapredsmokeoutput/part-r-00000: MISSING 1 blocks of total size 1475 B..
/user/ambari-qa/passwd: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741977
/user/ambari-qa/passwd: MISSING 1 blocks of total size 1521 B...
/user/ambari-qa/pigsmoke.out/part-v000-o000-r-00000: CORRUPT blockpool BP-108620518-172.17.0.65-1433327686475 block blk_1073741987
/user/ambari-qa/pigsmoke.out/part-v000-o000-r-00000: MISSING 1 blocks of total size 207 B.

Status: CORRUPT
Total size: 414441608 B
Total dirs: 8591
Total files: 35
Total symlinks: 0
Total blocks (validated): 31 (avg. block size 13369084 B)

CORRUPT FILES: 16
MISSING BLOCKS: 16
MISSING SIZE: 141860175 B
CORRUPT BLOCKS: 16

Minimally replicated blocks: 15 (48.387096 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 15 (48.387096 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.48387095
Corrupt blocks: 16
Missing replicas: 30 (32.258064 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jun 08 11:33:06 EDT 2015 in 605 milliseconds

The filesystem under path '/' is CORRUPT
-bash-4.1#

oguennec commented 9 years ago

I solved this issue by adding the --volumes-from initial_container option when running the second container.

I had a closer look at the Dockerfile of the sequenceiq/ambari Docker image and found that it contains a VOLUME /var/log instruction. When the cluster was created, HDP wrote a large number of files to this location. Data stored in a volume is not part of the image, so those files were missing from the new container.
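For anyone hitting the same problem, here is a minimal sketch of the fix, assuming the image of the healthy cluster was produced with docker commit. The image name hdp-snapshot is a hypothetical placeholder, and initial_container stands for the container that originally ran the cluster. The script only prints the two docker commands so they can be reviewed before being run by hand.

```shell
#!/bin/sh
# Sketch only: both names below are placeholders, not values from this issue.
INITIAL_CONTAINER="${INITIAL_CONTAINER:-initial_container}"  # container that ran the healthy cluster
SNAPSHOT_IMAGE="${SNAPSHOT_IMAGE:-hdp-snapshot}"             # hypothetical name of the image built from it

# Print the command that lists the volumes declared by an image.
# Anything written under a declared VOLUME path lives outside the
# image layers and is therefore not captured in the image.
volumes_cmd() {
  printf "docker inspect --format '{{json .Config.Volumes}}' %s\n" "$1"
}

# Print the command that starts a second container which mounts the
# volumes of the first container instead of fresh, empty ones.
run_cmd() {
  printf 'docker run -d --volumes-from %s %s\n' "$1" "$2"
}

volumes_cmd "$SNAPSHOT_IMAGE"
run_cmd "$INITIAL_CONTAINER" "$SNAPSHOT_IMAGE"
```

The first container must still exist (stopped is fine) for --volumes-from to work, so it should not be removed with docker rm -v before the second container is started.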