mesosphere-backup / hdfs-deprecated

[DEPRECATED] This project is deprecated. It will be archived on December 1, 2017.
Apache License 2.0

marathon COMMAND healthcheck failure for hdfs (DCOS_PACKAGE_FRAMEWORK_NAME:hdfs) #257

Open strzelecki-maciek opened 8 years ago

strzelecki-maciek commented 8 years ago

After 2-4 minutes I can see all data/name/journal node tasks running in the Mesos master UI, as well as the hdfs task. I see no errors in any of these Mesos task logs.

However, the hdfs task (the deployment task from Marathon) stays in the deploying state forever, flickering briefly through idle/unhealthy and back into deploying.

According to Marathon, the COMMAND check fails. When run manually, the hdfs test command works:

[root@slave1 hdfs-mesos-0.1.9]# timeout 45s hadoop fs -ls hdfs://hdfs/ && rm -rf hdfs-framework-healthcheck && hadoop fs -rm -r -f hdfs://hdfs/hdfs-framework-healthcheck && hadoop fs -mkdir hdfs://hdfs/hdfs-framework-healthcheck && mkdir hdfs-framework-healthcheck && echo \"this is a test\" > hdfs-framework-healthcheck/test1.txt && hadoop fs -put hdfs-framework-healthcheck/test1.txt hdfs://hdfs/hdfs-framework-healthcheck && hadoop fs -get hdfs://hdfs/hdfs-framework-healthcheck/test1.txt hdfs-framework-healthcheck/test2.txt && rm -rf hdfs-framework-healthcheck && hadoop fs -rm -r -f hdfs://hdfs/hdfs-framework-healthcheck
16/05/10 14:13:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/10 14:13:17 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-05-10 14:13 hdfs://hdfs/hdfs-framework-healthcheck
16/05/10 14:13:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/10 14:13:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/05/10 14:13:19 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted hdfs://hdfs/hdfs-framework-healthcheck
16/05/10 14:13:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/10 14:13:21 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/05/10 14:13:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/10 14:13:24 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/05/10 14:13:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/10 14:13:26 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/05/10 14:13:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/10 14:13:29 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/05/10 14:13:29 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted hdfs://hdfs/hdfs-framework-healthcheck
[root@slave1 hdfs-mesos-0.1.9]# echo $?
0

The hdfs framework itself seems to be healthy: file transfer, listing, and removal all work.

Can the output of the healthcheck COMMAND be misinterpreted by marathon?
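One way to narrow this down is to look at the exit status and run time of the check instead of its output. The manual run above spans 14:13:16 to 14:13:29, roughly 13 seconds, so a time limit on the health check could matter. Below is a minimal sketch, assuming GNU `timeout` is available; `run_check` is a hypothetical helper name, and the time limit passed to it is a placeholder, not a value taken from the package definition.

```shell
#!/bin/sh
# Hypothetical helper (not part of hdfs-mesos or Marathon): run a command
# string via `sh -c` under a time limit, then report exit status and
# elapsed wall-clock seconds.
# Usage: run_check <timeout_seconds> <command string>
run_check() {
    limit=$1
    cmd=$2
    start=$(date +%s)
    # GNU timeout exits with 124 when the time limit is reached
    timeout "${limit}s" sh -c "$cmd" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s"
    return "$status"
}
```

For example, `run_check 20 'hadoop fs -ls hdfs://hdfs/'` would print the exit status and duration of the first step only. An `exit=124` result would point at the time limit rather than at how the output is interpreted.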

My cluster is 10 machines, CentOS 7.2, DC/OS 1.7, provisioned according to the documentation. The HDFS installation was triggered via the web GUI from Universe -> Packages -> hdfs, with default values. Between installation attempts the framework was torn down, zk hadoop-ha and mesos-hdfs were cleared, and the data directories /var/lib/hdfs on all slaves were cleared as well.

The only thing that seems to be broken is the interpretation of the COMMAND output.

"protocol": "COMMAND",
    "command": {
      "value": "export PATH=$MESOS_DIRECTORY/hdfs-mesos-0.1.9/bin:$PATH && export JAVA_HOME=$MESOS_DIRECTORY/jre1.7.0_76 && timeout 45s hadoop fs -ls hdfs://hdfs/ && rm -rf hdfs-framework-healthcheck && hadoop fs -rm -r -f hdfs://hdfs/hdfs-framework-healthcheck && hadoop fs -mkdir hdfs://hdfs/hdfs-framework-healthcheck && mkdir hdfs-framework-healthcheck && echo \"this is a test\" > hdfs-framework-healthcheck/test1.txt && hadoop fs -put hdfs-framework-healthcheck/test1.txt hdfs://hdfs/hdfs-framework-healthcheck && hadoop fs -get hdfs://hdfs/hdfs-framework-healthcheck/test1.txt hdfs-framework-healthcheck/test2.txt && rm -rf hdfs-framework-healthcheck && hadoop fs -rm -r -f hdfs://hdfs/hdfs-framework-healthcheck"
    },
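Note that the manual run was done as root from an interactive shell and without the two `export` statements at the start of this value, so it may have picked up a different `PATH` or `JAVA_HOME` than the health check does. A minimal sketch for re-running the exact string in a stripped environment follows; `run_in_sandbox_env` is a hypothetical helper, and the assumption that the executor environment is this bare is mine, not something confirmed by the logs.

```shell
#!/bin/sh
# Hypothetical helper: run a command string from inside the given directory
# with a nearly empty environment, keeping only a basic PATH and a
# MESOS_DIRECTORY value supplied by the caller.
# Usage: run_in_sandbox_env <sandbox_dir> <command string>
run_in_sandbox_env() (
    # The check uses relative paths, so start in the sandbox directory
    cd "$1" || exit 1
    # env -i drops every inherited variable (JAVA_HOME, HADOOP_*, ...)
    env -i PATH=/usr/bin:/bin MESOS_DIRECTORY="$1" /bin/sh -c "$2"
)
```

Passing the task's sandbox path as the first argument and the full `value` string (with the JSON `\"` escapes turned back into plain quotes) as the second, followed by `echo $?`, should show whether the command still exits 0 without root's interactive environment.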
renou commented 8 years ago

+1