mesosphere-backup / hdfs-deprecated

[DEPRECATED] This project is deprecated. It will be archived on December 1, 2017.
Apache License 2.0

Framework startup issues #254

Open hansman opened 8 years ago

hansman commented 8 years ago

We have been trying to bring up the Mesos HDFS framework on an 8-machine Mesos cluster (4 CPUs and 8 GB RAM per machine). Several startup problems have proven hard to debug.

Our config/hdfs-site.xml and config/mesos-site.xml are the repository defaults. We override them with the following environment variables:

export JAVA_HOME=/usr/lib/jvm/{ourJDK}/jre
export MESOS_HDFS_STATE_ZK=app-zk1-groot.service.local:2182,app-zk2-groot.service.local:2182,app-zk3-groot.service.local:2182
export MESOS_MASTER_URI=zookeeper.service.local:2181/mesos
export MESOS_HDFS_ZKFC_HA_ZOOKEEPER_QUORUM=app-zk1-groot.service.local:2182,app-zk2-groot.service.local:2182,app-zk3-groot.service.local:2182
export MESOS_HDFS_JVM_OVERHEAD=0.4
export MESOS_HDFS_NAMENODE_HEAP_SIZE=512
export MESOS_HDFS_EXECUTOR_CPUS=0.7
export MESOS_HDFS_NAMENODE_CPUS=0.7
export MESOS_HDFS_JOURNALNODE_CPUS=0.7
export MESOS_HDFS_DATANODE_CPUS=0.7
export MESOS_HDFS_JVM_OVERHEAD=0.4
export MESOS_HDFS_HADOOP_HEAP_SIZE=256
export MESOS_HDFS_EXECUTOR_HEAP_SIZE=256
export MESOS_HDFS_DATANODE_HEAPSIZE=256

Mesos-DNS is not enabled.

How we launch

Not as a Marathon task and not Dockerized. We just run sh bin/hdfs-mesos.

Blocking problems

1) Could not download hdfs-mesos-executor-0.1.6.tgz
We had to set mesos.hdfs.framework.hostaddress explicitly to the address of the framework scheduler. Before that, journalnodes were being launched but pointed to localhost as the 'config server'.
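
A rough check we would suggest on the scheduler host, in case the machine's own hostname resolves to a loopback address (for example a 127.0.1.1 entry in /etc/hosts). That this is the cause here is only an assumption, not something established above. The function names are made up for illustration, and getent is assumed to be available (glibc systems).

```shell
# Return success if the given address or name is a loopback one.
is_loopback() {
  case "$1" in
    127.*|::1|localhost) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the first address that this machine's fully qualified hostname
# resolves to. Assumes `hostname -f` and `getent` exist on the host.
advertised_addr() {
  getent hosts "$(hostname -f)" | awk '{print $1; exit}'
}

# Report whether the scheduler host would hand out a loopback address.
report_advertised_addr() {
  addr="$(advertised_addr)"
  if is_loopback "$addr"; then
    echo "hostname resolves to loopback ($addr); set the host address explicitly"
  else
    echo "hostname resolves to $addr"
  fi
}
```

Running report_advertised_addr on the scheduler machine prints which case applies. If it reports a loopback address, setting mesos.hdfs.framework.hostaddress explicitly, as we did, sidesteps the lookup.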

2) No datanode or zkfc tasks launch successfully

When the framework tries to launch a datanode or zkfc task, we get the following error:

FATAL ha.ZKFailoverController (ZKFailoverController.java:doRun(213)) - Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper
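
For reference, the flag the error message mentions belongs to the stock Hadoop command hdfs zkfc. Below is a minimal sketch of running it by hand. It assumes the hdfs CLI is on the PATH of a namenode host and that it reads the same hdfs-site.xml and core-site.xml the framework generated. Whether a manual format is appropriate with this framework is not confirmed here. The DRY_RUN variable and the function name are made up for illustration.

```shell
# Print (default) or run the command that creates the HA parent znode.
# The parent znode defaults to /hadoop-ha unless ha.zookeeper.parent-znode
# is overridden in the Hadoop configuration.
format_ha_znode() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: only show what would be executed.
    echo "hdfs zkfc -formatZK -nonInteractive"
  else
    # -nonInteractive aborts instead of prompting if the znode already exists.
    hdfs zkfc -formatZK -nonInteractive
  fi
}

format_ha_znode
```

With DRY_RUN=0 the function actually runs the format step, so it should only be executed once and on a namenode host.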

3) In a separate environment (same parameters), the framework does not start up due to a socket error:

016-04-13 01:46:43,426:8395(0x7f60c6ff5700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.0.104:2182] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
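
To narrow this one down, a small sketch for probing every host:port pair of a ZooKeeper quorum string from the machine that runs the scheduler. It assumes nc (netcat) is installed and supports the -z and -w options, which varies between netcat variants. The function names are made up for illustration.

```shell
# Split "host:port,host:port" into one "host port" pair per line.
split_quorum() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r host port; do
    if [ -n "$host" ]; then
      printf '%s %s\n' "$host" "$port"
    fi
  done
}

# Try a TCP connect to each quorum member and report the result.
check_quorum() {
  split_quorum "$1" | while read -r host port; do
    # -z: connect without sending data, -w 2: give up after two seconds.
    if nc -z -w 2 "$host" "$port" </dev/null 2>/dev/null; then
      echo "reachable   $host:$port"
    else
      echo "UNREACHABLE $host:$port"
    fi
  done
}
```

Calling check_quorum "$MESOS_HDFS_STATE_ZK" after exporting the settings above lists each quorum member with its status. A member that shows up as unreachable would be consistent with the "Host is down" errno in the log, although the probe only tests TCP connectivity, not ZooKeeper health.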