Timeout waiting for HDFS master with m5.large instances

cwithrowss commented 6 years ago

I have been unable to launch any clusters with working HDFS. I can launch clusters without installing Hadoop. I have tried several direct Hadoop download URLs from Apache mirrors, all of which download fine in a browser, to no avail. After Flintrock purportedly installs HDFS and Spark, it tries to configure the HDFS master and times out waiting for it to come up. This happens every time, and I'm only launching with 1 slave.

I do not have a lot of dev experience so I apologize if I am reporting this poorly.

debug info:

Configuring ephemeral storage...
2018-07-02 12:06:22,639 - flintrock.core      - INFO  - [52.3.225.2] Installing Java 1.8...
2018-07-02 12:06:24,774 - flintrock.services  - INFO  - [34.239.112.61] Installing HDFS...
2018-07-02 12:06:29,389 - flintrock.services  - INFO  - [52.3.225.2] Installing HDFS...
2018-07-02 12:06:43,781 - flintrock.services  - INFO  - [34.239.112.61] Installing Spark...
2018-07-02 12:06:50,972 - flintrock.services  - INFO  - [52.3.225.2] Installing Spark...
2018-07-02 12:07:34,130 - flintrock.services  - INFO  - [34.239.112.61] Configuring HDFS master...
2018-07-02 12:09:33,128 - flintrock.services  - DEBUG - Timed out waiting for HDFS master to come up. Trying again...
2018-07-02 12:11:28,162 - flintrock.services  - DEBUG - Timed out waiting for HDFS master to come up. Trying again...
2018-07-02 12:13:23,036 - flintrock.services  - DEBUG - Timed out waiting for HDFS master to come up.
Do you want to terminate the 2 instances created by this operation? [Y/n]: y

Flintrock version: 0.9.0 and 0.10.0.dev0
Python version: 3.6.1
OS: OS X

nchammas commented 6 years ago

What do you see in the Hadoop logs if you don't terminate the cluster and instead use flintrock login to log in to the master? They should give you a clue as to why the master is failing to come up. Or maybe the HDFS master is coming up fine and it's just that Flintrock is unable to reach it via its web API for a health check.
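To illustrate what a health check of this kind involves, here is a minimal sketch of polling a web endpoint with retries. It is not Flintrock's actual code: the function name `wait_for_http`, the retry count, and the delays are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, attempts=3, timeout=5.0, delay=1.0, fetch=None):
    """Return True once `url` answers with HTTP 200, False if every attempt fails.

    `fetch` is a hypothetical hook (url, timeout) -> status code, so the
    retry logic can be exercised without a real cluster.
    """
    if fetch is None:
        def fetch(u, t):
            with urllib.request.urlopen(u, timeout=t) as resp:
                return resp.status

    for attempt in range(1, attempts + 1):
        try:
            if fetch(url, timeout) == 200:
                return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: the master is not up yet.
            pass
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Running something like this from your own machine and again from the master itself would help tell the two cases apart: if the check succeeds on the master but fails remotely, the NameNode is up and the problem is reachability. The URL to poll depends on the Hadoop version; 50070 is the default NameNode web port in Hadoop 2.x, but treat that as an assumption for your setup.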

cwithrowss commented 6 years ago

Thanks for your help. The logs are there, but there are lots of errors! Some are connection related, but I'd like to share the first one, which confuses me the most:

2018-07-02 22:20:56,528 WARN  [main] namenode.FileJournalManager (FileJournalManager.java:startLogSegment(129)) - Unable to start log segment 1 at /media/ephemeral0/hadoop/dfs/name/current/edits_inprogress_0000000000000000001: No space left on device
2018-07-02 22:20:56,528 ERROR [main] common.Storage (NNStorage.java:reportErrorsOnDirectory(850)) - Error reported on storage directory Storage Directory /media/ephemeral0/hadoop/dfs/name
2018-07-02 22:20:56,528 WARN  [main] common.Storage (NNStorage.java:reportErrorsOnDirectory(855)) - About to remove corresponding storage: /media/ephemeral0/hadoop/dfs/name
2018-07-02 22:20:56,529 ERROR [main] namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(410)) - Error: starting log segment 1 failed for (journal JournalAndStream(mgr=FileJournalManager(root=/media/ephemeral0/hadoop/dfs/name), stream=null))
java.io.IOException: No space left on device

I don't know if this means it's not trying to store data in the correct place, or what. I don't want to go into a potentially separate issue, but I'm also not sure how to confirm from the logs whether HDFS did come up.
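When a log has many errors, pulling out just the first few WARN and ERROR entries makes it easier to find the root cause. The sketch below assumes the line layout seen in the excerpt above (timestamp, level, thread in brackets, then the message). The function name `first_problems` is made up for this example.

```python
import re

# Matches lines such as:
# 2018-07-02 22:20:56,528 WARN  [main] namenode.FileJournalManager (...) - message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<rest>.*)$"
)


def first_problems(lines, levels=("WARN", "ERROR", "FATAL"), limit=5):
    """Return up to `limit` (timestamp, level, message) tuples, in log order."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not fit the layout (e.g. stack traces) are skipped.
        if match and match.group("level") in levels:
            found.append((match.group("ts"), match.group("level"), match.group("rest")))
            if len(found) >= limit:
                break
    return found
```

It accepts any iterable of lines, so an open log file can be passed directly. The earliest entries are usually the most informative, since later errors are often consequences of the first one.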

nchammas commented 6 years ago

That's a good lead.

What instance type and AMI ID are you trying to launch with?

cwithrowss commented 6 years ago

instance-type: m5.large ami: ami-97785bed # Amazon Linux, us-east-1

nchammas commented 6 years ago

Ah, that's why. Try m3.large instead.

For some reason, m5.large exposes a tiny 1 MB ephemeral drive, which Flintrock then automatically tries to use for HDFS. I'll have to investigate the best way to handle this. If an instance does not have ephemeral storage, then Flintrock defaults to using the root EBS volume.
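One possible way to handle this, sketched here purely as an illustration and not as what Flintrock does or will do, is to ignore any ephemeral mount below a minimum size and fall back to the root volume. The function name `pick_storage_dir` and the 1 GiB threshold are assumptions. The `/media/ephemeral0` path is taken from the log above.

```python
import shutil


def pick_storage_dir(candidates, fallback, min_bytes=1024**3, usage=shutil.disk_usage):
    """Return the first candidate mount with at least `min_bytes` total capacity.

    Falls back to `fallback` (e.g. a directory on the root EBS volume) when no
    candidate is large enough. `usage` is injectable so the logic can be tested
    without real devices; by default it is shutil.disk_usage.
    """
    for path in candidates:
        try:
            total = usage(path).total
        except OSError:
            # Mount point missing or unreadable: treat it as unusable.
            continue
        if total >= min_bytes:
            return path
    return fallback
```

With a 1 MB device at `/media/ephemeral0`, `pick_storage_dir(["/media/ephemeral0"], "/")` would skip the tiny drive and return the fallback.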

cwithrowss commented 6 years ago

Thanks! It's working now. I had used m5.large because the AWS site said m3 was deprecated, but I see it's still supported.

nchammas commented 6 years ago

Great! I'll keep this issue open so I can fix the problem with m5 instances, because Flintrock should ideally work fine with those too.

nchammas / flintrock, issue #256