nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

HDFS Data nodes from slaves can't connect to the name node #206

Closed pilgrimkst closed 7 years ago

pilgrimkst commented 7 years ago

Hi, I have problems working with HDFS. In order to access HDFS, I had to start the data nodes manually (cd /home/ec2-user/hadoop && sbin/hadoop-daemon.sh start datanode). But even after I start the datanodes on all instances (both master and slaves), only one datanode is registered, the one running on the master (i.e. locally with the namenode). There is connectivity between the nodes, but I get the following error message in both the datanode and namenode logs:

2017-07-26 16:40:40,493 INFO  [IPC Server handler 5 on 9000] ipc.Server (Server.java:run(2070)) - IPC Server handler 5 on 9000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.registerDatanode from 172.33.9.26:59572 Call#209 Retry#0
org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode because hostname cannot be resolved (ip=172.33.9.26, hostname=172.33.9.26): DatanodeRegistration(0.0.0.0:50010, datanodeUuid=39d7a850-7e68-4d9f-a5be-becb81af752f, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-94f28171-a765-4610-8fd9-3b6a8be02383;nsid=280054336;c=0)
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:873)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:4529)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1286)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:96)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28752)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

Here 172.33.9.26 is the internal IP of one of my slaves (errors are logged from all slaves; I just added one for reference).
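
As a side note on the error itself: the DisallowedDatanodeException above says the namenode could not resolve the datanode's IP to a hostname. A rough way to check that from the master, using the slave IP from the log above (a generic diagnostic sketch, not a Flintrock-specific fix):

# Run on the master (namenode). Check whether the slave's private IP
# reverse-resolves to an EC2 internal hostname:
getent hosts 172.33.9.26

# Check basic reachability of the datanode ports shown in the log (50010/50020):
nc -zv 172.33.9.26 50010
nc -zv 172.33.9.26 50020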

nchammas commented 7 years ago

In order to access HDFS, I had to start the data nodes manually (cd /home/ec2-user/hadoop && sbin/hadoop-daemon.sh start datanode)

Hmm, why do you need to do this? Flintrock should start up HDFS automatically for you as long as you specify --install-hdfs (or the equivalent in config.yaml).
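
For example, something like this should bring HDFS up as part of the launch (a minimal sketch; the cluster name and key details are placeholders, and the remaining EC2 settings are assumed to come from config.yaml):

# Launch a cluster with HDFS installed and started automatically:
flintrock launch test-cluster \
    --install-hdfs \
    --num-slaves 2 \
    --ec2-key-name my-key \
    --ec2-identity-file ~/.ssh/my-key.pem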

Also, does your VPC have an Internet Gateway attached? Flintrock does not currently support private VPCs (#14).
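
If it helps, one quick way to check for an attached Internet Gateway with the AWS CLI (a sketch; replace vpc-xxxxxxxx with the VPC your cluster launches into):

# List any Internet Gateways attached to the VPC:
aws ec2 describe-internet-gateways \
    --filters Name=attachment.vpc-id,Values=vpc-xxxxxxxx

If that returns no gateways, the VPC is effectively private, which Flintrock does not currently support (#14).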

pilgrimkst commented 7 years ago

Yes, I am installing HDFS with

launch:
  install-hdfs: True

I will check the VPC settings and write back.

pilgrimkst commented 7 years ago

This is my Flintrock config: https://gist.github.com/pilgrimkst/204b000e195e543d54a159cebed63168 I also want to mention that all the Spark workers are initialized.

pilgrimkst commented 7 years ago

OK, I think I found the issue: I removed one security group and it started to work, so I am closing this issue.

nchammas commented 7 years ago

Ah, so one of the additional security groups you had configured on launch was interfering with Flintrock?

pilgrimkst commented 7 years ago

Yeah, we had two Flintrock clusters, and I saw a security group named flintrock on the instances, so I thought it would be a good idea to add it. But I guess that one was preventing Flintrock from creating its own flintrock group.
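
For anyone hitting the same thing, a rough way to see which security groups are actually attached to the cluster's instances and which flintrock groups exist (an AWS CLI sketch; the cluster name in the tag filter is a placeholder, so adjust it to your setup):

# Security groups attached to the cluster's instances ("my-cluster" is a placeholder):
aws ec2 describe-instances \
    --filters Name=tag:Name,Values="my-cluster-*" \
    --query 'Reservations[].Instances[].[PrivateIpAddress, SecurityGroups[].GroupName]'

# Security groups whose name starts with "flintrock":
aws ec2 describe-security-groups \
    --filters Name=group-name,Values="flintrock*" \
    --query 'SecurityGroups[].[GroupId, GroupName]'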