jcrist / skein

A tool and library for easily deploying applications on Apache YARN
https://jcristharif.com/skein/
BSD 3-Clause "New" or "Revised" License

Skein cannot find the resource manager host #196

Open karlam123 opened 4 years ago

karlam123 commented 4 years ago

Hi!

I'm trying to run skein application submit hello_world.yaml, using the example found here: https://jcrist.github.io/skein/quickstart.html

I'm on HDP 3.1.4.0-315, python3.6 and skein 0.8.0. The following environment variables are set: HADOOP_HOME, HADOOP_CONF_DIR and HADOOP_HDFS_HOME.

Logs:

19/11/26 08:52:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/26 08:52:04 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/11/26 08:52:05 INFO client.AHSProxy: Connecting to Application History server at host.sss.com.company.com/ip_address:10200
19/11/26 08:52:05 INFO skein.Driver: Driver started, listening on 38539
19/11/26 08:52:06 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
19/11/26 08:52:06 INFO resource.ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
19/11/26 08:52:06 INFO hdfs.DFSClient: Created token for karl: HDFS_DELEGATION_TOKEN owner=KARL@COMPANY_HOST, renewer=yarn, realUser=, issueDate=1574758326364, maxDate=1575363126364, sequenceNumber=45051, masterKeyId=261 on ha-hdfs:ProdHadoop
19/11/26 08:52:06 INFO kms.KMSClientProvider: New token created: (Kind: kms-dt, Service: kms://https@host3.sss.com.company.com:port/kms, Ident: (kms-dt owner=KARL, renewer=yarn, realUser=, issueDate=1574758326766, maxDate=1575363126766, sequenceNumber=44570, masterKeyId=399))
19/11/26 08:52:06 INFO security.TokenCache: Got dt for hdfs://ProdHadoop; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ProdHadoop, Ident: (token for karl: HDFS_DELEGATION_TOKEN owner=KARL@COMPANY_HOST, renewer=yarn, realUser=, issueDate=1574758326364, maxDate=1575363126364, sequenceNumber=45051, masterKeyId=261)
19/11/26 08:52:06 INFO security.TokenCache: Got dt for hdfs://ProdHadoop; Kind: kms-dt, Service: kms://https@host1.sss.com.company.com;host2.sss.com.company.com;host3.sss.com.company.com:port/kms, Ident: (kms-dt owner=KARL, renewer=yarn, realUser=, issueDate=1574758326766, maxDate=1575363126766, sequenceNumber=44570, masterKeyId=399)
Error: Failed to submit application, exception:
org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to rm/_HOST@COMPANY_HOST
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.setRenewer(AbstractDelegationTokenIdentifier.java:113)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.<init>(AbstractDelegationTokenIdentifier.java:57)
        at org.apache.hadoop.yarn.security.client.YARNDelegationTokenIdentifier.<init>(YARNDelegationTokenIdentifier.java:42)
        at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier.<init>(RMDelegationTokenIdentifier.java:61)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1083)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:355)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:579)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
Caused by: org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to rm/_HOST@COMPANY_HOST
        at org.apache.hadoop.security.authentication.util.KerberosName.getShortName(KerberosName.java:401)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.setRenewer(AbstractDelegationTokenIdentifier.java:111)
        ... 14 more

It seems to me that _HOST in rm/_HOST@COMPANY_HOST is not being resolved. I can submit jobs with spark-submit, so I think the configuration on the cluster is OK.

Does anybody have an idea on what the problem could be or where I should look?

jcrist commented 4 years ago

Hmmm, this is a new one. I suspect your spark-submit configuration has some extra things specified that we're not picking up (spark and hadoop CLIs allow specifying configuration options in environment variables that aren't picked up by the corresponding java libraries, which is annoying). A few things that would help debug:

There may be something else going on besides the above, but this is what I'd check first.

karlam123 commented 4 years ago

Thanks for the help!

jcrist commented 4 years ago

Hmmm, ok. A few other places there might be things:

jcrist commented 4 years ago

The following links might also be relevant, particularly the solution in the stackoverflow one:

It's not clear to me why yarn/spark would work when our code doesn't - we should be taking the same login path as those tools.

karlam123 commented 4 years ago

Thanks and sorry for the late reply!

Thank you for the links, I'll look into them.

jcrist commented 4 years ago

Try setting SKEIN_DRIVER_JAVA_OPTIONS as follows:

export SKEIN_DRIVER_JAVA_OPTIONS="$HADOOP_OPTS -Djavax.security.auth.useSubjectCredsOnly=false"

The javax.security.auth.useSubjectCredsOnly property deals with kerberos authentication and may lead to the failure we're seeing here.

karlam123 commented 4 years ago

Still the same error unfortunately.

jcrist commented 4 years ago

In my example above, did you already have a global skein driver running (i.e. had you run skein driver start previously)? The Java options are only loaded on driver startup. To be clear, the test above would have been:

$ skein driver stop  # ensure there's no global driver
$ export SKEIN_DRIVER_JAVA_OPTIONS="$HADOOP_OPTS -Djavax.security.auth.useSubjectCredsOnly=false"
$ skein application list

If that's what you did and it didn't work, I'm at a loss here. It's likely a difference between the Java libraries we use (which read from configuration files) and the CLI tools you've successfully gotten working (which pick up additional options from environment variables, shell files, etc...). I'm not sure what else to check. If you happen to have a way to reproduce this in a failing environment (e.g. a docker image) this would make it easier for me to debug locally, but as is I'm not sure how else to help (sorry).

karlam123 commented 4 years ago

First of all, thanks for the help!

Yes, that is what I did. Currently, I don't have a way to reproduce this in a docker image, so I think I will try to familiarize myself with your java code and see if I can't figure out why the host is not picked up for our particular environment.

karlam123 commented 4 years ago

Small update: when I replaced String tokenRenewer = conf.get(YarnConfiguration.RM_PRINCIPAL); in Driver.java with a hardcoded value found in core-site.xml, I could submit applications.

Then I showed this to a colleague who has a lot of experience with Kerberos, and he said that, with some exceptions, any valid user can be used as the renewer instead.

So I put String tokenRenewer = ugi.getUserName(); and that also worked.

jcrist commented 4 years ago

Interesting. What property did you take the hardcoded value from? Do you remember what the exceptions were?

I'm a bit hesitant to change our code here: our current implementation matches what the YARN docs recommend and what other projects do. I'd prefer (if possible) to find a way that works for you to get the RM principal as a renewer. That said, there have been plenty of times where the YARN docs have been flat out wrong, and we have many hacks around YARN bugs, so if your solution works we could use that too.

karlam123 commented 4 years ago

Hi!

I think I have found a way to get the RM principal as a renewer. The underlying issue seems to be that we have an HA setup for the RM, and conf.get(YarnConfiguration.RM_PRINCIPAL); doesn't do the _HOST replacement. Hadoop 3 has a utility that can do it: http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-client/apidocs/org/apache/hadoop/yarn/client/util/YarnClientUtils.html

So I copy-pasted the code for public static String getRmPrincipal(String rmPrincipal, Configuration conf) found here: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/util/YarnClientUtils.java and then it also worked. This seems to be how they solved it here as well: https://github.com/linkedin/TonY/blob/master/tony-core/src/main/java/com/linkedin/tony/TonyClient.java.
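
For anyone reading along, the _HOST replacement described above can be sketched roughly as follows. This is a simplified illustration in plain Java, not the Hadoop or skein implementation: the class RmPrincipalSketch and the method resolveRmPrincipal are hypothetical names, a Map stands in for Hadoop's Configuration, and the host is looked up from yarn.resourcemanager.hostname (suffixed with the first RM id when yarn.resourcemanager.ha.rm-ids is set), which is an assumption about how the cluster is configured. The hostnames and realm are placeholders.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of expanding _HOST in the RM principal for an HA setup.
// A plain Map stands in for Hadoop's Configuration object.
final class RmPrincipalSketch {
    // Key behind YarnConfiguration.RM_PRINCIPAL.
    static final String RM_PRINCIPAL = "yarn.resourcemanager.principal";
    static final String RM_HA_IDS = "yarn.resourcemanager.ha.rm-ids";
    static final String RM_HOSTNAME = "yarn.resourcemanager.hostname";
    static final String HOST_TOKEN = "_HOST";

    static String resolveRmPrincipal(Map<String, String> conf) {
        String principal = conf.get(RM_PRINCIPAL);
        if (principal == null) {
            return null;
        }
        // Expect the form service/_HOST@REALM; anything else is returned unchanged.
        String[] parts = principal.split("[/@]");
        if (parts.length != 3 || !parts[1].equals(HOST_TOKEN)) {
            return principal;
        }
        // Assumption: with HA enabled, use the hostname of the first listed RM id.
        String hostKey = RM_HOSTNAME;
        String rmIds = conf.get(RM_HA_IDS);
        if (rmIds != null && !rmIds.trim().isEmpty()) {
            hostKey = RM_HOSTNAME + "." + rmIds.split(",")[0].trim();
        }
        String host = conf.get(hostKey);
        if (host == null || host.isEmpty()) {
            throw new IllegalStateException("No RM hostname configured under " + hostKey);
        }
        // Host components of Kerberos principals are conventionally lowercase.
        return parts[0] + "/" + host.toLowerCase(Locale.ROOT) + "@" + parts[2];
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RM_PRINCIPAL, "rm/_HOST@EXAMPLE.COM");          // placeholder realm
        conf.put(RM_HA_IDS, "rm1,rm2");
        conf.put(RM_HOSTNAME + ".rm1", "rm1.example.com");       // placeholder hosts
        conf.put(RM_HOSTNAME + ".rm2", "rm2.example.com");
        System.out.println(resolveRmPrincipal(conf));
    }
}
```

In a real driver the resolved string would take the place of the raw conf.get(YarnConfiguration.RM_PRINCIPAL) value as the token renewer. On Hadoop 3, calling YarnClientUtils.getRmPrincipal directly is probably preferable to reimplementing the lookup.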