Skein cannot find the resource manager host

karlam123 commented 4 years ago

Hi!

I'm trying to run skein application submit hello_world.yaml found here https://jcrist.github.io/skein/quickstart.html.

I'm on HDP 3.1.4.0-315, python3.6 and skein 0.8.0. The following environment variables are set: HADOOP_HOME, HADOOP_CONF_DIR and HADOOP_HDFS_HOME.

Logs:

19/11/26 08:52:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/26 08:52:04 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/11/26 08:52:05 INFO client.AHSProxy: Connecting to Application History server at host.sss.com.company.com/ip_address:10200
19/11/26 08:52:05 INFO skein.Driver: Driver started, listening on 38539
19/11/26 08:52:06 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
19/11/26 08:52:06 INFO resource.ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
19/11/26 08:52:06 INFO hdfs.DFSClient: Created token for karl: HDFS_DELEGATION_TOKEN owner=KARL@COMPANY_HOST, renewer=yarn, realUser=, issueDate=1574758326364, maxDate=1575363126364, sequenceNumber=45051, masterKeyId=261 on ha-hdfs:ProdHadoop
19/11/26 08:52:06 INFO kms.KMSClientProvider: New token created: (Kind: kms-dt, Service: kms://https@host3.sss.com.company.com:port/kms, Ident: (kms-dt owner=KARL, renewer=yarn, realUser=, issueDate=1574758326766, maxDate=1575363126766, sequenceNumber=44570, masterKeyId=399))
19/11/26 08:52:06 INFO security.TokenCache: Got dt for hdfs://ProdHadoop; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ProdHadoop, Ident: (token for karl: HDFS_DELEGATION_TOKEN owner=KARL@COMPANY_HOST, renewer=yarn, realUser=, issueDate=1574758326364, maxDate=1575363126364, sequenceNumber=45051, masterKeyId=261)
19/11/26 08:52:06 INFO security.TokenCache: Got dt for hdfs://ProdHadoop; Kind: kms-dt, Service: kms://https@host1.sss.com.company.com;host2.sss.com.company.com;host3.sss.com.company.com:port/kms, Ident: (kms-dt owner=KARL, renewer=yarn, realUser=, issueDate=1574758326766, maxDate=1575363126766, sequenceNumber=44570, masterKeyId=399)
Error: Failed to submit application, exception:
org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to rm/_HOST@COMPANY_HOST
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.setRenewer(AbstractDelegationTokenIdentifier.java:113)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.<init>(AbstractDelegationTokenIdentifier.java:57)
        at org.apache.hadoop.yarn.security.client.YARNDelegationTokenIdentifier.<init>(YARNDelegationTokenIdentifier.java:42)
        at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier.<init>(RMDelegationTokenIdentifier.java:61)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1083)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:355)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:579)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
Caused by: org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to rm/_HOST@COMPANY_HOST
        at org.apache.hadoop.security.authentication.util.KerberosName.getShortName(KerberosName.java:401)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.setRenewer(AbstractDelegationTokenIdentifier.java:111)
        ... 14 more

It seems to me that _HOST in rm/_HOST@COMPANY_HOST is not being resolved. I can submit jobs with spark-submit, so I think the configuration on the cluster is OK.

Does anybody have an idea on what the problem could be or where I should look?

jcrist commented 4 years ago

Hmmm, this is a new one. I suspect your spark-submit configuration has some extra things specified that we're not picking up (spark and hadoop CLIs allow specifying configuration options in environment variables that aren't picked up by the corresponding java libraries, which is annoying). A few things that would help debug:

Does the yarn cli work? Try yarn application -list
Do you have any *_OPTS environment variables set (like HADOOP_OPTS)? The java libraries that skein is built on only load configuration from files, not environment variables.
Do you have any spark configuration files with spark.hadoop.* variables set? These override values found in the standard *-site.xml files, and may explain this discrepancy. See https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration.

There may be something else going on besides the above, but this is what I'd check first.
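
The second and third checks above can be scripted. The following is a minimal sketch, not part of skein: the function names find_opts_env_vars and find_spark_hadoop_overrides are hypothetical, and the parser assumes that spark-defaults.conf separates keys from values with whitespace or "=".

```python
import os
import re


def find_opts_env_vars(environ=None):
    """Return {name: value} for environment variables whose name ends in _OPTS."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.endswith("_OPTS")}


def find_spark_hadoop_overrides(conf_text):
    """Return {key: value} for spark.hadoop.* entries in spark-defaults.conf text."""
    overrides = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumption: key and value are separated by whitespace or '='.
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key.startswith("spark.hadoop."):
            overrides[key] = value
    return overrides


if __name__ == "__main__":
    print(find_opts_env_vars())
    # Path taken from later in this thread; adjust for your installation.
    path = "/usr/hdp/current/spark2-client/conf/spark-defaults.conf"
    if os.path.exists(path):
        with open(path) as f:
            print(find_spark_hadoop_overrides(f.read()))
```

Any spark.hadoop.* key it reports is a Hadoop setting that Spark applies on top of the *-site.xml files, so it is a candidate for the discrepancy, though not proof of it.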

karlam123 commented 4 years ago

Thanks for the help!

Yep, yarn application -list works.
No, I've also tried without any of those environment variables.

We have some defaults in /usr/hdp/current/spark2-client/conf/spark-defaults.conf:

spark.hadoop.hive.llap.daemon.service.hosts
spark.hadoop.hive.zookeeper.quorum
spark.yarn.access.hadoopFileSystems hdfs://ProdHadoop:8020
spark.yarn.am.extraJavaOptions -Dhdp.version=3.1.4.0-315
spark.yarn.historyServer.address

jcrist commented 4 years ago

Hmmm, ok. A few other places there might be things:

Do you have HADOOP_YARN_HOME set? If not, do things work out if you set it appropriately (should be something like /usr/lib/hadoop-yarn)?
Do you have JAVA_HOME set? Does it point to the correct java for yarn to use? If not, is which java the correct java?
Does yarn classpath include somewhere in there the directory with your hadoop configuration files (yarn-site.xml, core-site.xml, ...)?
You may have a file like yarn-env.sh and/or hadoop-env.sh somewhere (I'd use find to find this, it could be in a few places depending on your system). I'd look for anything that looks like it's setting a configuration/home dir (e.g. HADOOP_CONF_DIR/HADOOP_YARN_HOME, ...), java options (-D...), or kerberos related things (environment variables with KRB5 or kerberos in them).
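
The last check above (looking through yarn-env.sh or hadoop-env.sh) can be sketched as a small scan. This is only an illustration: scan_env_script and java_d_options are hypothetical helper names, and the patterns are a rough reading of "home/conf dirs, -D options, Kerberos-related things", not an exhaustive list.

```python
import re

# Case-sensitive: home/conf dir variables and -D Java options.
DIR_OR_JAVA_OPT = re.compile(r"_HOME|_CONF_DIR|(?<![\w-])-D\w")
# Case-insensitive: anything that mentions Kerberos.
KERBEROS = re.compile(r"krb5|kerberos", re.IGNORECASE)
# -Dname=value pairs; the value ends at whitespace or a quote.
D_OPTION = re.compile(r"(?<![\w-])-D([\w.]+)=([^\s\"']+)")


def scan_env_script(text):
    """Return (line_number, line) pairs for non-comment lines worth a closer look."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if DIR_OR_JAVA_OPT.search(stripped) or KERBEROS.search(stripped):
            hits.append((lineno, stripped))
    return hits


def java_d_options(text):
    """Return (name, value) pairs for -Dname=value Java system properties."""
    return D_OPTION.findall(text)


if __name__ == "__main__":
    import sys

    # Usage: python scan_env.py /path/to/yarn-env.sh
    with open(sys.argv[1]) as f:
        content = f.read()
    for lineno, line in scan_env_script(content):
        print(lineno, line)
    print(java_d_options(content))
```

The scan only reads the file as text. It does not evaluate the shell script, so values built from other variables (such as $HADOOP_OPTS) are reported unexpanded.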

jcrist commented 4 years ago

The following links might also be relevant, particularly the solution in the stackoverflow one:

It's not clear to me why yarn/spark would work when our code doesn't - we should be taking the same login path as those tools.

karlam123 commented 4 years ago

Thanks and sorry for the late reply!

HADOOP_YARN_HOME was not set. I set it to /usr/hdp/3.1.4.0-315/hadoop-yarn, the same value as in /usr/hdp/3.1.4.0-315/hadoop/conf/yarn-env.sh, with the same result.
JAVA_HOME was not set either. I set it to $(dirname $(dirname $(readlink -f $(which javac)))), which is the same as in yarn-env.sh, and still got the same error.
yarn classpath includes the directory with the hadoop configuration files (/usr/hdp/3.1.4.0-315/hadoop/conf).

There are some java options in yarn-env.sh:

export YARN_RESOURCEMANAGER_OPTS="-Djava.security.auth.login.config=/etc/hadoop/3.1.4.0-315/0/yarn_jaas.conf"
export YARN_TIMELINESERVER_OPTS="-Djava.security.auth.login.config=/etc/hadoop/3.1.4.0-315/0/yarn_ats_jaas.conf"
export YARN_TIMELINEREADER_OPTS="-Djava.security.auth.login.config=/etc/hadoop/3.1.4.0-315/0/yarn_ats_jaas.conf"
export YARN_REGISTRYDNS_OPTS="-Djava.security.auth.login.config=/etc/hadoop/3.1.4.0-315/0/yarn_registry_dns_jaas.conf"
export YARN_NODEMANAGER_OPTS="-Djava.security.auth.login.config=/etc/hadoop/3.1.4.0-315/0/yarn_nm_jaas.conf -Dsun.security.krb5.rcache=none"
HADOOP_OPTS="$HADOOP_OPTS -Djavax.security.auth.useSubjectCredsOnly=false"
YARN_RESOURCEMANAGER_OPTS="-Dzookeeper.sasl.client=true -Dzookeeper.sasl.client.username=zookeeper -Djava.security.auth.login.config=/etc/hadoop/3.1.4.0-315/0/yarn_jaas.conf -Dzookeeper.sasl.clientconfig=Client $YARN_RESOURCEMANAGER_OPTS"

Thank you for the links, I'll look into them.

jcrist commented 4 years ago

Try setting SKEIN_DRIVER_JAVA_OPTIONS as follows:

export SKEIN_DRIVER_JAVA_OPTIONS="$HADOOP_OPTS -Djavax.security.auth.useSubjectCredsOnly=false"

The javax.security.auth.useSubjectCredsOnly property deals with kerberos authentication and may lead to the failure we're seeing here.

karlam123 commented 4 years ago

Still the same error unfortunately.

jcrist commented 4 years ago

When you tried my example above, did you already have a global skein driver running (had you run skein driver start previously)? The java options are only loaded on driver startup. To be clear, the test above would have been:

$ skein driver stop  # ensure there's no global driver
$ export SKEIN_DRIVER_JAVA_OPTIONS="$HADOOP_OPTS -Djavax.security.auth.useSubjectCredsOnly=false"
$ skein application list

If that's what you did and it didn't work, I'm at a loss here. It's likely a difference between the Java libraries we use (which read from configuration files) and the CLI tools you've successfully gotten working (which pick up additional options from environment variables, shell files, etc...). I'm not sure what else to check. If you happen to have a way to reproduce this in a failing environment (e.g. a docker image) this would make it easier for me to debug locally, but as is I'm not sure how else to help (sorry).

karlam123 commented 4 years ago

First of all, thanks for the help!

Yes, that is what I did. Currently, I don't have a way to reproduce this in a docker image, so I think I will try to familiarize myself with your java code and see if I can't figure out why the host is not picked up for our particular environment.

karlam123 commented 4 years ago

Small update: in Driver.java, when I replaced String tokenRenewer = conf.get(YarnConfiguration.RM_PRINCIPAL); with a hardcoded value found in core-site.xml, I could submit applications.

Then I showed this to a colleague who has a lot of experience with kerberos, and he said that, with some exceptions, any valid user can be used as the renewer instead.

So I put String tokenRenewer = ugi.getUserName(); and that also worked.

jcrist commented 4 years ago

Interesting. What property did you take the hardcoded value from? Do you remember what the exceptions were?

I'm a bit hesitant to change our code here: our current implementation matches what the YARN docs recommend and what other projects do. I'd prefer (if possible) to find a way that works for you to get the RM principal as a renewer. That said, there have been plenty of times where the yarn docs have been flat out wrong, and we have many hacks around yarn bugs, so if your solution works we could use that too.

karlam123 commented 4 years ago

Hi!

I think I have found a way to get the RM principal as a renewer. The underlying issue seems to be that we have an HA setup for the RM, and conf.get(YarnConfiguration.RM_PRINCIPAL); doesn't do the _HOST replacement. In Hadoop 3, they have something that can do it: http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-client/apidocs/org/apache/hadoop/yarn/client/util/YarnClientUtils.html

So I copy-pasted the code for public static String getRmPrincipal(String rmPrincipal, Configuration conf) found here: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/util/YarnClientUtils.java and then it also worked. It seems to be how they solved it here as well: https://github.com/linkedin/TonY/blob/master/tony-core/src/main/java/com/linkedin/tony/TonyClient.java.
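
To illustrate the idea behind the _HOST replacement, here is a simplified Python sketch. It is not the Hadoop implementation and not skein's code: resolve_rm_principal and rm_hostname are hypothetical names, the configuration is modelled as a plain dict of yarn-site.xml properties, and the HA handling (use yarn.resourcemanager.ha.id if set, otherwise the first entry of yarn.resourcemanager.ha.rm-ids) is an approximation of what the Hadoop helper appears to do. The hostnames in the usage example are made up.

```python
import re


def rm_hostname(conf):
    """Pick a ResourceManager hostname from a dict of yarn-site.xml properties."""
    suffix = ""
    if conf.get("yarn.resourcemanager.ha.enabled", "false").lower() == "true":
        rm_id = conf.get("yarn.resourcemanager.ha.id")
        if not rm_id:
            raw_ids = conf.get("yarn.resourcemanager.ha.rm-ids", "")
            ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
            if not ids:
                raise ValueError("HA is enabled but no RM ids are configured")
            rm_id = ids[0]  # assumption: any configured RM id will do
        suffix = "." + rm_id
    address = conf.get("yarn.resourcemanager.address" + suffix)
    if address:
        return address.rsplit(":", 1)[0]  # drop the port, if any
    hostname = conf.get("yarn.resourcemanager.hostname" + suffix)
    if hostname:
        return hostname
    raise ValueError("no ResourceManager address or hostname configured")


def resolve_rm_principal(principal, conf):
    """Replace _HOST in a principal of the form name/_HOST@REALM."""
    if principal is None:
        raise ValueError("RM principal is not set")
    components = re.split(r"[/@]", principal)
    if len(components) != 3 or components[1] != "_HOST":
        return principal  # nothing to substitute
    host = rm_hostname(conf).lower()
    return "{}/{}@{}".format(components[0], host, components[2])


if __name__ == "__main__":
    # Hypothetical HA configuration for illustration only.
    conf = {
        "yarn.resourcemanager.ha.enabled": "true",
        "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
        "yarn.resourcemanager.hostname.rm1": "rm1.example.com",
        "yarn.resourcemanager.hostname.rm2": "rm2.example.com",
    }
    print(resolve_rm_principal("rm/_HOST@EXAMPLE.COM", conf))
```

The point of the sketch is that the raw property value still contains the literal _HOST placeholder, so something has to substitute a concrete RM hostname before the principal is used as the token renewer. In an HA setup, that hostname lives under a per-RM-id key rather than the plain yarn.resourcemanager.address key.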

jcrist / skein

Skein cannot find the resource manager host #196