exalate-issue-sync[bot] opened 1 year ago
Tom Kraljevic commented: This is the relevant file.
Ruslan Dautkhanov commented: Thank you Tom.
This might be relevant; here is how it is implemented in Spark: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html
"For long-running applications .. pass the Spark principal and keytab to the spark-submit script using the --principal and --keytab parameters respectively. This keytab will be copied to the host running the Application Master, and the Kerberos login will be renewed periodically by using this principal and keytab to generate the required delegation tokens for communication with HDFS"
Ruslan Dautkhanov commented: I thought that this problem shouldn’t exist in Sparkling Water if you guys have a way to pass --principal and --keytab parameters to spark-submit.
Please read “Configuring Spark on YARN for Long-running Applications” - http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html
This keytab will be copied to the host running the Application Master, and the Kerberos login will be renewed periodically by using this principal and keytab to generate the required delegation tokens for communication with HDFS
So Spark itself solves this problem for us if we pass --principal and --keytab parameters. I don't see a way currently to pass these parameters in https://github.com/h2oai/sparkling-water/blob/master/bin/run-sparkling.sh
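For reference, the mechanism described in that doc boils down to two extra spark-submit options. A minimal sketch (the principal and keytab path are placeholders; the driver class and assembly jar are the same ones used elsewhere in this thread):
{code}
# --principal / --keytab make Spark copy the keytab to the Application Master host
# and periodically regenerate HDFS delegation tokens from it (placeholder values).
$SPARK_HOME/bin/spark-submit \
  --master yarn-client \
  --principal svc_user@DOMAIN.COM \
  --keytab /path/to/svc_user.keytab \
  --class water.SparklingWaterDriver \
  sparkling-water-assembly-nnn-all.jar
{code}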
Tom Kraljevic commented: Here is how I start sparkling water on yarn directly with spark-submit. The HDP version, of course, is the version for my test cluster.
SOFTWARE SETUP
tar zxvf /home/tomk/spark-1.5.1-bin-hadoop2.6
ENV VAR SETUP
export SPARK_HOME=/home/tomk/spark-1.5.1-bin-hadoop2.6
export MASTER='yarn-client'
export HADOOP_CONF_DIR=/etc/hadoop/conf
CONFIG FILES
$SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraJavaOptions -Dhdp.version=2.2.6.3-1
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.6.3-1
spark.executor.extraJavaOptions -Dhdp.version=2.2.6.3-1 -XX:+PrintGCDetails
RUNNING SPARKLING WATER
$SPARK_HOME/bin/spark-submit --driver-memory 10g --class water.SparklingWaterDriver --executor-memory 10g --num-executors 5 sparkling-water-assembly-nnn-all.jar
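To double-check that the cluster actually came up on YARN, something like the following should work (a sketch; assumes the yarn CLI is on the PATH and HADOOP_CONF_DIR is set as above):
{code}
# List running YARN applications; the Sparkling Water driver should show up here.
yarn application -list -appStates RUNNING
{code}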
Ruslan Dautkhanov commented: Thank you Tom. I will give it a try. I would still keep this JIRA open though, as Flow is definitely affected, and you may want to implement something similar to Spark's automatic ticket renewal, as explained in http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html
Ruslan Dautkhanov commented: Tom,
Looks like it works in Sparkling Water as is:
{code}
export MASTER="yarn-cluster"
..
$SPARK_HOME/bin/spark-submit "$@" $VERBOSE --master $MASTER \
  --driver-memory $DRIVER_MEMORY --executor-memory $EXECUTOR_MEMORY --executor-cores $EXECUTOR_CORES \
  --principal user@DOMAIN.COM --keytab /home/user/.some.keytab \
  --conf spark.driver.extraJavaOptions="$EXTRA_DRIVER_PROPS -XX:MaxPermSize=384m" \
  --class "$DRIVER_CLASS" $FAT_JAR_FILE
{code}
After running ./run-sparkling.sh:
{noformat}
16/02/16 23:53:53 INFO UserGroupInformation: Login successful for user user@DOMAIN.COM using keytab file /home/user/.some.keytab
16/02/16 23:53:53 INFO Client: Attempting to login to the Kerberos using principal: user@DOMAIN.COM and keytab: /home/user/.some.keytab
...
16/02/16 23:53:54 INFO Client: Renewal Interval set to 86400036
{noformat}
So it was able to use that keytab and set up renewal automatically.
What I am not getting from the AM / driver is the URL to open Sparkling Water. It is stuck in a loop of showing:
{noformat}
...
16/02/16 23:59:29 INFO Client: Application report for application_1453687912411_0025 (state: RUNNING)
16/02/16 23:59:33 INFO Client: Application report for application_1453687912411_0025 (state: RUNNING)
16/02/16 23:59:34 INFO Client: Application report for application_1453687912411_0025 (state: RUNNING)
16/02/16 23:59:35 INFO Client: Application report for application_1453687912411_0025 (state: RUNNING)
...
{noformat}
I wonder if that's because of Spark Dynamic Allocation, which is now enabled by default in CDH 5.5 / Spark 1.5? Thanks!
Tom Kraljevic commented: What happens if you use yarn-client mode instead of yarn-cluster mode?
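As a sketch of what that would look like with the run-sparkling.sh invocation above (in yarn-client mode the Spark driver stays local, so the H2O banner with the Flow URL should appear in the console; the yarn logs alternative assumes YARN log aggregation is enabled):
{code}
# Sketch: switch to yarn-client so the driver (and the H2O Flow URL it prints)
# runs locally instead of inside the YARN Application Master.
export MASTER='yarn-client'
./run-sparkling.sh

# Alternatively, in yarn-cluster mode the driver output ends up in the AM container
# log, which can be pulled with something like:
#   yarn logs -applicationId application_1453687912411_0025 | grep -i flow
{code}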
Ruslan Dautkhanov commented: Spark / Sparkling Water periodically shows this in the logs:
{quote}
16/02/28 23:01:30 INFO ExecutorDelegationTokenUpdater: Tokens updated from credentials file.
16/02/28 23:01:30 INFO ExecutorDelegationTokenUpdater: Scheduling token refresh from HDFS in 64800041 millis.
16/02/29 17:01:30 INFO ExecutorDelegationTokenUpdater: Reading new delegation tokens from hdfs://some.domain.com:8020/user/svc_h2oqa/.sparkStaging/application_1455921253797_0004/credentials-7ada43ce-717a-4780-8695-444a027a8f07-12
{quote}
when --principal user@DOMAIN.COM --keytab /home/user/.some.keytab were provided.
It has already been running for more than a week. H2O never ran that long on secure clusters (it can't access HDFS after the original Kerberos ticket expires, and renewing it locally on the servers where H2O was started doesn't help). The way it works in Spark: when Spark starts, it uploads the supplied keytab to hdfs://some.domain.com:8020/user/svc_h2oqa/.sparkStaging/.. and then the executor +on each node+ renews Kerberos tickets several times a day.
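The staging mechanism from the log above can be inspected directly on HDFS (a sketch; the path comes from the log and is specific to that application id):
{code}
# The uploaded keytab and the periodically refreshed credentials-* files live in
# the application's .sparkStaging directory; executors poll this location for new tokens.
hdfs dfs -ls hdfs://some.domain.com:8020/user/svc_h2oqa/.sparkStaging/application_1455921253797_0004/
{code}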
Tom Kraljevic commented: Note: I ran into a person yesterday who described having this problem as well, and he said they were able to solve it. He gave the following advice: use 'kinit' with the renewable parameter.
Ruslan Dautkhanov commented: There are forwardable Kerberos tickets (kinit -f). There are proxiable Kerberos tickets (kinit -p). Which one was he talking about?
Our tickets are (R)enewable and (F)orwardable by default:
{quote}
$ klist -f
Ticket cache: FILE:/tmp/krb5cc_2074918846
Default principal: rdautkhanov@some.domain

Valid starting       Expires              Service principal
03/31/16 16:40:25    04/01/16 02:40:28    krbtgt/some.domain@some.domain
        renew until 04/07/16 16:40:25, Flags: FRIA
{quote}
Notice "FRIA" flags. It does not help. Anyway, after we switched to Sparkling Water this problem went away as Spark has an internal mechanism to renew tickets (see above), but not H2O itself.
Tom Kraljevic commented: Hi, I asked about the specifics and this was the response from the user who solved this:
At Strata you were asking how we resolved the expired HDFS_DELEGATION_TOKEN issue with kinit since we have long-running h2o clusters on our hadoop cluster. Here’s what our guys do when they launch the h2o clusters for us.
kinit -kt keytab-file svch2o -r7d # here svch2o is the user account we’re using to launch the h2o processes
[ ... then run startup script with hadoop jar command ... ]
So, apparently there is nothing special going on other than requesting a renewable Kerberos ticket for the user before executing the startup script.
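In other words, the advice amounts to roughly the following before launching H2O (a sketch; the keytab path, principal, and renewable lifetime are placeholders, and the lifetime is still capped by the KDC):
{code}
# Obtain a renewable ticket for the service account, then start H2O as usual.
kinit -r 7d -kt /path/to/svch2o.keytab svch2o
klist -f    # an 'R' in the flags confirms the ticket is renewable
# ... then run the startup script with the hadoop jar command ...
{code}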
Tom Kraljevic commented: It would be helpful to know if, for example, '-r365d' solves this issue in your environment. If so, I'll add that to the docs.
Ruslan Dautkhanov commented: The KDC / Active Directory limits the maximum ticket lifetime. You can't do that.
Look at what Spark does: https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L184 they renew the ticket themselves periodically. That's the only fix.
And again, yes, we have renewable tickets, but this does not help.
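The cap is easy to see (a sketch; values are illustrative): ask for a very long renewable lifetime and compare it with what the KDC actually grants.
{code}
# The KDC / AD clamps the renewable lifetime to its configured maximum,
# regardless of what is requested on the command line.
kinit -r 365d user@DOMAIN.COM
klist -f    # the 'renew until' timestamp shows the (clamped) lifetime that was granted
{code}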
Tom Kraljevic commented: Since h2o just uses the standard MapReduce ApplicationMaster, I can't really think of what else there would be to change... It would be interesting to pose this question to one of the Hadoop distro vendors, then...
Ruslan Dautkhanov commented: I know this is an older ticket, but there is an issue with renewing Kerberos tickets.
It is fixed in Spark 3.0 (not yet released): https://issues.apache.org/jira/browse/SPARK-25689
Also, we haven't run into this problem for a while when using the H2O external cluster.
I think this ticket can be closed.
JIRA Issue Migration Info
Jira Issue: PUBDEV-2533
Assignee: New H2O Bugs
Reporter: Ruslan Dautkhanov
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
In Kerberized clusters, a non-expired ticket is required to access HDFS. This ticket has a limited lifetime. After the ticket is renewed (it has to be renewed before it expires), the H2O service has problems accessing HDFS:
{quote}
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: Caught exception: HDFS IO Failure:
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: accessed URI : hdfs://dcoabhdfs01.some.domain:8020/user/svc_h2oqa/h2oflows/environment/clips
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: configuration: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, /var/run/cloudera-scm-agent/process/1696-yarn-NODEMANAGER/core-site.xml
01-06 09:28:12.728 10.20.37.31:53011 30029 #s/exists ERRR: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 529 for svc_h2oqa) can't be found in cache; Stacktrace: [
  water.persist.PersistHdfs.exists(PersistHdfs.java:438),
  water.persist.PersistManager.exists(PersistManager.java:317),
  water.init.NodePersistentStorage.exists(NodePersistentStorage.java:94),
  water.api.NodePersistentStorageHandler.exists(NodePersistentStorageHandler.java:21),
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method),
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57),
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43),
  java.lang.reflect.Method.invoke(Method.java:606),
  water.api.Handler.handle(Handler.java:64),
  water.api.RequestServer.handle(RequestServer.java:644),
  water.api.RequestServer.serve(RequestServer.java:585),
  water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:617),
  water.JettyHTTPD$H2oDefaultServlet.doGet(JettyHTTPD.java:559),
  javax.servlet.http.HttpServlet.service(HttpServlet.java:707),
  javax.servlet.http.HttpServlet.service(HttpServlet.java:820),
  org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)]
{quote}
klist shows the new ticket as refreshed and not expired. Does the H2O service have to somehow renew its token after the Kerberos ticket is renewed? See the error above around "token (HDFS_DELEGATION_TOKEN token 529 for svc_h2oqa) can't be found in cache".