h2oai / sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster
https://docs.h2o.ai/sparkling-water/3.3/latest-stable/doc/index.html
Apache License 2.0

Not able to start external backend on YARN : java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory #1759

Closed · BhushG closed this issue 4 years ago

BhushG commented 4 years ago

Here are the logs:

20/01/31 14:06:10 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
20/01/31 14:06:10 INFO h2o.H2OContext$H2OContextClientBased: Sparkling Water version: 3.28.0.1-1-2.4
20/01/31 14:06:10 INFO h2o.H2OContext$H2OContextClientBased: Spark version: 2.4.4
20/01/31 14:06:10 INFO h2o.H2OContext$H2OContextClientBased: Integrated H2O version: 3.28.0.1
20/01/31 14:06:10 INFO h2o.H2OContext$H2OContextClientBased: The following Spark configuration is used: 
    (spark.ext.h2o.external.cluster.size,2)
    (spark.driver.host,project-master)
    (spark.sql.shuffle.partitions,4)
    (spark.submit.deployMode,cluster)
    (spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS,project-master)
    (spark.ext.h2o.external.h2o.driver,/home/project/sparkling-water-3.28.0.1-1-2.4/h2odriver-sw3.28.0-hdp2.6-extended.jar)
    (spark.ext.h2o.cluster.info.name,notify_H2O_via_SparklingWater_application_1580479162776_0001)
    (spark.app.name,clone3)
    (spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES,http://project-master:8088/proxy/application_1580479162776_0001)
    (spark.executor.id,driver)
    (spark.yarn.dist.files,file:///home/project/dist-0.0.1/project-distribution/application.yml)
    (spark.ext.h2o.hadoop.memory,2G)
    (spark.ext.h2o.cloud.name,H2O_via_SparklingWater_application_1580479162776_0001)
    (spark.yarn.app.container.log.dir,/home/project/usr/local/hadoop/logs/userlogs/application_1580479162776_0001/container_1580479162776_0001_01_000001)
    (spark.master,yarn)
    (spark.ui.port,0)
    (spark.app.id,application_1580479162776_0001)
    (spark.ext.h2o.client.log.dir,logs/H2Ologs)
    (spark.driver.port,38363)
    (spark.locality.wait,30000)
    (spark.executorEnv.JAVA_HOME,/usr/lib/jvm/java-8-openjdk-amd64)
    (spark.ext.h2o.external.start.mode,auto)
    (spark.jars,)
    (spark.ui.filters,org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter)
    (spark.ext.h2o.external.yarn.queue,default)
    (spark.ext.h2o.backend.cluster.mode,external)
    (spark.yarn.app.id,application_1580479162776_0001)
20/01/31 14:06:10 INFO external.ExternalH2OBackend: Starting the external H2O cluster on YARN.
20/01/31 14:06:10 INFO external.ExternalH2OBackend: Command used to start H2O on yarn: hadoop jar /home/project/sparkling-water-3.28.0.1-1-2.4/h2odriver-sw3.28.0-hdp2.6-extended.jar -Dmapreduce.job.queuename=default -Dmapreduce.job.tags=H2O/Sparkling-Water,Sparkling-Water/Spark/application_1580479162776_0001 -Dai.h2o.args.config=sparkling-water-external -nodes 2 -notify notify_H2O_via_SparklingWater_application_1580479162776_0001 -jobname H2O_via_SparklingWater_application_1580479162776_0001 -mapperXmx 2G -nthreads -1 -J -log_level -J INFO -port_offset 1 -baseport 54321 -timeout 120 -disown -sw_ext_backend -J -rest_api_ping_timeout -J 60000 -J -client_disconnect_timeout -J 60000 -extramempercent 10
20/01/31 14:06:10 ERROR job.projectJobDriver$: Job failed in cluster mode with clone3
java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
    at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:100)
    at scala.sys.process.ProcessBuilderImpl$AbstractBuilder$$anonfun$runBuffered$1.apply(ProcessBuilderImpl.scala:148)
    at scala.sys.process.ProcessBuilderImpl$AbstractBuilder$$anonfun$runBuffered$1.apply(ProcessBuilderImpl.scala:148)
    at scala.sys.process.ProcessLogger$$anon$1.buffer(ProcessLogger.scala:99)
    at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.runBuffered(ProcessBuilderImpl.scala:148)
    at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:114)
    at org.apache.spark.h2o.backends.external.ExternalBackendUtils$class.launchShellCommand(ExternalBackendUtils.scala:111)
    at org.apache.spark.h2o.backends.external.ExternalH2OBackend$.launchShellCommand(ExternalH2OBackend.scala:233)
    at org.apache.spark.h2o.backends.external.ExternalH2OBackend.launchExternalH2OOnYarn(ExternalH2OBackend.scala:104)
    at org.apache.spark.h2o.backends.external.ExternalH2OBackend.init(ExternalH2OBackend.scala:46)
    at org.apache.spark.h2o.H2OContext$H2OContextClientBased.initBackend(H2OContext.scala:448)
    at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:150)
    at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:606)
BhushG commented 4 years ago

The code throws the exception in this function in class ExternalBackendUtils:

protected[backends] def launchShellCommand(cmdToLaunch: Seq[String]): Int = {
  import scala.sys.process._
  val processOut = new StringBuffer()
  val processErr = new StringBuffer()

  val proc = cmdToLaunch.mkString(" ").!(ProcessLogger(
    { msg =>
      processOut.append(msg + "\n")
      println(msg)
    }, {
      errMsg =>
        processErr.append(errMsg + "\n")
        println(errMsg)
    }))

  logInfo(processOut.toString)
  logError(processErr.toString)
  proc
}

The exception is thrown on the line: val proc = cmdToLaunch.mkString(" ").!(ProcessLogger(
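
For context, scala.sys.process resolves a bare executable name against the PATH of the JVM process that spawns it, not against any login shell configuration. A minimal standalone sketch of the same failing pattern (assuming nothing beyond the Scala standard library):

import scala.sys.process._

// "hadoop" is resolved against the PATH of this JVM process. If the process
// (e.g. a YARN container) has no hadoop on its PATH, this throws
// java.io.IOException: Cannot run program "hadoop": error=2.
val exitCode = Seq("hadoop", "version").mkString(" ").!
println(s"hadoop exited with code $exitCode")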

BhushG commented 4 years ago

This command is passed to launchShellCommand:

hadoop jar /home/project/sparkling-water-3.28.0.1-1-2.4/h2odriver-sw3.28.0-hdp2.6-extended.jar -Dmapreduce.job.queuename=default -Dmapreduce.job.tags=H2O/Sparkling-Water,Sparkling-Water/Spark/application_1580446727168_0013 -Dai.h2o.args.config=sparkling-water-external -nodes 2 -notify notify_H2O_via_SparklingWater_application_1580446727168_0013 -jobname H2O_via_SparklingWater_application_1580446727168_0013 -mapperXmx 2G -nthreads -1 -J -log_level -J INFO -port_offset 1 -baseport 54321 -timeout 120 -disown -sw_ext_backend -J -rest_api_ping_timeout -J 60000 -J -client_disconnect_timeout -J 60000 -extramempercent 10

mn-mikke commented 4 years ago

Hi @BhushG, what's the hadoop distribution you're trying to run Sparkling Water on?

BhushG commented 4 years ago

I'm running Sparkling Water on Hadoop YARN, HDP 2.7. If I run the exact cluster-start command shown above manually, the H2O external cluster starts successfully. But H2O external cluster initialization fails when it is launched from the code.

BhushG commented 4 years ago

External cluster initialization code:

val newConf = new H2OConf(spark)
  .setExternalClusterMode()
  .useAutoClusterStart()
  .setClusterSize(2)
  .setMapperXmx("2G")
  .set(H2OPathDir, "logs/H2Ologs")
  .setYARNQueue("default")

log.info("Creating H2OContext in H2OModelSelection..")
val h2OContext = H2OContext.getOrCreate(spark, newConf)
log.info("H2OContext created successfully.." + h2OContext.getH2ONodes().toString)

The spark.ext.h2o.external.h2o.driver parameter points to the H2O extended driver jar file.
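
For reference, the driver jar can also be set through the H2OConf API rather than via the raw spark.ext.h2o.external.h2o.driver property; setH2ODriverPath is the same setter used from the Sparkling Shell later in this thread:

val newConf = new H2OConf(spark)
  .setExternalClusterMode()
  .useAutoClusterStart()
  // Equivalent to setting spark.ext.h2o.external.h2o.driver directly:
  .setH2ODriverPath("/home/project/sparkling-water-3.28.0.1-1-2.4/h2odriver-sw3.28.0-hdp2.6-extended.jar")
  .setClusterSize(2)
  .setMapperXmx("2G")
  .setYARNQueue("default")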

BhushG commented 4 years ago

@mn-mikke

If I initiate the H2O external cluster from the Sparkling Shell, it also works fine:

scala> val conf = new H2OConf(spark).setExternalClusterMode().useAutoClusterStart().setH2ODriverPath(extendedJar).setClusterSize(2).setMapperXmx("2G").setYARNQueue("default");
20/02/01 05:13:17 WARN h2o.H2OConf: Using external cluster mode!
conf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
  backend cluster mode : external
  cluster start mode   : auto
  cloudName            : Not set yet
  cloud representative : Not set, using cloud name only
  clientBasePort       : 54321
  h2oClientLog         : INFO
  nthreads             : -1

scala> val hc = H2OContext.getOrCreate(spark, conf);
20/02/01 05:13:31 WARN external.ExternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
20/02/01 05:13:31 WARN external.ExternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 172.17.0.2]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.17.0.2:43863
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms2G -Xmx2G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     2252
Hive driver not present, not generating token.
20/02/01 05:13:43 INFO client.RMProxy: Connecting to ResourceManager at project-master/172.17.0.2:8032
20/02/01 05:13:48 INFO mapreduce.JobSubmitter: number of splits:2
20/02/01 05:13:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1580479162776_0008
20/02/01 05:13:49 INFO impl.YarnClientImpl: Submitted application application_1580479162776_0008
20/02/01 05:13:49 INFO mapreduce.Job: The url to track the job: http://project-master:8088/proxy/application_1580479162776_0008/
Job name 'H2O_via_SparklingWater_local-1580533150378' submitted
JobTracker job ID is 'job_1580479162776_0008'
For YARN users, logs command is 'yarn logs -applicationId application_1580479162776_0008'
Waiting for H2O cluster to come up...
H2O node 172.17.0.2:54323 is joining the cluster
H2O node 172.17.0.2:54321 is joining the cluster
Sending flatfiles to nodes...
    [Sending flatfile to node 172.17.0.2:54323]
    [Sending flatfile to node 172.17.0.2:54321]
H2O node 172.17.0.2:54323 reports H2O cluster size 1 [leader is 172.17.0.2:172.17.0.2]
H2O node 172.17.0.2:54321 reports H2O cluster size 1 [leader is 172.17.0.2:172.17.0.2]
H2O node 172.17.0.2:54321 reports H2O cluster size 2 [leader is 172.17.0.2:172.17.0.2]
H2O node 172.17.0.2:54323 reports H2O cluster size 2 [leader is 172.17.0.2:172.17.0.2]
Cluster notification file (notify_H2O_via_SparklingWater_local-1580533150378) created.
H2O cluster (2 nodes) is up
Open H2O Flow in your web browser: http://172.17.0.2:54321
Disowning cluster and exiting.

For YARN users, logs command is 'yarn logs -applicationId application_1580479162776_0008'

20/02/01 05:14:10 ERROR external.ExternalH2OBackend: 20/02/01 05:13:43 INFO client.RMProxy: Connecting to ResourceManager at project-master/172.17.0.2:8032
20/02/01 05:13:48 INFO mapreduce.JobSubmitter: number of splits:2
20/02/01 05:13:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1580479162776_0008
20/02/01 05:13:49 INFO impl.YarnClientImpl: Submitted application application_1580479162776_0008
20/02/01 05:13:49 INFO mapreduce.Job: The url to track the job: http://project-master:8088/proxy/application_1580479162776_0008/

hc: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.1-1-2.4
 * H2O name: H2O_via_SparklingWater_local-1580533150378
 * cluster size: 2
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,172.17.0.2,54321)
  (1,172.17.0.2,54323)
  ------------------------

  Open H2O Flow in browser: http://172.17.0.2:54325 (CMD + click in Mac OSX)

 * Yarn App ID of external H2O cluster: application_1580479162776_0008
BhushG commented 4 years ago

@jakubhava @mn-mikke Hi. I'm not able to run H2O ML programs on either the internal or the external backend. I'm stuck. Please let me know if you can help me with these exceptions.

BhushG commented 4 years ago

I found the reason for this exception. I wrote a simple program to run a Hadoop command on the YARN cluster:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class CommandTest {
    public static void main(String[] args) throws Exception {
        // args[0] = Spark master (e.g. "yarn"), args[1] = shell command to run
        SparkConf sparkConf = new SparkConf().setAppName("SimpleCommandTest").setMaster(args[0]);
        SparkSession.builder().config(sparkConf).getOrCreate();
        System.out.println("Bhushan: Simple program for testing");
        System.out.println("\n\nBhushan: Running simple hadoop command");
        // The executable name is resolved against the PATH of the process
        // this driver runs in (a YARN container in cluster deploy mode).
        final Process p = Runtime.getRuntime().exec(args[1]);

        // Pump the child's stdout to our own stdout.
        new Thread(new Runnable() {
            public void run() {
                BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line = null;

                try {
                    while ((line = input.readLine()) != null)
                        System.out.println(line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();

        p.waitFor();
    }
}

Then I submitted this job to the YARN cluster using the command:

spark-submit --class CommandTest --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.HADOOP_HOME=/home/project/usr/local/hadoop/bin SimpleTest-1.0-SNAPSHOT.jar "yarn" "hadoop fs -ls /"

It then throws this exception:

java.io.IOException: Cannot run program "hadoop": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at java.lang.Runtime.exec(Runtime.java:620)
    at java.lang.Runtime.exec(Runtime.java:450)
    at java.lang.Runtime.exec(Runtime.java:347)

Hadoop is not on the PATH, so the command cannot be executed. Any solution for this?
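
Worth noting: spark.yarn.appMasterEnv.HADOOP_HOME only sets the HADOOP_HOME environment variable inside the application master container; Runtime.exec still resolves the bare name "hadoop" against the container's PATH. A small diagnostic sketch (plain Scala, nothing Sparkling-Water-specific) that could be dropped into the driver to show what the container actually sees:

// Print the environment this driver process actually runs with, so it can
// be compared against the interactive shell on the submitting machine.
println("PATH        = " + sys.env.getOrElse("PATH", "<not set>"))
println("HADOOP_HOME = " + sys.env.getOrElse("HADOOP_HOME", "<not set>"))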

BhushG commented 4 years ago

This is what is happening there: https://stackoverflow.com/questions/51379619/running-hadoop-command-from-java-file

BhushG commented 4 years ago

If I give the full path to hadoop, my CommandTest code runs successfully on YARN, e.g.:

spark-submit --class CommandTest --master yarn --deploy-mode cluster SimpleTest-1.0-SNAPSHOT.jar "yarn" "/home/bhushan/usr/local/hadoop/bin/hadoop dfs -ls /"

But I can't provide the full path to hadoop in the Sparkling Water code. How can I give it the full path to hadoop?

jakubhava commented 4 years ago

@BhushG thanks for the investigation! Right now Sparkling Water expects hadoop to be on the PATH, so please add it there.

Adding an option to specify the full hadoop path is a good idea; we might add that - https://0xdata.atlassian.net/browse/SW-1866

BhushG commented 4 years ago

@jakubhava hadoop is already on the PATH; that's why I can execute hadoop commands from anywhere without specifying the full path. But when I execute a Spark job that runs a command such as "hadoop fs -ls /", it is not able to find hadoop. In that case, running the command through a Spark job, I have to specify the full path, e.g. "/home/bhushan/usr/local/hadoop/bin/hadoop dfs -ls /"; only then is it able to execute the command.

BhushG commented 4 years ago

Even though Hadoop is on the PATH, the Spark job is not able to detect it.
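
The CommandTest program above can confirm this directly: pass it a command that prints the container's environment and compare the output with the interactive shell (this assumes printenv is available on the cluster nodes):

spark-submit --class CommandTest --master yarn --deploy-mode cluster SimpleTest-1.0-SNAPSHOT.jar "yarn" "printenv PATH"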

jakubhava commented 4 years ago

In that case it seems like a Spark issue rather than a Sparkling Water one, what do you think? To help with this, though, I have added a configuration option to the Sparkling Water code for specifying the full path to hadoop: https://github.com/h2oai/sparkling-water/pull/1766

BhushG commented 4 years ago

@jakubhava

I checked out your branch SW-1866 in the IntelliJ IDE. I tried to build it using the command ./gradlew dist as mentioned here: http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/devel/build.html. But the build failed.


> Task :sparkling-water-extensions:compileScala
Pruning sources from previous analysis, due to incompatible CompileSetup.
/Users/project/Bhushan/Projects/SparklingWaterH2OFull/sparkling-water/extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/SparklingWaterServletProvider.scala:23: object ServletMeta is not a member of package water.server
import water.server.{ServletMeta, ServletProvider}
       ^
/Users/project/Bhushan/Projects/SparklingWaterH2OFull/sparkling-water/extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/SparklingWaterServletProvider.scala:25: not found: type ServletProvider
class SparklingWaterServletProvider extends ServletProvider {
                                            ^
/Users/project/Bhushan/Projects/SparklingWaterH2OFull/sparkling-water/extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/SparklingWaterServletProvider.scala:31: not found: type ServletMeta
  override def servlets(): util.List[ServletMeta] = {
                                     ^
/Users/project/Bhushan/Projects/SparklingWaterH2OFull/sparkling-water/extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/SparklingWaterServletProvider.scala:32: not found: type ServletMeta
    Collections.singletonList(new ServletMeta(Paths.CHUNK, classOf[ChunkServlet]))
                                  ^
/Users/project/Bhushan/Projects/SparklingWaterH2OFull/sparkling-water/extensions/src/main/scala/ai/h2o/sparkling/extensions/serde/ChunkAutoBufferReader.scala:58: method getInt in class AutoBuffer cannot be accessed in water.AutoBuffer
    val data = buffer.getInt
                      ^
5 errors found

> Task :sparkling-water-extensions:compileScala FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':sparkling-water-extensions:compileScala'.
> Compilation failed

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 13s
16 actionable tasks: 4 executed, 12 up-to-date

Am I using the right command for the build?

Even the master branch build is failing.

mn-mikke commented 4 years ago

Hi @BhushG, the current master contains some changes that require a build against master or rel-yu of H2O-3.

To build this branch, you also need to clone the h2o-3 repo, switch to rel-yu, and build SW against H2O-3: ./gradlew build -x check --include-build H2O_REPO_PATH

To build deployable artifacts, run: ./gradlew dist --include-build H2O_REPO_PATH

Please also see H2O-3 build instructions: https://github.com/h2oai/h2o-3#4-building-h2o-3

BhushG commented 4 years ago

@mn-mikke Thanks

@mn-mikke @jakubhava Hey, I finally found the problem: I was using cluster as the deploy mode. I changed it to client, and now the hadoop command executes successfully.

Will setting deploy-mode to client also solve this internal backend issue (https://github.com/h2oai/sparkling-water/issues/1739)? What do you think?

BhushG commented 4 years ago

Hello @mn-mikke, @jakubhava, we can set the PATH for yarn-cluster mode as well. Set the following Spark configuration property:

spark.yarn.appMasterEnv.PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/project/usr/local/hadoop/bin:/home/project/usr/local/hadoop/sbin:/home/project/usr/local/spark/bin:/home/project/usr/local/kafka/bin:/home/project/usr/local/scala/bin:/home/project/usr/local/hadoop/bin:/home/project/usr/local/hadoop/sbin:/home/project/usr/local/spark/bin:/home/project/usr/local/kafka/bin:/home/project/usr/local/scala/bin:/home/project/usr/local/hive/bin:/home/project/usr/local/zookeeper/bin
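
The same property can also be set programmatically when the SparkConf is built; a sketch (the hadoop directories below are this cluster's example locations taken from the value above):

import org.apache.spark.SparkConf

// Make "hadoop" resolvable inside the YARN application master by putting
// its bin/sbin directories on the container's PATH at submit time.
val sparkConf = new SparkConf()
  .set("spark.yarn.appMasterEnv.PATH",
    "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:" +
    "/home/project/usr/local/hadoop/bin:/home/project/usr/local/hadoop/sbin")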

The issue has been resolved. Thanks for your help, @mn-mikke, @jakubhava :)