mesosphere / spark-build

Used to build the mesosphere/spark docker image and the DC/OS Spark package

wrong order while fetching resources #158

Closed dbolshak closed 7 years ago

dbolshak commented 7 years ago

I submit my Spark job as follows:

cnt-bolshakov-mbp:~ dbolshak$ dcos spark run --submit-args="--executor-memory 20G --total-executor-cores 10 --conf spark.eventLog.enabled=true --class fullQuilifiedClassName hdfs://path/my.jar"
127.0.0.1 - - [30/Jun/2017 10:46:01] "POST /v1/submissions/create HTTP/1.1" 200 -
127.0.0.1 - - [30/Jun/2017 10:46:01] "GET /v1/submissions/status/driver-20170630074601-0013 HTTP/1.1" 200 -
Run job succeeded. Submission id: driver-20170630074601-0013

Of course, fullQuilifiedClassName and hdfs://path/my.jar are placeholders for the real values.

But the job fails while fetching resources with the following error:

task_id { value: "driver-20170630074601-0013" } state: TASK_FAILED message: "Failed to launch container: Failed to fetch all URIs for container \'4516a167-67b4-40d4-a7fc-9a1c275fc150\' with exit status: 256" slave_id { value: "a664123b-71b7-4af3-81b9-691082c18b82-S23" } timestamp: 1.498808347459905E9 executor_id { value: "driver-20170630073906-0010" } source: SOURCE_SLAVE reason: REASON_CONTAINER_LAUNCH_FAILED uuid: "\343Y.%\265:Nc\207\241,(p\333-\234" container_status { }

At the same time, one of the Mesos agents has the following log:

I0630 10:46:01.748901 128412 fetcher.cpp:531] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/a664123b-71b7-4af3-81b9-691082c18b82-S22","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"hdfs://path/my.jar"}},{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"http:\/\/api.hdfs.marathon.l4lb.thisdcos.directory\/v1\/endpoints\/hdfs-site.xml"}},{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"http:\/\/api.hdfs.marathon.l4lb.thisdcos.directory\/v1\/endpoints\/core-site.xml"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/a664123b-71b7-4af3-81b9-691082c18b82-S22\/frameworks\/a664123b-71b7-4af3-81b9-691082c18b82-0012\/executors\/driver-20170630074601-0013\/runs\/84e4daa2-86b3-4f37-894f-3195c59f4325"}
I0630 10:46:01.752202 128412 fetcher.cpp:442] Fetching URI 'hdfs://path/my.jar'
I0630 10:46:01.752223 128412 fetcher.cpp:283] Fetching directly into the sandbox directory
I0630 10:46:01.752244 128412 fetcher.cpp:220] Fetching URI 'hdfs://path/my.jar'
I0630 10:46:02.005959 128412 fetcher.cpp:138] Downloading resource with Hadoop client from 'hdfs://path/my.jar' to '/var/lib/mesos/slave/slaves/a664123b-71b7-4af3-81b9-691082c18b82-S22/frameworks/a664123b-71b7-4af3-81b9-691082c18b82-0012/executors/driver-20170630074601-0013/runs/84e4daa2-86b3-4f37-894f-3195c59f4325/my.jar'
Failed to fetch 'hdfs://path/my.jar': HDFS copyToLocal failed: Unexpected result from the subprocess: status='256', stdout='', stderr='Exception in thread "main" java.lang.RuntimeException: core-site.xml not found
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2518)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2444)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2361)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1099)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1071)
    at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1409)
    at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:319)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:485)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
'
End fetcher log for container 84e4daa2-86b3-4f37-894f-3195c59f4325

Looking at the first line:

I0630 10:46:01.748901 128412 fetcher.cpp:531] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/a664123b-71b7-4af3-81b9-691082c18b82-S22","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"hdfs://path/my.jar"}},{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"http:\/\/api.hdfs.marathon.l4lb.thisdcos.directory\/v1\/endpoints\/hdfs-site.xml"}},{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"http:\/\/api.hdfs.marathon.l4lb.thisdcos.directory\/v1\/endpoints\/core-site.xml"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/a664123b-71b7-4af3-81b9-691082c18b82-S22\/frameworks\/a664123b-71b7-4af3-81b9-691082c18b82-0012\/executors\/driver-20170630074601-0013\/runs\/84e4daa2-86b3-4f37-894f-3195c59f4325"}

I can assume that the HDFS config files should be fetched before contacting the HDFS cluster.
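Roughly, the order I would expect is the reverse of what the fetcher log shows; a minimal sketch, assuming the endpoint URLs from the log above and using the sandbox directory as a temporary HADOOP_CONF_DIR:

# 1. Fetch the Hadoop client configuration into the sandbox first
curl -o core-site.xml http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/core-site.xml
curl -o hdfs-site.xml http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml
# 2. Only then shell out to the Hadoop client for the hdfs:// URI
HADOOP_CONF_DIR=$PWD hadoop fs -copyToLocal hdfs://path/my.jar .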

Spark version 2.1.1.

mgummelt commented 7 years ago

Hi @dbolshak

When launching your job, the Spark Dispatcher will fetch your jar via the Mesos Fetcher: http://mesos.apache.org/documentation/latest/fetcher/

By default, the Mesos Fetcher fetches hdfs:// URLs by shelling out to the hadoop binary on the machine. Is hadoop installed? If so, is it working properly? Is it configured with core-site.xml and hdfs-site.xml? Can you use it to fetch your jar manually?
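For example, a quick sanity check on one of the agents might look like this (a sketch only; the jar path is the placeholder from your submission, not the exact commands the fetcher runs):

which hadoop                 # is the hadoop binary on the PATH the agent uses?
echo "$HADOOP_CONF_DIR"      # where does the client look for its configuration?
ls "$HADOOP_CONF_DIR"/core-site.xml "$HADOOP_CONF_DIR"/hdfs-site.xml
hadoop fs -copyToLocal hdfs://path/my.jar /tmp/    # the same operation the fetcher shells out to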

cc @susanxhuynh @ArtRand

dbolshak commented 7 years ago

Hello @mgummelt ,

Thanks for the quick response, and sorry for my delayed one.

Hadoop binaries are installed on all agents, but Hadoop is not configured, so the default Hadoop configuration (/etc/hadoop/) is untouched. That directory does contain core-site.xml and hdfs-site.xml, and I also see a HADOOP_CONF_DIR env var that points to /etc/hadoop.

Of course, with such a configuration it's not possible to access the real HDFS, but that cannot be the cause of the "core-site.xml not found" error.

And I would insist that it should be possible to run a Spark job without a properly configured HDFS on the agents (core-site.xml and hdfs-site.xml), because the current behaviour does not allow running several independent HDFS services. Allowing only a single HDFS service is a huge limitation: it means Mesos does not support multi-tenancy, and it's not possible to use a single Mesos cluster to manage different environments (for example, production and development).

Regarding running the job manually: Spark and HDFS work fine, and fetching the config files manually and then running the job manually works fine as well.
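For reference, "manually" here means roughly the following on an agent (the scratch directory is just an example):

mkdir -p /tmp/hdfs-conf
curl -o /tmp/hdfs-conf/core-site.xml http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/core-site.xml
curl -o /tmp/hdfs-conf/hdfs-site.xml http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml
HADOOP_CONF_DIR=/tmp/hdfs-conf hadoop fs -copyToLocal hdfs://path/my.jar /tmp/

So the hadoop client on the agents works as soon as the config files are in place; the failure happens only because the fetcher asks it for the hdfs:// URI before those files have been downloaded.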

@susanxhuynh and @ArtRand, could you please join this discussion? I still think there is a problem somewhere.

Kind regards, Denis

mgummelt commented 7 years ago

@dbolshak Since this is an issue with the Mesos Fetcher, you'll have better luck asking for help on the DC/OS mailing lists. They can help you get your HDFS configured properly.