chipsenkbeil opened this issue 10 years ago
FYI, I also tried updating Deimos to 0.4.0, which was available via `pip install deimos --upgrade`. Nothing has changed with the above issue.
I might have allowed Ubuntu to update the kernel through a standard upgrade. Would I need to reinstall anything to do with Mesos after that?
Try with `--default_container_image=docker:///libmesos/ubuntu:13.10`.
I tried that, and it's still either ignoring that option or passing in ubuntu:14.04 directly:
@rcsenkbeil I believe, though I may be wrong, that the Deimos config file's default image takes precedence over the one you configure on the Mesos slave. Are you using the example config? https://github.com/mesosphere/deimos/blob/master/example.cfg#L24
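(For reference, a minimal sketch of the stanza in question, reconstructed from the `Found: deimos.config.Containers(...)` lines later in this thread; treat the section and key names as approximate and check them against example.cfg:)

```
[containers.image]
ignore: false
default: docker:///libmesos/ubuntu:13.10
```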
Yeah, I also added the example config. Since there wasn't much documentation on each individual option, I just guessed at what to do.
The config located at /home/mesos/.deimos:
The results of running `deimos config`:
Any chance you can dump your Deimos logs here? I notice you have `console: DEBUG` enabled in your configuration.
Another thing I can think of is that mesos master/slave/marathon are all running as root since I installed them all through deb packages rather than building from source. I guess I'd need to move the deimos config to the root level for it to apply to the running instances?
But, if that's the case, there shouldn't be any config affecting mesos-master and mesos-slave, right?
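(A minimal sketch of that move, assuming Deimos running under root reads /root/.deimos, which the later `Loaded configuration from /root/.deimos` log lines confirm:)

```sh
# Copy the per-user config to where root's Deimos will look, since the slave runs as root
sudo cp /home/mesos/.deimos /root/.deimos
# Then check what the root user's Deimos actually resolves
sudo deimos config
```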
The results of running `deimos config` as root:
@tarnfeld Sorry, I literally installed and got everything running yesterday. Where would the log you want be found?
Looks like the console logging is written to the stderr file in the sandbox; can you paste the full output of that? You can get to it through the Mesos Web UI.
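(If the Web UI is awkward from inside the VM, a rough sketch for finding the same files on disk, assuming the default slave work directory of /tmp/mesos; adjust for your --work_dir:)

```sh
# Executor sandboxes live under
#   <work_dir>/slaves/<slave-id>/frameworks/<framework-id>/executors/<executor-id>/runs/<container-id>/
sudo find /tmp/mesos/slaves -name stderr 2>/dev/null
```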
Ah, okay. Also, it's showing up in /var/log/syslog:
In terms of the stderr file, this is what I see:
Unable to find image 'libmesos/ubuntu:14.04' locally
Pulling repository libmesos/ubuntu
2014/07/10 11:21:57 Tag 14.04 not found in repository libmesos/ubuntu
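(A quick way to confirm which tags the repository actually provides is to pull the suggested one explicitly; a sketch:)

```sh
# :13.10 is the tag suggested above; if this pull succeeds while :14.04 fails,
# the problem is the default image being handed to Deimos, not the registry.
sudo docker pull libmesos/ubuntu:13.10
```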
A new update. I moved the .deimos file to /root since mesos was running under root. I deployed a new spark job and this is what I see:
Unable to find image 'ubuntu:13.10' locally
Pulling repository ubuntu
… deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/fs:/tmp/mesos-sandbox -c 1024 -m 2048m -e SPARK_USER=mesos -e SPARK_EXECUTOR_MEMORY=2048m -e MESOS_SLAVE_ID=20140710-110412-16842879-5050-1441-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-110412-16842879-5050-1441-0004 -e MESOS_EXECUTOR_ID=20140710-110412-16842879-5050-1441-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:31:43.415 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/1d9f565c-43a9-4467-a105-2be15524a086/cid
11:32:56.925 deimos.cli() deimos.state.CIDTimeout: No CID file after 60s
I'm going to update the VM with more resources to see if it is just really sluggish.
And now I can launch Spark, but all of the tasks end up lost. From one of the tasks, I see the following in stderr:
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
JAVA_HOME is not set
11:37:15.791 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/fs:/tmp/mesos-sandbox -c 1024 -m 2048m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=2048m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0000 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:37:15.799 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/d93c6db7-42fa-4083-9c7d-d9e4cb10c4af/cid
11:37:16.804 deimos.state.lock() request // wait EX (60s)
11:37:16.807 deimos.state.lock() success // wait EX (60s)
11:37:16.810 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:37:16.811 deimos.containerizer.docker.launch() mesos = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:16.810 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:37:16.811 deimos.containerizer.docker.launch() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.811 deimos.containerizer.docker.launch() mesos = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:16.812 deimos.containerizer.docker.launch() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.812 deimos.containerizer.docker.launch() Forking watcher into child...
11:37:16.812 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.813 deimos.usage.self() rss = 16.047M user = 0.158 sys = 0.074
11:37:16.813 deimos.usage.children() rss = 10.777M user = 1.991 sys = 2.107
11:37:16.858 deimos.containerizer.docker.launch() exit 1 // docker --host unix:///var/run/docker.sock wait a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:16.859 deimos.containerizer.docker.launch() STDERR // Error: wait: no such container: a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
2014/07/10 11:37:16 Error: failed to wait one or more containers
11:37:16.860 deimos.cli() Command '['docker', '--host', u'unix:///var/run/docker.sock', 'wait', 'a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515']' returned non-zero exit status 1
11:37:17.585 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:37:17.586 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:37:17.587 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:37:17.587 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:37:17.588 deimos.usage.self() rss = 16.043M user = 0.121 sys = 0.045
11:37:17.588 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:37:17.589 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:37:17.589 deimos.proto.deserialize() container_id {
11:37:17.590 deimos.proto.deserialize() value: "d93c6db7-42fa-4083-9c7d-d9e4cb10c4af"
11:37:17.590 deimos.proto.deserialize() }
11:37:17.591 deimos.containerizer.docker.wait()
11:37:17.592 deimos.state.lock() request // launch SH (60s)
11:37:17.593 deimos.state.lock() success // launch SH (60s)
11:37:17.594 deimos.containerizer.docker.wait() eid = 20140710-113632-16842879-5050-1568-0
11:37:17.594 deimos.containerizer.docker.wait() mesos = d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
11:37:17.595 deimos.containerizer.docker.wait() docker = a2f33eeb4a31c0486ed55eb9e7b0ea9f01b597d34f474ea74af74152ff76f515
11:37:17.595 deimos.state.lock() request // observe SH (indefinite)
11:37:17.596 deimos.state.lock() success // observe SH (indefinite)
11:37:17.597 deimos.state.lock() request // wait SH (indefinite)
11:37:17.598 deimos.state.lock() success // wait SH (indefinite)
11:37:17.598 deimos.proto.serialize() Termination.status = 16384
11:37:17.599 deimos.proto.serialize() Termination.message = ''
11:37:17.599 deimos.proto.serialize() Termination.killed = False
11:37:17.600 deimos.cli() deimos.err.Err: Wait lock is not held nor is exit file present
And the following in stdout:
Running spark-executor with framework dir = /tmp/mesos-sandbox/spark-1.0.0-bin-hadoop2
[EDIT]
Below is the output from running spark-submit:
ibm@ibm-VirtualBox:/opt/spark-1.0.0-bin-hadoop2$ spark-submit --class org.apache.spark.examples.SparkPi --master mesos://127.0.1.1:5050 --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 lib/spark-examples*.jar 10
14/07/10 11:41:52 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/10 11:41:52 WARN Utils: Your hostname, ibm-VirtualBox resolves to a loopback address: 127.0.1.1; using 172.17.42.1 instead (on interface docker0)
14/07/10 11:41:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/07/10 11:41:52 INFO SecurityManager: Changing view acls to: ibm
14/07/10 11:41:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ibm)
14/07/10 11:41:53 INFO Slf4jLogger: Slf4jLogger started
14/07/10 11:41:53 INFO Remoting: Starting remoting
14/07/10 11:41:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@172.17.42.1:58372]
14/07/10 11:41:53 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@172.17.42.1:58372]
14/07/10 11:41:54 INFO SparkEnv: Registering MapOutputTracker
14/07/10 11:41:54 INFO SparkEnv: Registering BlockManagerMaster
14/07/10 11:41:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140710114154-7601
14/07/10 11:41:54 INFO MemoryStore: MemoryStore started with capacity 589.2 MB.
14/07/10 11:41:54 INFO ConnectionManager: Bound socket to port 54642 with id = ConnectionManagerId(172.17.42.1,54642)
14/07/10 11:41:54 INFO BlockManagerMaster: Trying to register BlockManager
14/07/10 11:41:54 INFO BlockManagerInfo: Registering block manager 172.17.42.1:54642 with 589.2 MB RAM
14/07/10 11:41:54 INFO BlockManagerMaster: Registered BlockManager
14/07/10 11:41:54 INFO HttpServer: Starting HTTP Server
14/07/10 11:41:54 INFO HttpBroadcast: Broadcast server started at http://172.17.42.1:40132
14/07/10 11:41:54 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2b706d82-7321-47a9-a6ed-f884b5d4cfa0
14/07/10 11:41:54 INFO HttpServer: Starting HTTP Server
14/07/10 11:41:54 INFO SparkUI: Started SparkUI at http://172.17.42.1:4040
14/07/10 11:41:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/10 11:41:55 INFO EventLoggingListener: Logging events to /tmp/sparklogs/spark-pi-1405010515376
14/07/10 11:41:55 INFO SparkContext: Added JAR file:/opt/spark-1.0.0-bin-hadoop2/lib/spark-examples-1.0.0-hadoop2.2.0.jar at http://172.17.42.1:58841/jars/spark-examples-1.0.0-hadoop2.2.0.jar with timestamp 1405010515911
I0710 11:41:56.042367 3435 sched.cpp:126] Version: 0.19.0
I0710 11:41:56.044584 3439 sched.cpp:222] New master detected at master@127.0.1.1:5050
I0710 11:41:56.044920 3439 sched.cpp:230] No credentials provided. Attempting to register without authentication
I0710 11:41:56.047376 3439 sched.cpp:397] Framework registered with 20140710-113632-16842879-5050-1568-0002
14/07/10 11:41:56 INFO MesosSchedulerBackend: Registered as framework ID 20140710-113632-16842879-5050-1568-0002
14/07/10 11:41:56 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
14/07/10 11:41:56 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 10 output partitions (allowLocal=false)
14/07/10 11:41:56 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
14/07/10 11:41:56 INFO DAGScheduler: Parents of final stage: List()
14/07/10 11:41:56 INFO DAGScheduler: Missing parents: List()
14/07/10 11:41:56 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:31), which has no missing parents
14/07/10 11:41:56 INFO DAGScheduler: Submitting 10 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:31)
14/07/10 11:41:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 5 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 0 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 0 ms
14/07/10 11:41:56 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:41:56 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
W0710 11:41:56.416663 3438 sched.cpp:901] Attempting to launch task 1 with an unknown offer 20140710-113632-16842879-5050-1568-66
W0710 11:41:56.416805 3438 sched.cpp:901] Attempting to launch task 2 with an unknown offer 20140710-113632-16842879-5050-1568-66
W0710 11:41:56.416862 3438 sched.cpp:901] Attempting to launch task 3 with an unknown offer 20140710-113632-16842879-5050-1568-66
14/07/10 11:42:01 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 2 (task 0.0:2)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 3 (task 0.0:3)
14/07/10 11:42:01 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/10 11:42:01 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 0)
14/07/10 11:42:01 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:01 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:0 as TID 4 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:3 as TID 5 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
14/07/10 11:42:01 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:1 as TID 6 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 1 ms
14/07/10 11:42:01 INFO TaskSetManager: Starting task 0.0:2 as TID 7 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:01 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 1 ms
W0710 11:42:01.881386 3440 sched.cpp:901] Attempting to launch task 5 with an unknown offer 20140710-113632-16842879-5050-1568-67
W0710 11:42:01.881475 3440 sched.cpp:901] Attempting to launch task 6 with an unknown offer 20140710-113632-16842879-5050-1568-67
W0710 11:42:01.881731 3440 sched.cpp:901] Attempting to launch task 7 with an unknown offer 20140710-113632-16842879-5050-1568-67
14/07/10 11:42:07 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 5 (task 0.0:3)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 7 (task 0.0:2)
14/07/10 11:42:07 WARN TaskSetManager: Lost TID 6 (task 0.0:1)
14/07/10 11:42:07 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 1)
14/07/10 11:42:07 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:07 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:07 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:1 as TID 8 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:1 as 1411 bytes in 1 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:2 as TID 9 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 0 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:0 as TID 10 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:07 INFO TaskSetManager: Starting task 0.0:3 as TID 11 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:07 INFO TaskSetManager: Serialized task 0.0:3 as 1411 bytes in 0 ms
W0710 11:42:07.820994 3437 sched.cpp:901] Attempting to launch task 9 with an unknown offer 20140710-113632-16842879-5050-1568-70
W0710 11:42:07.821722 3437 sched.cpp:901] Attempting to launch task 10 with an unknown offer 20140710-113632-16842879-5050-1568-70
W0710 11:42:07.821792 3437 sched.cpp:901] Attempting to launch task 11 with an unknown offer 20140710-113632-16842879-5050-1568-70
14/07/10 11:42:13 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 8 (task 0.0:1)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 11 (task 0.0:3)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 10 (task 0.0:0)
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 9 (task 0.0:2)
14/07/10 11:42:13 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 2)
14/07/10 11:42:13 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:13 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:13 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
14/07/10 11:42:13 INFO TaskSetManager: Starting task 0.0:2 as TID 12 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:13 INFO TaskSetManager: Serialized task 0.0:2 as 1411 bytes in 1 ms
14/07/10 11:42:13 INFO TaskSetManager: Starting task 0.0:0 as TID 13 on executor 20140710-113632-16842879-5050-1568-0: ibm-VirtualBox (PROCESS_LOCAL)
14/07/10 11:42:13 INFO TaskSetManager: Serialized task 0.0:0 as 1411 bytes in 0 ms
14/07/10 11:42:13 INFO TaskSetManager: Re-queueing tasks for 20140710-113632-16842879-5050-1568-0 from TaskSet 0.0
14/07/10 11:42:13 WARN TaskSetManager: Lost TID 13 (task 0.0:0)
14/07/10 11:42:13 ERROR TaskSetManager: Task 0.0:0 failed 4 times; aborting job
14/07/10 11:42:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
W0710 11:42:13.820405 3438 sched.cpp:901] Attempting to launch task 13 with an unknown offer 20140710-113632-16842879-5050-1568-72
14/07/10 11:42:13 INFO DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread "main" 14/07/10 11:42:13 INFO TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 13 on host ibm-VirtualBox failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/10 11:42:13 INFO DAGScheduler: Executor lost: 20140710-113632-16842879-5050-1568-0 (epoch 3)
14/07/10 11:42:13 INFO BlockManagerMasterActor: Trying to remove executor 20140710-113632-16842879-5050-1568-0 from BlockManagerMaster.
14/07/10 11:42:13 INFO BlockManagerMaster: Removed 20140710-113632-16842879-5050-1568-0 successfully in removeExecutor
14/07/10 11:42:13 INFO DAGScheduler: Host added was in lost list earlier: ibm-VirtualBox
Ok, progress :-) So it looks like the Docker container is failing to launch (by that I mean the executor is not launching successfully inside it). You can see the container ID as d93c6db7-42fa-4083-9c7d-d9e4cb10c4af. Try running `docker logs d93c6db7-42fa-4083-9c7d-d9e4cb10c4af` and see what comes up; you might see some Spark errors that help you debug.
Weird, it claims the container doesn't exist. There isn't any sort of cleanup of containers, is there?
ibm@ibm-VirtualBox:/opt/spark-1.0.0-bin-hadoop2$ sudo docker logs d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
2014/07/10 11:48:41 Error: No such container: d93c6db7-42fa-4083-9c7d-d9e4cb10c4af
I tried it with a short prefix, `sudo docker logs d93c6`, with the same results.
[EDIT]
In fact, there don't seem to be any previously-run containers:
So, I tried updating the Deimos config that is being used, replacing `docker:///ubuntu:13.10` with `docker:///libmesos/ubuntu:13.10`, to see what would happen. It looks like the Mesos library couldn't be found? I thought libmesos/ubuntu was built to contain that library?
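(One way to check that directly; a sketch that assumes the library would land in /usr/lib or /usr/local/lib:)

```sh
# Start a throwaway container from the image and look for libmesos on the usual library paths
sudo docker run --rm libmesos/ubuntu:13.10 \
  sh -c 'ls /usr/lib /usr/local/lib 2>/dev/null | grep -i mesos'
```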
This is the new error log:
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/07/10 16:55:11 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Failed to load native Mesos library from /usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
Exception in thread "main" java.lang.UnsatisfiedLinkError: no mesos in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:52)
at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:64)
at org.apache.spark.executor.MesosExecutorBackend$$anonfun$main$1.apply$mcV$sp(MesosExecutorBackend.scala:99)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.MesosExecutorBackend$.main(MesosExecutorBackend.scala:98)
at org.apache.spark.executor.MesosExecutorBackend.main(MesosExecutorBackend.scala)
7 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:10.148 deimos.usage.self() rss = 16.047M user = 0.157 sys = 0.072
11:55:10.148 deimos.usage.children() rss = 10.777M user = 1.982 sys = 1.986
11:55:11.090 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:55:11.091 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:55:11.091 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:55:11.092 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///libmesos/ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:55:11.093 deimos.usage.self() rss = 16.047M user = 0.115 sys = 0.045
11:55:11.094 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:55:11.094 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///libmesos/ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:55:11.095 deimos.proto.deserialize() container_id {
11:55:11.095 deimos.proto.deserialize() value: "8c2312fb-3fec-4ac0-941a-db0f587e01d8"
11:55:11.096 deimos.proto.deserialize() }
11:55:11.096 deimos.containerizer.docker.wait()
11:55:11.098 deimos.state.lock() request // launch SH (60s)
11:55:11.098 deimos.state.lock() success // launch SH (60s)
11:55:11.099 deimos.containerizer.docker.wait() eid = 20140710-113632-16842879-5050-1568-0
11:55:11.099 deimos.containerizer.docker.wait() mesos = 8c2312fb-3fec-4ac0-941a-db0f587e01d8
11:55:11.100 deimos.containerizer.docker.wait() docker = 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:11.100 deimos.state.lock() request // observe SH (indefinite)
11:55:11.101 deimos.state.lock() success // observe SH (indefinite)
11:55:11.102 deimos.state.lock() request // wait SH (indefinite)
11:55:11.375 deimos.containerizer.docker.launch() exit 0 // docker --host unix:///var/run/docker.sock wait 71de134814fe64d5094321ce54fca132d1fe14e9cd7163ed5461ea6708c9a3cd
11:55:11.376 deimos.containerizer.docker.launch() STDOUT // 1
11:55:11.377 deimos.state.lock() success // wait SH (indefinite)
11:55:11.378 deimos.proto.serialize() Termination.status = 256
11:55:11.378 deimos.proto.serialize() Termination.message = ''
11:55:11.379 deimos.proto.serialize() Termination.killed = False
11:55:11.380 deimos.usage.self() rss = 16.047M user = 0.124 sys = 0.049
11:55:11.380 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:55:11.383 deimos.containerizer.docker.launch() exit 0 // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/8c2312fb-3fec-4ac0-941a-db0f587e01d8/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/8c2312fb-3fec-4ac0-941a-db0f587e01d8/fs:/tmp/mesos-sandbox -c 1024 -m 512m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=512m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0005 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox libmesos/ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:55:11.385 deimos.usage.self() rss = 11.770M user = 0.007 sys = 0.006
11:55:11.385 deimos.usage.children() rss = 10.777M user = 0.018 sys = 0.039
To reiterate, the previous stderr from using `docker:///ubuntu:13.10` was the following:
WARNING: Your kernel does not support swap limit capabilities. Limitation discarded.
JAVA_HOME is not set
11:57:33.447 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock run --sig-proxy --rm --cidfile /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/cid -w /tmp/mesos-sandbox -v /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/fs:/tmp/mesos-sandbox -c 1024 -m 512m -e SPARK_USER=ibm -e SPARK_EXECUTOR_MEMORY=512m -e MESOS_SLAVE_ID=20140710-113632-16842879-5050-1568-0 -e 'MESOS_SLAVE_PID=slave(1)@127.0.1.1:5051' -e MESOS_FRAMEWORK_ID=20140710-113632-16842879-5050-1568-0006 -e MESOS_EXECUTOR_ID=20140710-113632-16842879-5050-1568-0 -e MESOS_CHECKPOINT=0 -e MESOS_DIRECTORY=/tmp/mesos-sandbox ubuntu:13.10 sh -c 'cd spark-1*; ./sbin/spark-executor'
11:57:33.457 deimos.state.await_cid() Awaiting CID file: /tmp/deimos/mesos/4ff94f11-2671-48e2-bfbf-6b077eb60bf8/cid
11:57:34.461 deimos.state.lock() request // wait EX (60s)
11:57:34.462 deimos.state.lock() success // wait EX (60s)
11:57:34.467 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:57:34.468 deimos.containerizer.docker.launch() mesos = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:34.467 deimos.containerizer.docker.launch() eid = 20140710-113632-16842879-5050-1568-0
11:57:34.469 deimos.containerizer.docker.launch() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.470 deimos.containerizer.docker.launch() mesos = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:34.470 deimos.containerizer.docker.launch() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.470 deimos.containerizer.docker.launch() Forking watcher into child...
11:57:34.472 deimos.containerizer.docker.launch() call // docker --host unix:///var/run/docker.sock wait d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.473 deimos.usage.self() rss = 16.047M user = 0.145 sys = 0.064
11:57:34.473 deimos.usage.children() rss = 10.777M user = 1.968 sys = 1.972
11:57:34.540 deimos.containerizer.docker.launch() exit 1 // docker --host unix:///var/run/docker.sock wait d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:34.541 deimos.containerizer.docker.launch() STDERR // Error: wait: no such container: d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
2014/07/10 11:57:34 Error: failed to wait one or more containers
11:57:34.542 deimos.cli() Command '['docker', '--host', u'unix:///var/run/docker.sock', 'wait', 'd7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471']' returned non-zero exit status 1
11:57:35.265 deimos.config.load_configuration() Loaded configuration from /root/.deimos
11:57:35.265 deimos.config.load_configuration() Found: deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None)
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.State(root='/tmp/deimos')
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.Docker(host=[u'unix:///var/run/docker.sock'])
11:57:35.266 deimos.config.load_configuration() Found: deimos.config.Log(syslog=20, console=10)
11:57:35.267 deimos.config.load_configuration() Found: deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[]))
11:57:35.267 deimos.usage.self() rss = 16.047M user = 0.104 sys = 0.039
11:57:35.268 deimos.usage.children() rss = 0.000M user = 0.000 sys = 0.000
11:57:35.268 deimos.containerizer.stdio() (<bound method Docker.wait of deimos.containerizer.docker.Docker(index_settings=deimos.config.DockerIndex(account_libmesos='libmesos', index='interior-node:2222', account=None, dockercfg=None), workdir='/tmp/mesos-sandbox', state_root='/tmp/deimos', runner=None, optimistic_unpack=True, state=None, shared_dir='fs', container_settings=deimos.config.Containers(image=deimos.config.Image(ignore=False, default='docker:///ubuntu:13.10'), options=deimos.config.Options(ignore=False, default=[], append=[])))>, <class 'containerizer_pb2.Wait'>)
11:57:35.269 deimos.proto.deserialize() container_id {
11:57:35.269 deimos.proto.deserialize() value: "4ff94f11-2671-48e2-bfbf-6b077eb60bf8"
11:57:35.270 deimos.proto.deserialize() }
11:57:35.270 deimos.containerizer.docker.wait()
11:57:35.271 deimos.state.lock() request // launch SH (60s)
11:57:35.272 deimos.state.lock() success // launch SH (60s)
11:57:35.272 deimos.containerizer.docker.wait() eid = 20140710-113632-16842879-5050-1568-0
11:57:35.273 deimos.containerizer.docker.wait() mesos = 4ff94f11-2671-48e2-bfbf-6b077eb60bf8
11:57:35.273 deimos.containerizer.docker.wait() docker = d7e2dafbcfce4514e82f03ffeefe0b4c772f6a0a96de5f5c2b134d4ce7c67471
11:57:35.274 deimos.state.lock() request // observe SH (indefinite)
11:57:35.274 deimos.state.lock() success // observe SH (indefinite)
11:57:35.275 deimos.state.lock() request // wait SH (indefinite)
11:57:35.276 deimos.state.lock() success // wait SH (indefinite)
11:57:35.276 deimos.proto.serialize() Termination.status = 16384
11:57:35.277 deimos.proto.serialize() Termination.message = ''
11:57:35.277 deimos.proto.serialize() Termination.killed = False
11:57:35.278 deimos.cli() deimos.err.Err: Wait lock is not held nor is exit file present
Another interesting thing to note is that the containers only show up when listing all containers (including stopped ones) for a split second. So, I'm guessing they are being removed rather than being allowed to persist?
watch -n 0.1 sudo docker ps -a
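(Most likely, yes: the `docker run` commands Deimos issues above include `--rm`, so exited containers are removed immediately. A rough sketch for catching them anyway:)

```sh
# Stream container lifecycle events in another terminal while re-submitting the job;
# the create/start/die events show up even though --rm removes the container afterwards.
sudo docker events
```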
[EDIT] Also, I want to confirm that deploying Docker containers through Marathon works fine. Maybe Spark itself cannot run in Docker with the way that Deimos handles it?
[EDIT2] Finally, going back to not using deimos (and in turn not using Docker as the containerization), I can run the Spark framework without issues and I can deploy tasks through Marathon. This includes deploying a command that runs a Docker container. I don't know how well Docker plays inside Mesos' default container system, but I can run Docker through the normal setup, at least.
For me, I'd love to have the ability to let Marathon use deimos and the Spark framework use the default containerization, which I see is being worked on. It would allow me to deploy standard Docker containers through a nice RESTful API.
I believe that `libmesos` gets placed in /usr/local/lib by default; or at least, that used to be the case. (The `libmesos/ubuntu` container was built against an RC for Mesos 0.19.0.) Maybe `java.library.path` does not include /usr/local when Spark is run?
You could change `account_libmesos` to point to a different Docker account and create a new container in that account, tagged as ubuntu:14.04. Then you'd have control over the Java version as well as the other supporting libraries.
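(A minimal sketch of that approach; the account name, JDK package, and library source path below are assumptions to adjust:)

```sh
# Build a custom executor image that ships a JDK and puts libmesos where the JVM
# looks by default, then push it under your own account and point account_libmesos at it.
cat > Dockerfile <<'EOF'
FROM ubuntu:13.10
RUN apt-get update && apt-get install -y openjdk-7-jre-headless
# libmesos.so taken from a host with Mesos 0.19.0 installed (source path is an assumption)
COPY libmesos-0.19.0.so /usr/lib/libmesos.so
ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
EOF
sudo docker build -t youraccount/ubuntu:14.04 .
sudo docker push youraccount/ubuntu
```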
Thanks for the post. I went through the same process. Creating a new Docker container with libmesos in /usr/lib solved the UnsatisfiedLinkError.
Yes. I also managed to get Spark working with Deimos. The only thing I couldn't do was figure out how to pass parameters when submitting a job (different from deimos.cfg), e.g. which image should be used. (I only ever ran with the same image.)
@tdna That's currently not supported by Spark. It's something @tnachen or I will be adding over the coming few days. Feel free to contribute, too!
That's cool, thanks!
So, I was fiddling with Mesos on a single-node machine. I'd been using it with Spark and then found out about Deimos for using Docker as an external containerizer.
It worked up until today. I've got this installed on an Ubuntu 14.04 64-bit installation in VirtualBox. The Mesos master and slave are working, and I also have Marathon running. Now, when I try to submit a Spark job, the first stage in the pipeline has a stderr output of the following:
I added the file default_container_image to /etc/mesos-slave with libmesos/ubuntu:13.10 inside it. After rebooting the VM, I can see that the mesos-slave is started with that option:
So, I'm not sure if ubuntu:14.04 is suddenly being fed into the system or what, but I haven't touched anything aside from rebooting the image. I can't figure out how to get the Spark submission to not try to download libmesos/ubuntu:14.04 (and I'm not going through Marathon for this, obviously).
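(For anyone reproducing this setup, a rough sketch of the flag-file approach described above, assuming the Mesosphere packaging where each file under /etc/mesos-slave becomes a `--<filename>` flag on the slave:)

```sh
# Write the flag file and restart the slave; the value should then appear in the process args
echo 'docker:///libmesos/ubuntu:13.10' | sudo tee /etc/mesos-slave/default_container_image
sudo service mesos-slave restart
ps aux | grep [m]esos-slave   # look for --default_container_image=... in the command line
```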