ooyala / spark-jobserver

REST job server for Spark. Note that this is *not* the mainline open source version. For that, go to https://github.com/spark-jobserver/spark-jobserver. This fork now serves as a semi-private repo for Ooyala.

Task lost when running multiple jobs against the same SparkJob #47

Open jccimino opened 10 years ago

jccimino commented 10 years ago

Hi,

We are facing an issue when we launch two curl commands at the JobServer. While the first job finishes successfully, the second one stays stuck in RUNNING mode forever. We are running Mesos 0.19.1, Spark 1.0.1, and a JobServer compiled against Spark 1.0.1.

Here is the code that we run for the Spark job:

object Service extends SparkJob {
  def run(q: String, hc: org.apache.spark.sql.hive.HiveContext) = hc.hql(q).collect()

  override def runJob(sc: SparkContext, jobConfig: Config): Any = {
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val sqlQuery = jobConfig.getString("query")
    val priority = jobConfig.getString("priority")
    run(sqlQuery, hiveContext)
  }
}
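
For reference, a minimal self-contained version of this job might look like the sketch below. It assumes the spark.jobserver.SparkJob trait from this repo, which also requires a validate override; the imports and the validate body are illustrative additions, not part of the original snippet.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

object Service extends SparkJob {
  // Run the Hive query and collect the result on the driver.
  def run(q: String, hc: HiveContext) = hc.hql(q).collect()

  // Illustrative validation: reject submissions that carry no "query" setting.
  override def validate(sc: SparkContext, jobConfig: Config): SparkJobValidation =
    if (jobConfig.hasPath("query")) SparkJobValid else SparkJobInvalid("No query given")

  override def runJob(sc: SparkContext, jobConfig: Config): Any = {
    val hiveContext = new HiveContext(sc)
    val sqlQuery = jobConfig.getString("query")
    val priority = jobConfig.getString("priority") // read but not used yet
    run(sqlQuery, hiveContext)
  }
}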

We launch the following curl command twice:

curl -d '{"query" : "select count(*) from test", "priority": "high"}' 'vm28-hulk-pub:8090/jobs?appName=samples-query-service&classPath=test.example.Service&sync=false'

When querying for the results we get the following:

{
  "duration": "46.759 secs",
  "classPath": "test.example.Service",
  "startTime": "2014-08-04T10:58:50.349-04:00",
  "context": "1a6c1eb8-test.example.Service",
  "status": "FINISHED",
  "jobId": "e021d611-c403-41bc-afce-2d8ea7f59b60"
},
{
  "duration": "Job not done yet",
  "classPath": "test.example.Service",
  "startTime": "2014-08-04T10:58:53.930-04:00",
  "context": "2dd46ecb-test.example.Service",
  "status": "RUNNING",
  "jobId": "1c3b9ccc-327a-4dd6-8aa3-c049d8b89d32"
}

In jobserver.log we get "Lost TID" warnings like the following:

[2014-08-01 16:23:49,588] WARN  k.scheduler.TaskSetManager [] [] - Lost TID 3 (task 1.0:3)
[2014-08-01 16:23:49,589] WARN  k.scheduler.TaskSetManager [] [] - Lost TID 6 (task 1.0:2)
[2014-08-01 16:23:49,589] WARN  k.scheduler.TaskSetManager [] [] - Lost TID 0 (task 1.0:0)
[2014-08-01 16:23:49,590] INFO  ark.scheduler.DAGScheduler [] [] - Executor lost: 20140714-142853-485682442-5050-25487-5 (epoch 2)
[2014-08-01 16:23:49,590] INFO  ge.BlockManagerMasterActor [] [] - Trying to remove executor 20140714-142853-485682442-5050-25487-5 from BlockManagerMaster.
.....
[2014-08-04 10:59:26,403] WARN  k.scheduler.TaskSetManager [] [] - Lost TID 11 (task 1.0:2)
[2014-08-04 10:59:26,404] ERROR k.scheduler.TaskSetManager [] [] - Task 1.0:2 failed 4 times; aborting job
[2014-08-04 10:59:26,408] INFO  cheduler.TaskSchedulerImpl [] [] - Cancelling stage 1
[2014-08-04 10:59:26,410] INFO  ark.scheduler.DAGScheduler [] [] - Could not cancel tasks for stage 1
java.lang.UnsupportedOperationException
    at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
    at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1070)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1056)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1056)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1056)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
.....
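
The UnsupportedOperationException at the bottom of that trace comes from the default SchedulerBackend.killTask, which the fine-grained Mesos backend evidently does not override in this Spark version, so cancelling the aborted stage's remaining tasks fails. Roughly (a paraphrased sketch of the behaviour visible in the trace, not the actual Spark source):

// Sketch of the default behaviour implied by the stack trace above (not the real Spark code):
// the base scheduler backend cannot kill an individual task, so the fine-grained Mesos
// backend inherits an implementation that simply throws.
trait SchedulerBackendSketch {
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
}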

We tried many configurations in jobserver.conf, such as:

job-number-cpus = 4
max-jobs-per-context = 32

but none of them resolved the issue.

Any ideas?

Thanks,

Jean-Christophe.

velvia commented 10 years ago

Hey there,

Am currently on vacation, will look into it after I'm back.

-Evan "Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead

JulienPotvin commented 9 years ago

Hi Evan,

Do you have any news on this issue? I am able to run multiple jobs within the same context, but if I try to run jobs on more than one context at a time, there are broadcast issues. I read somewhere that Spark doesn't handle multiple SparkContexts within a single process very well. How does the JobServer handle multiple contexts?

Thank you for your time and insight! Julien

velvia commented 9 years ago

Hi Julien,

I've been able to run with multiple contexts before. Have you made sure that both contexts have enough resources (cpu/mem)?

-Evan

The fruit of silence is prayer; the fruit of prayer is faith; the fruit of faith is love; the fruit of love is service; the fruit of service is peace. -- Mother Teresa

JulienPotvin commented 9 years ago

Hi Evan,

I believe both contexts were allocated enough resources, given that they complete without problems when run individually. I posted more details on the Google group: https://groups.google.com/forum/#!topic/spark-jobserver/1CBs0vVnNmc

Thanks again, Julien