microsoft / Mobius

C# and F# language binding and extensions to Apache Spark
MIT License

Error in stopping SparkContext causes failed status in YARN cluster mode. #642

Open tfiasco opened 7 years ago

tfiasco commented 7 years ago

Hello, I'm running a Mobius application in YARN cluster mode. When the application is nearly finished, an error appears in the Spark logs about the failure to delete the userFiles directory under the Spark temp directory. Mobius tells me the error may be ignored for now, but a FAILED status is reported to the YARN cluster even though the application actually succeeded. As a consequence, YARN starts another attempt on a second node, and that attempt fails as well. Since I'm running the application on a cluster with two nodes, YARN ends up running the job twice. How can I avoid the error in deleting the temp directory?

Thanks very much.

tfiasco commented 7 years ago

I reconsidered this problem, and I'm not sure whether the FAILED status is actually caused by the failure to delete the temp directory.

This is the console output:

test1@Master:/usr/lib/sparkclr$ sparkclr-submit.sh --master yarn --deploy-mode cluster --exe SparkClrPi.exe /usr/lib/sparkclr/examples/Batch/pi/
[sparkclr-submit.sh] LAUNCH_CLASSPATH=/usr/lib/sparkclr/runtime/lib/spark-clr_2.11-2.0.200.jar:/usr/local/spark-2.0.2-bin-hadoop2.7/jars/*
Zip driver directory /usr/lib/sparkclr/examples/Batch/pi to /tmp/pi_1490072786367.zip
[sparkclr-submit.sh] Command to run  --master yarn --deploy-mode cluster --name SparkClrPi --files /tmp/pi_1490072786367.zip --class org.apache.spark.deploy.csharp.CSharpRunner /usr/lib/sparkclr/runtime/lib/spark-clr_2.11-2.0.200.jar pi_1490072786367.zip SparkClrPi.exe
17/03/20 22:06:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/20 22:06:28 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.169.141:8032
17/03/20 22:06:28 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
17/03/20 22:06:28 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
17/03/20 22:06:28 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
17/03/20 22:06:28 INFO yarn.Client: Setting up container launch context for our AM
17/03/20 22:06:28 INFO yarn.Client: Setting up the launch environment for our AM container
17/03/20 22:06:28 INFO yarn.Client: Preparing resources for our AM container
17/03/20 22:06:28 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://Master:9000/spark-jars/JavaEWAH-0.3.2.jar
17/03/20 22:06:28 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://Master:9000/spark-jars/RoaringBitmap-0.5.11.jar
17/03/20 22:06:29 INFO yarn.Client: Uploading resource file:/usr/lib/sparkclr/runtime/lib/spark-clr_2.11-2.0.200.jar -> hdfs://Master:9000/user/test1/.sparkStaging/application_1490070587984_0007/spark-clr_2.11-2.0.200.jar
17/03/20 22:06:29 INFO yarn.Client: Uploading resource file:/tmp/pi_1490072786367.zip -> hdfs://Master:9000/user/test1/.sparkStaging/application_1490070587984_0007/pi_1490072786367.zip
17/03/20 22:06:29 INFO yarn.Client: Uploading resource file:/tmp/spark-a244ed7e-5265-4a80-a97e-7cd4a83425c6/__spark_conf__4813657159454156389.zip -> hdfs://Master:9000/user/test1/.sparkStaging/application_1490070587984_0007/__spark_conf__.zip
17/03/20 22:06:29 INFO spark.SecurityManager: Changing view acls to: test1
17/03/20 22:06:29 INFO spark.SecurityManager: Changing modify acls to: test1
17/03/20 22:06:29 INFO spark.SecurityManager: Changing view acls groups to:
17/03/20 22:06:29 INFO spark.SecurityManager: Changing modify acls groups to:
17/03/20 22:06:29 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(test1); groups with view permissions: Set(); users  with modify permissions: Set(test1); groups with modify permissions: Set()
17/03/20 22:06:29 INFO yarn.Client: Submitting application application_1490070587984_0007 to ResourceManager
17/03/20 22:06:29 INFO impl.YarnClientImpl: Submitted application application_1490070587984_0007
17/03/20 22:06:30 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:30 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1490072789938
         final status: UNDEFINED
         tracking URL: http://Master:8088/proxy/application_1490070587984_0007/
         user: test1
17/03/20 22:06:31 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:32 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:33 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:34 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:34 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 192.168.169.129
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1490072789938
         final status: UNDEFINED
         tracking URL: http://Master:8088/proxy/application_1490070587984_0007/
         user: test1
17/03/20 22:06:35 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:36 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:37 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:38 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:39 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:40 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:41 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:42 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:43 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:45 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:46 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:47 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:47 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1490072789938
         final status: UNDEFINED
         tracking URL: http://Master:8088/proxy/application_1490070587984_0007/
         user: test1
17/03/20 22:06:48 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:49 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:50 INFO yarn.Client: Application report for application_1490070587984_0007 (state: ACCEPTED)
17/03/20 22:06:51 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:51 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 192.168.169.128
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1490072789938
         final status: UNDEFINED
         tracking URL: http://Master:8088/proxy/application_1490070587984_0007/
         user: test1
17/03/20 22:06:52 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:53 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:54 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:55 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:56 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:57 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:58 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:06:59 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:00 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:01 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:02 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:03 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:04 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:05 INFO yarn.Client: Application report for application_1490070587984_0007 (state: RUNNING)
17/03/20 22:07:06 INFO yarn.Client: Application report for application_1490070587984_0007 (state: FINISHED)
17/03/20 22:07:06 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 192.168.169.128
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1490072789938
         final status: FAILED
         tracking URL: http://Master:8088/proxy/application_1490070587984_0007/
         user: test1
Exception in thread "main" org.apache.spark.SparkException: Application application_1490070587984_0007 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/03/20 22:07:06 INFO util.ShutdownHookManager: Shutdown hook called
17/03/20 22:07:06 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a244ed7e-5265-4a80-a97e-7cd4a83425c6

The stdout of the node:

[2017-03-20 22:06:34,619] [1] [INFO ] [Microsoft.Spark.CSharp.Configuration.ConfigurationService] - ConfigurationService runMode is YARN
[2017-03-20 22:06:34,627] [1] [INFO ] [Microsoft.Spark.CSharp.Configuration.ConfigurationService+SparkCLRConfiguration] - CSharpBackend successfully read from environment variable CSHARPBACKEND_PORT
[2017-03-20 22:06:34,629] [1] [INFO ] [Microsoft.Spark.CSharp.Proxy.Ipc.SparkCLRIpcProxy] - CSharpBackend port number to be used in JvMBridge is 43932
[2017-03-20 22:06:40,973] [1] [INFO ] [Microsoft.Spark.CSharp.Core.SparkContext] - Parallelizing 300001 items to form RDD in the cluster with 3 partitions
[2017-03-20 22:06:42,496] [1] [INFO ] [Microsoft.Spark.CSharp.Core.RDD`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]] - Executing Map operation on RDD (preservesPartitioning=False)
[2017-03-20 22:06:42,498] [1] [INFO ] [Microsoft.Spark.CSharp.Core.RDD`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]] - Executing Reduce operation on RDD
[2017-03-20 22:06:45,675] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - (anonymous method approach) Pi is roughly 3.12157333333333.
[2017-03-20 22:06:45,676] [1] [INFO ] [Microsoft.Spark.CSharp.Core.RDD`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]] - Executing Map operation on RDD (preservesPartitioning=False)
[2017-03-20 22:06:45,676] [1] [INFO ] [Microsoft.Spark.CSharp.Core.RDD`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]] - Executing Reduce operation on RDD
[2017-03-20 22:06:47,161] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - (serialized class approach) Pi is roughly 3.13689333333333.
[2017-03-20 22:06:47,161] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - Completed calculating the value of Pi
[2017-03-20 22:06:47,161] [1] [INFO ] [Microsoft.Spark.CSharp.Core.SparkContext] - Stopping SparkContext
[2017-03-20 22:06:47,161] [1] [INFO ] [Microsoft.Spark.CSharp.Core.SparkContext] - Note that there might be error in Spark logs on the failure to delete userFiles directory under Spark temp directory (spark.local.dir config value in local mode)
[2017-03-20 22:06:47,161] [1] [INFO ] [Microsoft.Spark.CSharp.Core.SparkContext] - This error may be ignored for now. See https://issues.apache.org/jira/browse/SPARK-8333 for details

The last several lines of stderr:

17/03/20 22:06:47 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.169.129:36353
17/03/20 22:06:47 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/03/20 22:06:47 INFO yarn.YarnAllocator: Canceling requests for 0 executor container(s) to have a new desired total 0 executors.
17/03/20 22:06:47 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
17/03/20 22:06:47 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
17/03/20 22:06:47 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/03/20 22:06:47 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
17/03/20 22:06:47 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/20 22:06:47 INFO memory.MemoryStore: MemoryStore cleared
17/03/20 22:06:47 INFO storage.BlockManager: BlockManager stopped
17/03/20 22:06:47 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/03/20 22:06:47 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/20 22:06:47 INFO spark.SparkContext: Successfully stopped SparkContext
17/03/20 22:06:47 INFO csharp.CSharpBackend: Requesting to close all call back sockets.
17/03/20 22:06:47 INFO csharp.CSharpRunner: Closing CSharpBackend
17/03/20 22:06:47 INFO csharp.CSharpBackend: Requesting to close all call back sockets.
17/03/20 22:06:47 INFO csharp.CSharpRunner: Return CSharpBackend code 0
17/03/20 22:06:47 INFO csharp.Utils: Utils.exit() with status: 0, maxDelayMillis: 1000
17/03/20 22:06:47 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/03/20 22:06:47 INFO util.ShutdownHookManager: Shutdown hook called
17/03/20 22:06:47 INFO util.ShutdownHookManager: Deleting directory /usr/local/hadoop-2.7.3/tmp/nm-local-dir/usercache/test1/appcache/application_1490070587984_0007/spark-56003f30-fe1d-454f-ac80-4d372d428beb
skaarthik commented 7 years ago

In stderr, I see "Driver requested a total number of 0 executor(s)." Try using the --num-executors parameter in sparkclr-submit.

tfiasco commented 7 years ago

I used the command sparkclr-submit.sh --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 --executor-memory 1G --conf spark.speculation=true --driver-memory 1G --exe SparkClrPi.exe /usr/lib/sparkclr/examples/Batch/pi/

The status is still FAILED, and stderr still shows "Driver requested a total number of 0 executor(s)."

skaarthik commented 7 years ago

Are you able to successfully run the Scala Pi example in your YARN cluster?

tfiasco commented 7 years ago

Yes, I tried spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster ./examples/jars/spark-examples_2.11-2.0.2.jar 10, which returned a SUCCEEDED final status.

skaarthik commented 7 years ago

Do you see anything in the stderr file when running the Scala example?

tfiasco commented 7 years ago

The last lines in the stderr of the Scala example are:

17/03/21 16:01:01 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.468 s
17/03/21 16:01:01 INFO cluster.YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/03/21 16:01:01 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.293689 s
17/03/21 16:01:01 INFO server.ServerConnector: Stopped ServerConnector@7a6b222f{HTTP/1.1}{0.0.0.0:0}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@181edc4f{/stages/stage/kill,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7b9d9a6a{/api,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@204b4ba5{/,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@cbb79a2{/static,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3debd17f{/executors/threadDump/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@36cdb80e{/executors/threadDump,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@30e397ff{/executors/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3490ae3e{/executors,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5f3a82fa{/environment/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@39a7f1fc{/environment,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1c73109b{/storage/rdd/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@75b3de55{/storage/rdd,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7142e73d{/storage/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@53eec3fa{/storage,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@66a1e111{/stages/pool/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2a65c212{/stages/pool,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@758af09a{/stages/stage/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7d190c6b{/stages/stage,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@11bbb671{/stages/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7215b4a5{/stages,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@59da282a{/jobs/job/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2d382edb{/jobs/job,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@780369c7{/jobs/json,null,UNAVAILABLE}
17/03/21 16:01:01 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7d3adbd7{/jobs,null,UNAVAILABLE}
17/03/21 16:01:01 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.169.128:42487
17/03/21 16:01:01 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/03/21 16:01:01 INFO yarn.YarnAllocator: Canceling requests for 0 executor container(s) to have a new desired total 0 executors.
17/03/21 16:01:01 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
17/03/21 16:01:01 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
17/03/21 16:01:01 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/03/21 16:01:01 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
17/03/21 16:01:01 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/21 16:01:01 INFO memory.MemoryStore: MemoryStore cleared
17/03/21 16:01:01 INFO storage.BlockManager: BlockManager stopped
17/03/21 16:01:01 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/03/21 16:01:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/21 16:01:01 INFO spark.SparkContext: Successfully stopped SparkContext
17/03/21 16:01:01 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
17/03/21 16:01:01 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
17/03/21 16:01:01 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
17/03/21 16:01:01 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://Master:9000/user/test1/.sparkStaging/application_1490137059379_0001
17/03/21 16:01:01 INFO util.ShutdownHookManager: Shutdown hook called
17/03/21 16:01:01 INFO util.ShutdownHookManager: Deleting directory /usr/local/hadoop-2.7.3/tmp/nm-local-dir/usercache/test1/appcache/application_1490137059379_0001/spark-3f36dc68-1ebe-4e3c-a868-c7b7cdf26e44

The last lines in the stderr of the Mobius example are:

17/03/21 16:03:55 INFO scheduler.DAGScheduler: ResultStage 1 (collect at PythonRDD.scala:453) finished in 2.275 s
17/03/21 16:03:55 INFO scheduler.DAGScheduler: Job 1 finished: collect at PythonRDD.scala:453, took 2.279892 s
17/03/21 16:03:55 INFO server.ServerConnector: Stopped ServerConnector@5d3a199f{HTTP/1.1}{0.0.0.0:0}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@60c3701b{/stages/stage/kill,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4c18de8d{/api,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@9458eb2{/,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3bb2acf{/static,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@273168d5{/executors/threadDump/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5452294e{/executors/threadDump,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1a7200da{/executors/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@56ee4814{/executors,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@34e1bda{/environment/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@3f4e341e{/environment,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@68049a81{/storage/rdd/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5881be91{/storage/rdd,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@565d4c29{/storage/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@162836a1{/storage,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1ae55a0a{/stages/pool/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@d407249{/stages/pool,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2edddd10{/stages/stage/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1373f649{/stages/stage,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7bed9bf0{/stages/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@569eb406{/stages,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@787429d{/jobs/job/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4cd48440{/jobs/job,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@2d800bde{/jobs/json,null,UNAVAILABLE}
17/03/21 16:03:55 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@21fe16d5{/jobs,null,UNAVAILABLE}
17/03/21 16:03:55 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.169.129:36122
17/03/21 16:03:56 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/03/21 16:03:56 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
17/03/21 16:03:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/03/21 16:03:56 INFO yarn.YarnAllocator: Canceling requests for 0 executor container(s) to have a new desired total 0 executors.
17/03/21 16:03:56 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
17/03/21 16:03:56 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
17/03/21 16:03:56 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/21 16:03:56 INFO memory.MemoryStore: MemoryStore cleared
17/03/21 16:03:56 INFO storage.BlockManager: BlockManager stopped
17/03/21 16:03:56 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/03/21 16:03:56 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/21 16:03:56 INFO spark.SparkContext: Successfully stopped SparkContext
17/03/21 16:03:56 INFO csharp.CSharpBackend: Requesting to close all call back sockets.
17/03/21 16:03:56 INFO csharp.CSharpRunner: Closing CSharpBackend
17/03/21 16:03:56 INFO csharp.CSharpBackend: Requesting to close all call back sockets.
17/03/21 16:03:56 INFO csharp.CSharpRunner: Return CSharpBackend code 0
17/03/21 16:03:56 INFO csharp.Utils: Utils.exit() with status: 0, maxDelayMillis: 1000
17/03/21 16:03:56 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/03/21 16:03:56 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Shutdown hook called before final status was reported.)
17/03/21 16:03:56 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
17/03/21 16:03:56 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://Master:9000/user/test1/.sparkStaging/application_1490137059379_0002
17/03/21 16:03:57 INFO util.ShutdownHookManager: Shutdown hook called
17/03/21 16:03:57 INFO util.ShutdownHookManager: Deleting directory /usr/local/hadoop-2.7.3/tmp/nm-local-dir/usercache/test1/appcache/application_1490137059379_0002/spark-066969d3-fcf9-44b3-8d3e-11801345971b
skaarthik commented 7 years ago

Looks like some exception is thrown after Pi is calculated successfully. You might be able to find the exception details in the YARN logs. You could also try running the app in client mode to get the exception details in your command prompt.
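
For reference, that would be the same submit command as before with the deploy mode switched, e.g. (same paths as the earlier run, shown only as an illustration):

sparkclr-submit.sh --master yarn --deploy-mode client --exe SparkClrPi.exe /usr/lib/sparkclr/examples/Batch/pi/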

tfiasco commented 7 years ago

The SparkClrPi example returns a SUCCEEDED final status in YARN client mode, but in YARN cluster mode it is FAILED.

skaarthik commented 7 years ago

I guess the timeout for exiting the application master gracefully is not long enough.

If you do not mind compiling the Mobius code, you could increase the timeout from 1 second to something like 5 or 10 seconds and build a Mobius jar file to use. This will most likely solve the problem.

im281 commented 6 years ago

Did this solve the problem, and is the fix in the latest release? I am getting the same error.

gokl02 commented 4 years ago

You need to explicitly call SparkContext.Stop(). By default the ShutdownHookManager stops it, and if that takes more than 10s an InterruptedException is thrown even though the job had already succeeded.
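
A minimal C# sketch of that advice (a rough illustration, not code from this issue; it assumes the SparkConf/SparkContext types from Mobius' Microsoft.Spark.CSharp.Core namespace seen in the logs above, and the job body is only a placeholder):

using Microsoft.Spark.CSharp.Core;

class ExplicitStopExample
{
    static void Main(string[] args)
    {
        // Build the context the usual way; the app name is illustrative.
        var sparkContext = new SparkContext(new SparkConf().SetAppName("ExplicitStopExample"));
        try
        {
            // ... run the job here (e.g. the Pi calculation shown in the logs above) ...
        }
        finally
        {
            // Stop the context explicitly instead of leaving it to the shutdown hook,
            // which is time-bounded and can surface a failure even when the job succeeded.
            sparkContext.Stop();
        }
    }
}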