pishen / sbt-lighter

SBT plugin for Apache Spark on AWS EMR
Apache License 2.0

help debugging #13

Closed: lolaclinton closed this issue 7 years ago

lolaclinton commented 7 years ago

Sorry to keep asking questions. I tried to run my job and it crashed. I'm not sure how to debug it from this error report:

last *:sparkMonitor
[info] Found cluster j-37KJST2B19MM3, start monitoring.
java.lang.RuntimeException: Cluster terminated with abnormal step.
    at scala.sys.package$.error(package.scala:27)
    at sbtemrspark.EmrSparkPlugin$$anonfun$baseSettings$26.checkStatus$1(EmrSparkPlugin.scala:247)
    at sbtemrspark.EmrSparkPlugin$$anonfun$baseSettings$26.apply(EmrSparkPlugin.scala:257)
    at sbtemrspark.EmrSparkPlugin$$anonfun$baseSettings$26.apply(EmrSparkPlugin.scala:222)
    at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
    at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
    at sbt.std.Transform$$anon$4.work(System.scala:63)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
    at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
    at sbt.Execute.work(Execute.scala:237)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
    at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
    at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

lolaclinton commented 7 years ago

I'm guessing this has to do with S3 credentials. Is there a way to store them outside the code or configuration, so that if someone else uses my code, the EMR job runs with their credentials? Also, I can't find the log files for this cluster (j-37KJST2B19MM3). Which bucket should they be in? I redirected the log files as shown in your documentation, but there is no info there ...

pishen commented 7 years ago

For the credentials, you can use sparkInstanceRole := ... and make sure the role it points to (default is EMR_EC2_DefaultRole) has permission to read your S3 bucket.
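As a rough sketch of that setting in build.sbt: the role name below is hypothetical, and this assumes sparkInstanceRole takes the role name as a string. Because permissions come from the instance role rather than from keys in the code, anyone who runs the build uses whatever role exists in their own AWS account.

```scala
// build.sbt (fragment)
// Assumption: "MyEmrS3ReadRole" is a hypothetical IAM instance role that you
// have created yourself and granted read access to the job's S3 bucket.
// If this setting is left out, the default EMR_EC2_DefaultRole is used.
sparkInstanceRole := "MyEmrS3ReadRole"
```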

For the logging, follow the settings at https://github.com/pishen/sbt-emr-spark#to-set-the-s3-logging-folder-for-emr-cluster. After the logs are written to the S3 location, you should find your Spark logs in a sub-folder similar to:

j-xxxxxxxxxxxx/containers/application_xxxxxxxxxxxxx_0001/container_xxxxxxxxxxxxx_0001_01_000001/stderr.gz
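Once a stderr.gz like the one above has been downloaded to a local disk, it still has to be decompressed before it can be searched. The following is a minimal sketch using only the JDK; the object name ReadContainerLog and the choice to filter on "Exception" or "Error" are illustrative assumptions, not part of the plugin.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import scala.collection.mutable.ListBuffer

// Hypothetical helper, not part of sbt-emr-spark.
object ReadContainerLog {

  /** Returns the lines of a gzip-compressed text file that mention "Exception" or "Error". */
  def errorLines(path: String): List[String] = {
    val reader = new BufferedReader(
      new InputStreamReader(
        new GZIPInputStream(new FileInputStream(path)),
        StandardCharsets.UTF_8))
    try {
      val hits = ListBuffer.empty[String]
      var line = reader.readLine()
      while (line != null) {
        if (line.contains("Exception") || line.contains("Error")) hits += line
        line = reader.readLine()
      }
      hits.toList
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // args(0): local path to a downloaded stderr.gz
    errorLines(args(0)).foreach(println)
  }
}
```

The filter is deliberately crude; for a first look it is usually enough to find the first stack trace in the driver container's stderr.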

lolaclinton commented 7 years ago

Thanks for the advice. It seems like the cluster terminates the moment the job crashes. So while I can access the log, I can't look at the errors on the cluster itself. Is there a setting that keeps the cluster alive after a failure? I tried starting the cluster in advance, but the behavior was the same: it terminated.

pishen commented 7 years ago

withKeepJobFlowAliveWhenNoSteps is true when you create the cluster in advance: https://github.com/pishen/sbt-emr-spark/blob/master/src/main/scala/EmrSparkPlugin.scala#L137

withActionOnFailure is ActionOnFailure.CONTINUE by default: https://github.com/pishen/sbt-emr-spark/blob/master/src/main/scala/EmrSparkPlugin.scala#L269
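For readers unfamiliar with these two AWS SDK for Java (v1) calls, here is a hedged sketch of how they are typically used. It is not the plugin's actual code: the step name, main class, and jar location are hypothetical, and it needs the aws-java-sdk-emr artifact on the classpath. Note that withActionOnFailure takes an ActionOnFailure value (or its string name), not a Boolean.

```scala
import com.amazonaws.services.elasticmapreduce.model.{
  ActionOnFailure, HadoopJarStepConfig, JobFlowInstancesConfig, StepConfig
}

// Illustrative only; names and paths below are assumptions.
object EmrFailureSettings {

  // Keep the cluster running once there are no more steps to execute.
  val instances: JobFlowInstancesConfig =
    new JobFlowInstancesConfig().withKeepJobFlowAliveWhenNoSteps(true)

  // If this step fails, move on to the next step instead of terminating the cluster.
  val step: StepConfig = new StepConfig()
    .withName("Spark Step") // hypothetical step name
    .withActionOnFailure(ActionOnFailure.CONTINUE)
    .withHadoopJarStep(
      new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        // hypothetical main class and jar location
        .withArgs("spark-submit", "--class", "example.Main", "s3://my-bucket/my-job.jar"))
}
```

The other ActionOnFailure values (TERMINATE_CLUSTER, CANCEL_AND_WAIT, and the older TERMINATE_JOB_FLOW) control whether a failed step shuts the cluster down or leaves it waiting.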

I'm not sure how it can still terminate automatically with these settings. If your job does nothing except throw a RuntimeException so that the step fails, does the cluster still terminate automatically?
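The failure test suggested above could be as small as the following sketch. The object name FailingJob is made up for illustration; it would be set as the main class of the job submitted to the cluster.

```scala
// Hypothetical main class used only to check how the cluster reacts to a failed step.
object FailingJob {
  def main(args: Array[String]): Unit = {
    // Fail immediately, without touching Spark or S3, so that any termination
    // afterwards is caused by the cluster settings and not by the job's logic.
    throw new RuntimeException("Intentional failure to test cluster termination behavior")
  }
}
```

If the cluster survives this step, the termination seen earlier is more likely tied to how the cluster or step was created than to the exception itself.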

lolaclinton commented 7 years ago

Well it seems to stop now. I manually added withActionOnFailure(true). Connecting to the cluster didn't help much though. I'm seeing a Hadoop exit code 15, which is very hard to decipher. Do you have any ideas on how to get more information? On Stack Overflow I see people recommending a look at the YARN logs. Do they exist in this situation? Thanks :(
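One common way to look at YARN logs is the yarn logs -applicationId command, run on the master node. The sketch below wraps it with scala.sys.process; it assumes the yarn CLI is on the PATH, that the application has finished, and that log aggregation is enabled. The object name YarnLogs is hypothetical.

```scala
import scala.sys.process._

// Hypothetical wrapper around the `yarn logs` CLI; intended to run on the EMR master node.
object YarnLogs {

  /** Builds the `yarn logs` command line for a single application. */
  def command(applicationId: String): Seq[String] =
    Seq("yarn", "logs", "-applicationId", applicationId)

  def main(args: Array[String]): Unit = {
    // args(0): a YARN application ID, i.e. the application_... part of the log path
    val exitCode = command(args(0)).! // streams the aggregated logs to the console
    sys.exit(exitCode)
  }
}
```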

lolaclinton commented 7 years ago

So I managed to get there by logging into the master :) It seems the bug wasn't mine; it had to do with this: https://github.com/aws/aws-sdk-java/issues/1094. FYI, it would be wonderful if your system had a way to pull this data out easily. Glad everything is working though :)

pishen commented 7 years ago

@lolaclinton How did you find the error log with the IllegalAccessError? If you have clear instructions on how to get that log, maybe we can figure out how to fetch it programmatically.

pishen commented 7 years ago

@lolaclinton The issue seems to be fixed in EMR 5.8.0. If you still run into problems, feel free to tell me :)