nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Look for Spark's make-distribution.sh script in its new location (plus its current one) #93

Closed. BenFradet closed this 8 years ago.

BenFradet commented 8 years ago

fixes #91

nchammas commented 8 years ago

Looks good to me.

Have you tested this both against a recent Spark commit, like 99b7187c2dce4c73829b9b32de80b02a053763cc, as well as an older Spark commit from before the move of make-distribution.sh, like f19228eed89cf8e22a07a7ef7f37a5f6f8a3d455?
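For readers following along, the lookup this PR's title describes can be sketched in shell. This is a minimal illustration, not the code merged here: `find_make_distribution` is a hypothetical helper name, and it assumes the only two candidate locations are `dev/make-distribution.sh` (newer Spark commits) and `make-distribution.sh` in the repository root (older ones).

```shell
# Hypothetical helper (not from Flintrock): print the path of
# make-distribution.sh inside a Spark checkout, preferring the newer
# dev/ location and falling back to the repository root.
find_make_distribution() {
    spark_dir="$1"
    for candidate in \
        "$spark_dir/dev/make-distribution.sh" \
        "$spark_dir/make-distribution.sh"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Neither location exists: report on stderr and signal failure.
    echo "make-distribution.sh not found under $spark_dir" >&2
    return 1
}

# Usage sketch (checkout path and profile flag are illustrative):
# script="$(find_make_distribution /tmp/spark)" && "$script" -Phadoop-2.6
```

Checking the newer location first means a checkout that somehow contains both files would use the current script.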

BenFradet commented 8 years ago

I tested the script itself, but not as part of Flintrock. Should I?

BenFradet commented 8 years ago

While trying to test my change, I'm getting:

paramiko.ssh_exception.SSHException: not a valid EC private key file

despite having a properly formatted .pem file.

Do you have any idea what could be causing this?

nchammas commented 8 years ago

Hmm, I've never seen that error before. It seems to be ultimately coming from EC2?

Are you able to use that same private key file to log into EC2 instances outside of Flintrock?

BenFradet commented 8 years ago

Found my problem: the user was misconfigured.

I tested the change against today's commit: apache/spark@4eace4d384f0e12b4934019d8654b5e3886ddaef and the latest in the 1.6 branch: apache/spark@db4795a7eb1bac039e9e96237cf77e47ed76dde8

The build starts correctly. However, the Spark core project won't compile (possibly because I'm using t2.micro instances).

nchammas commented 8 years ago

Yeah, to build Spark in a reasonable amount of time you'd need at least m3.xlarge instances.

Thanks for contributing this patch and testing it out! I'll merge this in.

nchammas commented 8 years ago

Hmm, actually I'm having trouble getting this to work against the latest commit of Spark. I get this error:

<snipped>
+ VERSION='[ERROR] Re-run Maven using the -X switch to enable full debug logging.'

Do you get the same error? This may be a subtle change on Spark's side that we have to handle.
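The `+ VERSION=...` line above is shell trace output: the variable that should hold Spark's version ended up holding the last line of Maven's error output. A minimal sketch of how that can happen, assuming the script captures the version by keeping only the last line of a Maven call's output (an assumption about the script, not something shown in this thread). `fake_mvn_ok`, `fake_mvn_fail`, `capture_version`, and the version string `2.0.0-SNAPSHOT` are all hypothetical stand-ins:

```shell
# Hypothetical stand-ins for a Maven call that prints the project version.
fake_mvn_ok() {
    echo "[INFO] Scanning for projects..."
    echo "2.0.0-SNAPSHOT"    # illustrative version string
}

fake_mvn_fail() {
    echo "[ERROR] Some build failure"
    echo "[ERROR] Re-run Maven using the -X switch to enable full debug logging."
}

# Assumed capture pattern: keep only the last line of whatever was printed.
# Nothing here checks whether that line actually looks like a version.
capture_version() {
    "$1" | tail -n 1
}

VERSION_OK="$(capture_version fake_mvn_ok)"
VERSION_BAD="$(capture_version fake_mvn_fail)"
echo "ok:  $VERSION_OK"
echo "bad: $VERSION_BAD"
```

Under that assumption, any Maven failure in the version lookup surfaces later as a nonsensical version string rather than as an immediate error.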

BenFradet commented 8 years ago

On m3.xlarge I get the same error as you, which I didn't get on t2.micro. Weird. I'll keep investigating and keep you posted.

BenFradet commented 8 years ago

After cd-ing into the dev dir before calling make-distribution.sh, I get the following when trying to compile:

[info] Error occurred during initialization of VM
[info] java.lang.Error: Properties init: Could not determine current working directory.
[info]     at java.lang.System.initProperties(Native Method)
[info]     at java.lang.System.initializeSystemClass(System.java:1166)

BenFradet commented 8 years ago

Apparently the parallel build option (-T 1C) is causing it to fail.

The first Maven command, /tmp/spark/build/mvn help:evaluate -X -Dexpression=project.version -T 1C -Phadoop-2.6, fails with:

[ERROR] java.util.concurrent.ExecutionException: java.lang.NullPointerException
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder.multiThreadedProjectTaskSegmentBuild(MultiThreadedBuilder.java:170)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder.build(MultiThreadedBuilder.java:91)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder.multiThreadedProjectTaskSegmentBuild(MultiThreadedBuilder.java:166)
    ... 16 more
Caused by: java.lang.NullPointerException
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Also, there is a warning regarding parallel execution which might be causing the failure:

[WARNING]
[WARNING] * Your build is requesting parallel execution, but project
[WARNING] * contains the following plugin(s) that have goals not marked
[WARNING] * as @threadSafe to support parallel building.
[WARNING] * While this /may/ work fine, please look for plugin updates
[WARNING] * and/or request plugins be made thread-safe.
[WARNING] * If reporting an issue, report it against the plugin in
[WARNING] * question, not against maven-core
[WARNING] *
[WARNING] The following goals are not marked @threadSafe in Spark Project Parent POM:
[WARNING] org.apache.maven.plugins:maven-help-plugin:2.2:evaluate
[WARNING] *****

Are you ok with removing it?

BenFradet commented 8 years ago

Btw, I tested the script as a part of flintrock with the two previously mentioned commits and it worked in both cases (having removed -T 1C from the 2.0 script).

nchammas commented 8 years ago

I think something else is going on.

If I clone Spark locally and run

./dev/make-distribution.sh -T 1C -Phadoop-2.6

it works fine against the latest commit. This smells like something related to the shell environment over SSH.

Interestingly, it seems that the commit that moved make-distribution.sh (0eea12a3d956b54bbbd73d21b296868852a04494) is not responsible for the problem we are seeing, since I was just able to launch a cluster at that commit.

I think a good next step would be to try to find the exact Spark commit that breaks this. I'll poke around more myself later this week to try to find it.

Sorry this turned into more than a simple change @BenFradet!

I'd really like to keep -T 1C working, since people building Spark during cluster launches will really benefit from the shorter build times. It can be the difference between a 30-minute build and a 10-minute or even shorter build, depending on how many cores your cluster instances have.

BenFradet commented 8 years ago

For me, apache/spark@4eace4d fails to build both locally and remotely with ./dev/make-distribution.sh -T 1C -Phadoop-2.6.

I'll investigate later commits.

nchammas commented 8 years ago

I found it. This is the commit that breaks -T 1C: https://github.com/apache/spark/commit/6ca990fb366cf68cd9d5afb433725d28f07e51a0

Source PR: https://github.com/apache/spark/pull/11178

BenFradet commented 8 years ago

Mmh interesting

nchammas commented 8 years ago

Revisiting the error message you posted above @BenFradet, it looks like some project changes are interfering with the parallel build option, as you pointed out. :disappointed: That PR I linked to is probably just where this change was introduced.

So I now agree with your earlier suggestion: the simplest thing to do is to remove the -T 1C.

BenFradet commented 8 years ago

ok, will do