thunder-project / thunder

scalable analysis of images and time series
http://thunder-project.org
Apache License 2.0

thunder-ec2 crashing during launch with spark 1.5.2 and spark 1.6.1 #272

Open nerduno opened 8 years ago

nerduno commented 8 years ago

As of yesterday, thunder-ec2 crashes when launching a cluster with the error below. When I log into the master of the incompletely-launched cluster, /root/spark is largely empty, containing just an incomplete conf folder. I'm still investigating the cause, but I'm curious whether others are seeing this too.

thunder-ec2 -k <keyname> -i <keyfile> -t r3.8xlarge -s 1 launch test_launch --ganglia --additional-security-group=<securitygroup>
...
Cluster is now in 'ssh-ready' state. Waited 620 seconds.

    [Generating cluster's SSH key on master]
    [success]
    [Transferring cluster's SSH key to slaves]
    [success]
aatest: git clone https://github.com/amplab/spark-ec2 -b branch-1.5 spark-ec2
    [Deploying files to master]
    [success]
    [Installing Spark (may take several minutes)]
    [success]
    [Downloading Anaconda]
    [success]
    [Installing Anaconda]
    [success]
    [Updating Anaconda libraries]
    [success]
    [Copying Anaconda to workers]
    [success]
    [Installing Thunder]
    [success]
    [Configuring Spark for Thunder]
    [SSH failure, returning error]
Traceback (most recent call last):
  File "/Users/andalman/anaconda/lib/python2.7/site-packages/thunder/utils/ec2.py", line 544, in <module>
    configure_spark(master, opts)
  File "/Users/andalman/anaconda/lib/python2.7/site-packages/thunder/utils/ec2.py", line 247, in configure_spark
    ssh(master, opts, "sed 's/log4j.rootCategory=INFO/log4j.rootCategory=ERROR/g' "
  File "/Users/andalman/anaconda/lib/python2.7/site-packages/thunder/utils/ec2.py", line 300, in ssh
    raise Exception(stdout)
Exception: sed: can't read /root/spark/conf/log4j.properties.template: No such file or directory
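
To see the state of the master for yourself, something like the following works (a sketch; substitute your own keyfile and the master's public DNS, and note the /root/spark paths are spark-ec2's defaults):

    # Inspect the half-launched master: /root/spark should hold a full Spark
    # distribution, but here it contains only a partial conf folder.
    ssh -i <keyfile> root@<master-public-dns> 'ls -la /root/spark /root/spark/conf'
    # The specific file the "Configuring Spark for Thunder" step needs:
    ssh -i <keyfile> root@<master-public-dns> 'ls -l /root/spark/conf/log4j.properties.template'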
npyoung commented 8 years ago

Thunder version? Latest Spark locally? I've had some cluster startup issues recently that came down to using a stale local Spark with the current Thunder release on PyPI.

nerduno commented 8 years ago

Local Thunder (installed using pip) is 0.6.0. Local Spark is spark-1.6.1-bin-hadoop2.4.

Have you tried booting a cluster today? Would you mind attempting it?

Thanks.

npyoung commented 8 years ago

Similar error, getting Exception: File or directory /root/thunder/thunder/utils/data/ doesn't exist!

nerduno commented 8 years ago

@npyoung I'm not sure whether that's a related error. I consistently get exactly the error message I pasted above.

I may have tracked the problem down to the portion of spark-ec2/setup.sh that installs Spark. The problem appears to be that the Spark distribution is missing from its S3 hosting. Here is the relevant bit of output from spark-ec2/setup.sh, which is run by thunder's ec2.py:

(Note that before tracking this down I had downgraded to Spark 1.5.2 in the hope that it might solve the problem, which is why the log below references 1.5.2.)

[timing] scala init:  00h 00m 10s
Initializing spark
--2016-04-08 00:40:34--  http://s3.amazonaws.com/spark-related-packages/spark-1.5.2-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.13.160
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.13.160|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-04-08 00:40:34 ERROR 404: Not Found.

ERROR: Unknown Spark version
spark/init.sh: line 137: return: -1: invalid option
return: usage: return [n]
Unpacking Spark
tar (child): spark-*.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove `spark-*.tgz': No such file or directory
mv: missing destination file operand after `spark'
Try `mv --help' for more information.
[timing] spark init:  00h 00m 00s
Initializing shark
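
You can reproduce the 404 from outside the cluster; the URL below is copied verbatim from the log above:

    # The 1.5.2 package the launch scripts try to fetch simply isn't there:
    curl -sI http://s3.amazonaws.com/spark-related-packages/spark-1.5.2-bin-hadoop1.tgz | head -n 1
    # HTTP/1.1 404 Not Found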
nerduno commented 8 years ago

The spark installation issue for 1.6.1 is similar but slightly different:

[timing] scala init:  00h 00m 14s
Initializing spark
--2016-04-08 01:21:31--  http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.34.104
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.34.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277258240 (264M) [application/x-compressed]
Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’

100%[=========================================================================================================================================>] 277,258,240 54.3MB/s   in 5.1s   

2016-04-08 01:21:36 (52.2 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved [277258240/277258240]

Unpacking Spark

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
mv: missing destination file operand after `spark'
Try `mv --help' for more information.
[timing] spark init:  00h 00m 05s
Initializing shark
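
Here the download succeeds, so the problem is the file's contents. Fetching the same URL directly should confirm that the hosted tarball isn't valid gzip:

    wget -q http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz
    file spark-1.6.1-bin-hadoop1.tgz      # a healthy package reports "gzip compressed data"
    gzip -t spark-1.6.1-bin-hadoop1.tgz   # exits non-zero if the archive is corrupt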
nerduno commented 8 years ago

After installing Spark 1.6.0, everything works as expected.

My assessment is that sometime in the past few days the S3-hosted copy of Spark 1.6.1 was corrupted, such that the spark-ec2/setup.sh script could not unpack it. This is what caused the thunder-ec2 script to begin crashing.

By pure coincidence, my attempt to work around the problem by reverting to Spark 1.5.2 also failed, because that version of Spark is missing from S3 entirely (hence the 404 above).

By switching to spark 1.6.0, I can successfully launch clusters using thunder-ec2.

Resolving this issue will require fixing the S3 hosting of the Spark 1.6.1 and 1.5.2 packages.
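
For anyone else hitting this, the workaround is just a local downgrade. A rough sketch (the archive.apache.org URL pattern and the SPARK_HOME convention are assumptions; adjust to your setup):

    # Swap the local Spark 1.6.1 for 1.6.0, whose package is hosted intact.
    wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.4.tgz
    tar -xzf spark-1.6.0-bin-hadoop2.4.tgz
    # Point the launch tooling at the downgraded install:
    export SPARK_HOME="$PWD/spark-1.6.0-bin-hadoop2.4"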

freeman-lab commented 8 years ago

@nerduno @npyoung nice job tracking this down! I imagine the S3 version will get fixed soon, but you might want to post to the Spark mailing lists.

This certainly sounds frustrating, and this kind of thing is one of the reasons that, as of thunder 1.0.0 (now published to PyPI), we'll no longer include thunder-ec2. You can certainly still use it if you stick with 0.6.x, modulo the issue above, of course!

But moving forward, I'd rather work on a separate standalone utility that offers similar functionality, but is not so tightly coupled to thunder, and doesn't wrap spark-ec2, which was a huge maintenance burden.

All thunder-ec2 really did was provide a nicer CLI around spark-ec2 and install a more modern scientific Python stack; the part that installed thunder itself was pretty trivial. My current plan is to try writing something that does just this on top of flintrock, which, among other things, is way, way faster than spark-ec2. Thoughts?
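
For a sense of what that would build on, a bare flintrock launch looks roughly like this (a sketch; flag names are as in flintrock's README at the time, so verify with flintrock launch --help, and most settings can also live in its config.yaml):

    flintrock launch test_launch \
        --num-slaves 1 \
        --spark-version 1.6.0 \
        --ec2-key-name <keyname> \
        --ec2-identity-file <keyfile> \
        --ec2-instance-type r3.8xlarge \
        --ec2-ami <amazon-linux-ami> \
        --ec2-user ec2-user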

broxtronix commented 8 years ago

I believe that they just ran into this corrupt Spark package problem in flintrock: https://github.com/nchammas/flintrock/issues/101

I've been using flintrock for a few months now. It's mature enough for everyday use and surpasses the spark-ec2 scripts in several respects. Mostly I've been enjoying how fast it is, but it has also proven more customizable. I'd definitely recommend migrating to it.

The cluster it sets up has Spark and Hadoop but not much else; still, a quick script that installs Thunder and a few key Python packages (see the sketch below) would make it easy for users to customize a vanilla flintrock cluster to their needs. Among other things, the default root volume on flintrock clusters is larger than the 10GB limit of the spark-ec2 script, so you can install additional packages without worrying about running out of disk space.
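
As a starting point, something like this could bootstrap the extras (the Miniconda URL and the thunder-python package name are assumptions to adapt; flintrock's run-command executes on every node):

    # Hypothetical bootstrap for a vanilla flintrock cluster: install a modern
    # Python on each node, then pip-install Thunder into it.
    flintrock run-command test_launch '
        wget -q https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
        bash Miniconda2-latest-Linux-x86_64.sh -b -p "$HOME/miniconda"
        "$HOME/miniconda/bin/pip" install thunder-python
    '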

nerduno commented 8 years ago

Yes, I very much like the idea of a script that installs Anaconda, Thunder, etc. on top of a flintrock cluster. I haven't tried flintrock, but it's high on my to-do list, primarily because of the default root volume size problem @broxtronix mentioned -- very frustrating. Relatedly, do you know of a way to force Spark to clean up temporary files on the slaves?

npyoung commented 8 years ago

I'd just pssh -h ~/spark-ec2/slaves "rm -r /tmp/*". /tmp should just empty itself on reboot though.
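
Spelled out, and also covering Spark's scratch space (assuming spark-ec2's usual /mnt/spark for spark.local.dir; adjust if yours differs):

    # Wipe /tmp plus Spark's shuffle/scratch dirs on every slave.
    pssh -h ~/spark-ec2/slaves -l root 'rm -rf /tmp/* /mnt/spark/*'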