nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Launch cluster with AMI that already has spark #308

Closed pwsiegel closed 3 years ago

pwsiegel commented 4 years ago

I tend to use Flintrock with custom builds of Spark. Normally I host the build somewhere and point the `download-source` parameter in the Flintrock config at it, and this works fine. But I thought it might be convenient to create an AMI (starting from Amazon Linux 2) with Java and Spark already installed and set `install-spark` to `False` in the Flintrock config, so I gave this a try. The cluster launched as expected, but when I tried to start a Spark session I got:

`WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master [IP-ADDRESS]`

It retried for a while and eventually errored out. Is there a known way to get this to work?

One last note: when I allow Flintrock to install Spark, there is normally a message at the end of the launch process that says `Configuring Spark master...`. I didn't see that when I set `install-spark` to `False`.
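For reference, the setup I described corresponds roughly to a config like this. The key names follow Flintrock's `config.yaml` template as I understand it; the AMI ID, URL, and version are placeholders:

```yaml
# Sketch of the described setup; values are placeholders.
services:
  spark:
    version: 3.0.1
    download-source: "https://example.com/my-spark-build.tgz"  # placeholder URL

providers:
  ec2:
    ami: ami-xxxxxxxx   # placeholder: custom AMI with Java + Spark baked in
    user: ec2-user

launch:
  num-slaves: 2
  install-spark: False  # tells Flintrock not to install Spark
```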

Thank you!

nchammas commented 4 years ago

It's probably because Spark isn't being configured with the addresses of the nodes in your cluster (which would happen as part of the "Configuring Spark master" step).

Off the top of my head, I think to get this to work you'd need to set `install-spark` to `True` and then update the `install()` method of the Spark service class to skip downloading Spark. It's a hack, but it will get you going.
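A very rough sketch of what that hack might look like. The class and method names here are assumptions about flintrock's internals (the real code lives in flintrock's services module and will differ), and the `spark_is_preinstalled` flag is hypothetical:

```python
# Hypothetical sketch: short-circuit the Spark service's install() step
# when Spark is already baked into the AMI. Not flintrock's actual API.

class Spark:
    def __init__(self, version, spark_is_preinstalled=False):
        self.version = version
        self.spark_is_preinstalled = spark_is_preinstalled  # hypothetical flag

    def install(self, ssh_client, cluster):
        if self.spark_is_preinstalled:
            # Skip the download/extract entirely; the later configuration
            # step (which writes the master/worker addresses) still runs.
            return "skipped"
        # ... original download + extract logic would go here ...
        return "installed"
```

The point of keeping `install-spark` set to `True` is that the "Configuring Spark master" step still runs even though the download is skipped.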

A more proper fix would perhaps be to add a new `download-destination` configuration and have `install()` skip the download if the destination is already populated. That way you can keep `install-spark` enabled and Flintrock will skip the download but still do the necessary configuration.

pwsiegel commented 4 years ago

Got it, thank you. I'm not sure if/when I'll have the time, but would you consider a PR in the direction of your second suggestion?

And one last question, for my own understanding: what is the intended use of the `install-spark` parameter?

nchammas commented 4 years ago

Yes, I would consider a PR along those lines.

If you just want a cluster with HDFS, or you plan to do the Spark configuration yourself, then setting `install-spark` to `False` is what you want.

In other words, you can make things work with Flintrock as it is today by setting `install-spark` to `False`, but then you also need to call `flintrock run-command ...` and do the Spark configuration yourself.
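For anyone following along, "doing the Spark configuration yourself" means roughly what the configure step would do: point every node at the master and start the standalone daemons. A hedged sketch, with placeholder addresses, following Spark's standalone-mode docs (on Spark 2.x the worker script is `start-slave.sh`):

```sh
# Sketch only; <master-private-ip> is a placeholder you must fill in.
# Run these on the relevant nodes via `flintrock run-command` or SSH.

# On every node: record the master address for Spark standalone mode.
echo "export SPARK_MASTER_HOST=<master-private-ip>" >> $SPARK_HOME/conf/spark-env.sh

# On the master:
$SPARK_HOME/sbin/start-master.sh

# On each worker:
$SPARK_HOME/sbin/start-worker.sh spark://<master-private-ip>:7077
```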

The `download-destination` idea would make it easier to separate downloading Spark from configuring Spark. That would be better for users in your situation, since Flintrock could still do the configuration rather than putting it on the user. Right now you have to either enable the download and the configuration together, or disable both together.

pwsiegel commented 4 years ago

That makes sense. I might have a little time to tinker this week; I'll be back if I make sufficient progress. Thanks for your help, and for Flintrock itself!

nchammas commented 3 years ago

Let's continue this discussion over on #237, which I think captures the same need expressed here.