qubole / spark-on-lambda

Apache Spark on AWS Lambda
Apache License 2.0

Compiling #2

Open saj9191 opened 6 years ago

saj9191 commented 6 years ago

Hello, I'm trying to build Spark on Lambda. When I run

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests

the Spark Project Launcher module fails with the following error:

[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Failure to find com.hadoop.gplcompression:hadoop-lzo:jar:0.4.19 in https://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

I tried to explicitly add hadoop-lzo as a dependency in the launcher pom.xml, but I still get the same error. Is there something I need to download or change to get this to work?

Thanks!

venkata91 commented 6 years ago

Hi saj9191,

It seems something changed on our side, where we host the Maven artifacts. We'll fix it and update you here. Thanks for trying it out, and sorry for the inconvenience.

faromero commented 6 years ago

I am having the same issue (I also tried adding the hadoop-lzo dependency manually to pom.xml, with no success). Have there been any updates on resolving this issue?

venkata91 commented 6 years ago

We were also hitting this issue recently. I will get back with a fix soon and post it here. Thanks for taking the time to try it out.

faromero commented 6 years ago

I believe I have found a solution: In spark-on-lambda/common/network-common/pom.xml, add the following dependency (as suggested previously):

<dependency>
  <groupId>com.hadoop.gplcompression</groupId>
  <artifactId>hadoop-lzo</artifactId>
  <version>0.4.19</version>
</dependency>

Then, in spark-on-lambda/pom.xml, add the following repository (which "houses" hadoop-lzo):

<repository>
  <id>twitter</id>
  <name>Twitter Repository</name>
  <url>http://maven.twttr.com</url>
</repository>

After this, I ran the make-distribution.sh command from your README and was able to build it all the way through.
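One note on the error text itself: "was cached in the local repository" indicates that Maven recorded the failed lookup and will not retry it until the update interval has passed. If the build still fails after the repository is added, clearing that cached failure may help. Below is a minimal sketch, assuming the default local repository at ~/.m2/repository; the function name clear_cached_failures is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: remove Maven's cached "not found" markers for hadoop-lzo so the
# next build retries the download. Assumes the default local repository
# location; pass a different path as the first argument if yours differs.
clear_cached_failures() {
  local repo="${1:-$HOME/.m2/repository}"
  local artifact_dir="$repo/com/hadoop/gplcompression/hadoop-lzo"
  # Nothing to do if Maven never tried to resolve the artifact.
  [ -d "$artifact_dir" ] || return 0
  # Maven records failed lookups in *.lastUpdated files next to the artifact.
  find "$artifact_dir" -type f -name '*.lastUpdated' -delete
}
```

Calling clear_cached_failures and then rerunning the make-distribution.sh command should make Maven attempt the download again. Appending Maven's -U flag to the build command may have the same effect, assuming the script forwards extra arguments to Maven.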

venkata91 commented 6 years ago

Nice workaround! Let me also try it and update it.

venkata91 commented 6 years ago

Also, may I ask what your use case is, or are you just trying it out?

faromero commented 6 years ago

Thanks for working to update it!

We are working on a research project on using Lambda for what we call "interactive massively parallel" applications, and wanted to compare Spark-on-Lambda to the current state of the art, as well as to our own work!

By the way, regarding your blog post: is the data you used for sorting 100 GB in under 10 minutes available?

venkata91 commented 6 years ago

Interesting! Can you please elaborate a bit more on that? By the way, the data is generated with the TeraGen utility from https://github.com/ehiggs/spark-terasort, which you can use to generate it yourself.

faromero commented 6 years ago

You can view our work here: we call it gg, and while it was originally intended for compilation, it now supports general-purpose applications (as simple as sorting and as complex as video encoding). Let me know if you have any questions about it (we can move to a different forum instead of this issue thread).

I will try to run your sorting example and let you know if I have any issues!

venkata91 commented 6 years ago

Another, easier workaround is to remove the pom.xml additions, basically reverting the commit "Fix pom.xml to have the other Qubole repository location having 2.6.0... (2ca6c68ed5)".

Build your package using this command:

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -DskipTests

Finally, download the jars below and add them to the classpath before starting spark-shell:

1. wget http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
2. wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

Refer here - https://markobigdata.com/2017/04/23/manipulating-files-from-s3-with-apache-spark/
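The last step could look something like the sketch below. It only builds the comma-separated value that spark-shell's --jars option expects. The directory name extra-jars and the helper join_jars are illustrative assumptions, not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: join jar paths with commas, the format spark-shell's --jars expects.
# join_jars is a hypothetical helper, not something shipped with Spark.
join_jars() {
  local IFS=,
  echo "$*"
}

# Hypothetical download directory for the two jars listed above.
JAR_DIR="${JAR_DIR:-extra-jars}"
EXTRA_JARS="$(join_jars "$JAR_DIR/aws-java-sdk-1.7.4.jar" "$JAR_DIR/hadoop-aws-2.7.3.jar")"

# Example invocation, left commented out so the snippet has no side effects:
# ./bin/spark-shell --jars "$EXTRA_JARS"
```

Passing the jars with --jars keeps them out of the distribution itself, so the same build can be reused with different SDK versions if needed.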

webroboteu commented 5 years ago

Hi venkata91, I sent you an email. I'm looking for an advisor for my startup, a Spark-based web scraping service. The idea is to use this serverless computation, but I'm having problems. When you have time, I would like to discuss it in more depth.