qubole / spark-on-lambda

Apache Spark on AWS Lambda
Apache License 2.0

How to make commands execute on Lambda #3

Open faromero opened 6 years ago

faromero commented 6 years ago

How do I run commands from the spark-shell so that they are executed on Lambda? Right now, the commands are being executed locally on my machine, but I would like Lambda to be the backend.

I am running the following command to start the shell (which does start successfully): bin/spark-shell --conf spark.hadoop.fs.s3n.awsAccessKeyId=<my-key> --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<my-secret-key> --conf spark.shuffle.s3.bucket=s3://<my-bucket> --conf spark.lambda.function.name=spark-lambda --conf spark.lambda.s3.bucket=s3://<my-bucket>/lambda --conf spark.lambda.spark.software.version=149

I have created the spark-lambda function with the contents of spark-lambda-os.py and have given it S3 and EC2 permissions. In addition, the S3 bucket <my-bucket>/lambda contains the package spark-lambda-149.zip, which was put together by the spark-lambda script. Is there anything else I need to do to have it execute on Lambda?
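For reference, a rough AWS CLI sketch of the Lambda-side setup described above (the handler name, runtime, role ARN, and paths are placeholders, not values from this repo; the actual entry point is whatever spark-lambda-os.py defines):

```bash
# Upload the runtime package built by the spark-lambda script (names are placeholders).
aws s3 cp spark-lambda-149.zip s3://<my-bucket>/lambda/spark-lambda-149.zip

# Create the Lambda function from spark-lambda-os.py, zipped as the deployment package.
# The --handler value below is a guess; use the entry point defined in spark-lambda-os.py.
zip spark-lambda-os.zip spark-lambda-os.py
aws lambda create-function \
  --function-name spark-lambda \
  --runtime python2.7 \
  --handler spark-lambda-os.lambda_handler \
  --role arn:aws:iam::<account-id>:role/<lambda-exec-role> \
  --timeout 300 \
  --memory-size 3008 \
  --zip-file fileb://spark-lambda-os.zip
```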

venkata91 commented 6 years ago

The spark-shell (Spark driver) has to be brought up on an AWS EC2 instance or ECS container that is in the same VPC as the Lambda function, and you need to create the Lambda function through the AWS console. Once that's done, you should see executors connecting back to the driver; that's basically it, and you should be able to run your Spark commands over AWS Lambda. Maybe it would be better if you can share your email or some other way to reach you.

faromero commented 6 years ago

Ok, I created the spark-lambda function through the AWS console with the code from spark-lambda-os.py, and this function is in the same VPC and subnet as the EC2 instance that calls it. It also has the same security group, and has EC2 and S3 permissions. When I launch a job via spark-submit, the BlockManager outputs some information about LAMBDA and the DiskBlockManager attempts to create directories in the S3 bucket I provided. However, it still seems to be executing locally (no Lambda invocations of the spark-lambda function). Is there anything else I may be doing incorrectly?

My email can be found in my Github profile :). Thanks again for all your help in getting this up and running on Lambda!

venkata91 commented 6 years ago

I think the issue is that the LambdaSchedulerBackend is not being created; you have to pass another config, --conf spark.master=lambda:// (i.e. --master lambda://) or something like that. The code that talks to AWS Lambda to spawn executors is spark-on-lambda/core/src/main/scala/org/apache/spark/scheduler/cluster/LambdaSchedulerBackend.scala. Look for the log line "Created LambdaExecutorService".
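For example, the earlier spark-shell command with the Lambda master URL added would look something like this (keys and bucket names are the same placeholders as in the original command):

```bash
# Same launch as before, plus the Lambda master URL so LambdaSchedulerBackend is created.
bin/spark-shell \
  --master lambda:// \
  --conf spark.hadoop.fs.s3n.awsAccessKeyId=<my-key> \
  --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<my-secret-key> \
  --conf spark.shuffle.s3.bucket=s3://<my-bucket> \
  --conf spark.lambda.function.name=spark-lambda \
  --conf spark.lambda.s3.bucket=s3://<my-bucket>/lambda \
  --conf spark.lambda.spark.software.version=149
```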

faromero commented 6 years ago

Awesome, yes! Adding --master lambda:// fixes that problem. One more issue (I hope): when it finally tries to launch a job, I get the following error: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. The Lambdas I have configured have 3GB of RAM (the maximum allowed by AWS), so what could be causing the scheduler to think there aren't enough resources available? The zip itself is a little under 500MB as well.

venkata91 commented 6 years ago

Nice. I think the executors still haven't registered with the Spark driver. Please check the CloudWatch logs; they should have some info, I believe.
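For example, a hedged way to pull those logs with the AWS CLI, assuming the standard /aws/lambda/<function-name> log group naming:

```bash
# Find the most recent log streams for the function (names are placeholders).
aws logs describe-log-streams \
  --log-group-name /aws/lambda/spark-lambda \
  --order-by LastEventTime --descending --max-items 5

# Then fetch the events from one of the streams returned above.
aws logs get-log-events \
  --log-group-name /aws/lambda/spark-lambda \
  --log-stream-name '<stream-name-from-previous-command>'
```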

faromero commented 6 years ago

Unfortunately, there still aren't any invocations of spark-lambda, so there are no CloudWatch logs available to check. I also noticed that even though it says it's created directories in the S3 bucket I provided, those tmp directories aren't there...
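For what it's worth, a couple of hedged CLI checks that can confirm both observations (bucket name, function name, and time window are placeholders):

```bash
# Did anything actually get written under the shuffle/lambda prefixes?
aws s3 ls --recursive s3://<my-bucket>/

# Has the function been invoked at all? (times are placeholders, ISO-8601 UTC)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=spark-lambda \
  --start-time <start-time> --end-time <end-time> \
  --period 3600 --statistics Sum
```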

DimitarKum commented 6 years ago

Hi venkata91, faromero, I've been trying to set up spark-on-lambda and have run into basically the same issues as in this thread. Thanks for adding the extra documentation in https://github.com/faromero/spark-on-lambda, faromero. After setting up the VPC properly and adding all the extra --conf options when starting the spark-shell, I am stuck on the message mentioned above whenever I run a job: "WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources". I can also access the Spark UI via a browser, and under the Executors tab it shows only 1 executor (which I assume is the EC2 instance itself). I have also yet to see my Lambda receive any invocations. Did you manage to fix the issue?

venkata91 commented 6 years ago

Hey DimitarKum,

Thanks for trying this out. Resolving this issue is still pending on my side, and I have to update the documentation. The last time I discussed this with faromero, these are the things I did to make it work; something has changed on the AWS side between when we developed this and now.

VPC stuff - This is what I did.

1. Created a VPC - venkat-vpc
2. Created a public subnet - venkat-public-subnet - within the above VPC. This automatically comes up with an Internet gateway; check the routes and you'll see something like igw-*.
3. Created a private subnet - venkat-private-subnet - within the above VPC as well.
4. Then created a NAT gateway (nat-*) in the public subnet for the above private subnet, and added a route in the private subnet's route table that sends outbound traffic through the NAT gateway (the public subnet's route table already routes out through the Internet gateway, igw-*).

Bring up the Lambda function inside the VPC (venkat-vpc) within the private subnet (venkat-private-subnet), and an EC2 instance, where you'll run spark-submit or spark-shell, in the same VPC (venkat-vpc) but in the public subnet (venkat-public-subnet). Also create a security group with inbound rules allowing TCP and UDP from the same security group and an outbound rule allowing all traffic (0.0.0.0/0), and associate it with both the EC2 instance (Spark driver node) and the Lambda function (Spark executors). A rough CLI sketch of these steps follows.
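All IDs, CIDR blocks, and names in the sketch below are placeholders; the console steps described above achieve the same thing.

```bash
# 1. VPC with a public and a private subnet
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.0.0/24   # public subnet
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.1.0/24   # private subnet

# 2. Internet gateway, routed from the public subnet's route table
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id <igw-id> --vpc-id <vpc-id>
aws ec2 create-route --route-table-id <public-rt-id> \
  --destination-cidr-block 0.0.0.0/0 --gateway-id <igw-id>

# 3. NAT gateway in the public subnet, used as the private subnet's default route
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id <public-subnet-id> --allocation-id <eip-alloc-id>
aws ec2 create-route --route-table-id <private-rt-id> \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <nat-id>

# 4. Security group shared by the driver (EC2) and the executors (Lambda)
aws ec2 create-security-group --group-name spark-lambda-sg \
  --description "spark-on-lambda" --vpc-id <vpc-id>
aws ec2 authorize-security-group-ingress --group-id <sg-id> \
  --protocol tcp --port 0-65535 --source-group <sg-id>
aws ec2 authorize-security-group-ingress --group-id <sg-id> \
  --protocol udp --port 0-65535 --source-group <sg-id>
# (outbound defaults to allowing all traffic, 0.0.0.0/0)

# 5. Attach the Lambda function to the private subnet and the security group
aws lambda update-function-configuration \
  --function-name spark-lambda \
  --vpc-config SubnetIds=<private-subnet-id>,SecurityGroupIds=<sg-id>
```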

https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Scenario2.html

The above doc was useful; the picture in it helped in setting up the VPC and subnets.

This works for me. Hope it works for you as well.

Regarding passing the input and output files, use S3 locations; I don't remember using local files. But application jars and other dependent jars can be passed from the local machine (file:///).
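For illustration, a hedged spark-submit sketch along those lines (jar names, paths, and application arguments are made up; the Lambda-related confs mirror the ones used earlier in this thread):

```bash
# Hypothetical job submission: S3 locations for data, file:/// for the application jars.
bin/spark-submit \
  --master lambda:// \
  --conf spark.lambda.function.name=spark-lambda \
  --conf spark.lambda.s3.bucket=s3://<my-bucket>/lambda \
  --conf spark.lambda.spark.software.version=149 \
  --conf spark.shuffle.s3.bucket=s3://<my-bucket> \
  --jars file:///home/ec2-user/libs/extra-dep.jar \
  file:///home/ec2-user/my-spark-app.jar \
  s3://<my-bucket>/input/ s3://<my-bucket>/output/
```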

Can you please try the above steps to set up both your EC2 instance and the Lambda function? Please let us know if you get stuck somewhere; happy to help!