I am using the Python example python/ml/kmeans_example.py. This file has a hard-coded path 'data/mllib/sample_kmeans_data.txt'.
Now when I run ./bin/spark-submit --master lambda://test examples/src/main/python/ml/kmeans_example.py from the driver folder, Spark's log shows java.io.FileNotFoundException: File file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt does not exist.
I was told that the data file location string needs to be consistent between Lambda and Spark. Your Lambda code expects the data file to be somewhere under /tmp/lambda, so I looked at what was actually under /tmp/lambda and found a spark folder there. My work-around was to create a temporary /tmp/lambda/spark/data/mllib/ directory on my EC2 instance, move the data file there, and then point spark.read to that file. Specifically, I changed line 42 to load from that path, and then it worked fine.
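For reference, the change looks roughly like this (a minimal sketch rather than my exact diff; I'm assuming the stock kmeans_example.py loads the file with spark.read.format("libsvm"), so only the path argument changes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Original relative path, resolved against the driver's working directory:
#   dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Work-around: load from the location that matches what the Lambda side expects.
dataset = spark.read.format("libsvm").load(
    "/tmp/lambda/spark/data/mllib/sample_kmeans_data.txt")

spark.stop()
```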
I suspect that many of the Python example files have this problem, so it can be a barrier for Python users.