Python example file's data location does not meet Lambda's expectation

I am using the Python example python/ml/kmeans_example.py. This file has a hard-coded path 'data/mllib/sample_kmeans_data.txt'.

Now when I run ./bin/spark-submit --master lambda://test examples/src/main/python/ml/kmeans_example.py under the driver folder, Spark's log shows java.io.FileNotFoundException: File file:/home/ec2-user/driver/data/mllib/sample_kmeans_data.txt does not exist.

I was told that data file location string needs to be consistent between Lambda and Spark. Your Lambda code expects data file to be somewhere under /tmp/lambda, I looked at what actually was under /tmp/lambda. There was a spark folder. So my work-around was to create a temporary /tmp/lambda/spark/data/mllib/ under my EC2, move my data file there, and then point to that file in spark.read. Specifically I changed line 42 to

    import os
    data_folder = '/home/ec2-user/driver/data/mllib'
    lambda_folder = '/tmp/lambda/spark/data/mllib'
    filename = 'sample_kmeans_data.txt'
    os.system('mkdir -p ' + lambda_folder)
    os.system('cp {}/{} {}/{}'.format(data_folder, filename, lambda_folder, filename))
    dataset = spark.read.format("libsvm").load('{}/{}'.format(lambda_folder, filename))

And then it worked fine.

I suppose that part or many Python files has this problem, so it can be a barrier for python users.

qubole / spark-on-lambda

Python example file's data location does not meet Lambda's expectation #7