mahmoudparsian / data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book
http://mapreduce4hackers.com

Issue submitting Spark Job from code when Spark Job is a Python program. #4

Open sehunley opened 8 years ago

sehunley commented 8 years ago

When submitting a Java or Scala program, everything works fine. When submitting a Python program, it gets to the ACCEPTED state and then stalls; it eventually times out without ever being picked up to run. Is this interface just for Java/Scala programs/jobs, or should it be able to submit PySpark/Python jobs as well?

I am trying to invoke the pi.py sample program that comes with Spark 1.6.0.

Below is the java program that I am testing with. I'm new to Spark so apologies for any "newbie" errors.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
// import org.apache.log4j.Logger;

/**
```

Thanks,

-Scott
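For a Python application, driving `org.apache.spark.deploy.yarn.Client` directly is usually not enough, because `spark-submit` performs extra Python-specific setup before handing the job to YARN; Spark 1.4+ also ships `org.apache.spark.launcher.SparkLauncher` as a supported programmatic submitter. Below is a minimal sketch of the shell-out route that builds the equivalent `spark-submit` invocation. The availability of `spark-submit` on the PATH and the example application path are assumptions, not part of the original report:

```python
import subprocess  # only needed if you uncomment the launch line below

def build_submit_command(app_path, partitions, master="yarn", deploy_mode="cluster"):
    """Assemble the spark-submit argument list for a Python application.

    For PySpark jobs the primary resource is the .py file itself;
    spark-submit takes care of the Python-side setup on the cluster.
    """
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        app_path,           # the .py file is the primary resource
        str(partitions),    # application argument: number of partitions
    ]

# hypothetical path to the Spark 1.6 example program
cmd = build_submit_command("/opt/spark/examples/src/main/python/pi.py", 10)
# subprocess.call(cmd)  # uncomment to actually launch the job
```

This sidesteps the YARN `Client` API entirely, which is the part that works differently for Java/Scala jars and Python scripts.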

mahmoudparsian commented 8 years ago

Hello Scott,

Thank you very much for your email and introduction. I am not an expert in Python, but I would like to take a look at your Python program. Please share it so that I can review it.

Thank you again! Best regards, Mahmoud

On Feb 15, 2016, at 11:23 AM, sehunley notifications@github.com wrote:

> When submitting a Java or Scala program, everything works fine. When submitting a Python program, it gets to the ACCEPTED state and then stalls; it eventually times out without ever being picked up to run. Is this interface just for Java/Scala programs/jobs, or should it be able to submit PySpark/Python jobs as well?
>
> Thanks,
>
> -Scott

sehunley commented 8 years ago

Hello Mahmoud, my Python program is pretty simple: it is a sample program that ships with Spark 1.6.0. Our data scientists write their machine learning scripts in Python, hence the need to invoke Python programmatically. Here is the program:

```python
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
```
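Spark aside, the sampling logic in `f` can be checked locally with plain Python, which helps separate "is the program correct" from "is the submission path correct". A small Spark-free sketch of the same Monte Carlo estimate (the fixed seed is only there so a local run is reproducible, and is not part of the original program):

```python
from random import random, seed
from operator import add
from functools import reduce

def inside_unit_circle(_):
    # Draw a point uniformly in the square [-1, 1) x [-1, 1)
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

seed(42)                    # fixed seed for a reproducible local check
n = 100000
count = reduce(add, map(inside_unit_circle, range(n)))
estimate = 4.0 * count / n  # ratio of hits to samples, scaled by the square's area
print("Pi is roughly %f" % estimate)
```

This mirrors the `parallelize(...).map(f).reduce(add)` pipeline with the stdlib `map` and `functools.reduce`, so a failure here points at the arithmetic rather than at YARN.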

Thanks in advance for any help that you can give. Here is a link to the actual output when I run the code and try to submit the program above: http://stackoverflow.com/questions/35373367/issue-submitting-a-python-application-to-yarn-from-java-code

-Scott