sourav-mazumder / Data-Science-Extensions


Unable to spark-submit jar file on dockerized spark #7

Open · adrien19 opened this issue 5 years ago

adrien19 commented 5 years ago

Hi,

I have tried many different ways to use spark-datasource-rest in my application, but none of them works. I am using a Docker image running jupyter/pyspark-notebook.

First, I used `$SPARK_HOME/bin/spark-shell --jars spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar --packages org.scalaj:scalaj-http_2.10:2.3.0`, but with this I can't use the Spark extension in the notebook. Below is the error I am receiving:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-3-438581f3ccce> in <module>
     18 # Now we create the Dataframe which contains the result from the call to the Soda API for the 3 different input data points
     19 
---> 20 sodasDf = spark.read.format("org.apache.dsext.spark.datasource.rest.RestDataSource").options(**prmsSoda).load()
     21 
     22 # We inspect the structure of the results returned. For Soda data source it would return the result in array.

.......
Py4JJavaError: An error occurred while calling o44.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.dsext.spark.datasource.rest.RestDataSource. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.dsext.spark.datasource.rest.RestDataSource.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
    ... 13 more
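
For what it's worth, the `ClassNotFoundException` suggests the jar never made it onto the classpath of the JVM backing the notebook's SparkSession (flags passed to a separately launched spark-shell don't affect the notebook's own session). Note also that the jar name says Scala 2.11 while the `--packages` coordinate pulls `scalaj-http_2.10`; mixing Scala versions can cause its own failures. A minimal sketch, assuming a hypothetical jar path inside the container, of pointing PySpark at the jar via `PYSPARK_SUBMIT_ARGS` before the kernel starts:

```shell
# Hypothetical jar location inside the jupyter/pyspark-notebook container;
# adjust to wherever the jar actually lives.
JAR=/home/jovyan/spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar

# Ask PySpark to load the jar (and the matching Scala 2.11 scalaj-http)
# when the notebook kernel creates its SparkSession.
export PYSPARK_SUBMIT_ARGS="--jars $JAR --packages org.scalaj:scalaj-http_2.11:2.3.0 pyspark-shell"
```

The variable has to be set before the notebook server (and hence the kernel's SparkSession) starts, e.g. with an `ENV` line in the Dockerfile.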

Second, I tried spark-submit by running `$SPARK_HOME/bin/spark-submit spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar`. However, I get an error indicating that no main class was specified.

Executing transaction: ...working... done
Exception in thread "main" org.apache.spark.SparkException: No main class set in JAR; please specify one with --class
    at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
    at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:266)
    at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:251)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
    at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ERROR: Service 'jupyter-notebook' failed to build: The command '/bin/sh -c conda install --quiet --yes     'pandas'     'pandas-gbq' --channel conda-forge     && $SPARK_HOME/bin/spark-submit     spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar  returned a non-zero code: 1
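
As the second error says, spark-submit expects an application jar with a main class, and spark-datasource-rest is a library, so there is nothing to submit. One alternative, sketched below with hypothetical paths, is to copy the jar into Spark's `jars/` directory during the image build so every session picks it up automatically:

```shell
# The data-source jar is a library, not a runnable application, so there
# is no main class for spark-submit to invoke. Instead, place it on the
# classpath Spark always scans (hypothetical paths, for a Dockerfile RUN step):
JAR=spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar
DEST="${SPARK_HOME:-/usr/local/spark}/jars"
if [ -f "$JAR" ]; then
  cp "$JAR" "$DEST/"
fi
```

With the jar in `$SPARK_HOME/jars`, no `--jars` flag is needed at session start, though the scalaj-http dependency still has to be supplied (e.g. copied alongside it or pulled via `--packages`).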

How can I include this spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar so it works with Spark running in a Docker container?

Thanks!

vasquezk26 commented 4 years ago

@adrien19 did you ever find a solution to this? I ran into the same problem.

@sourav-mazumder can you help assist with this, please?