intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

How to use Analytics-zoo when SparkSession is automatically instantiated #61

Open guidiandrea opened 2 years ago

guidiandrea commented 2 years ago

Hello,

I am trying to use Analytics-zoo in a Hadoop/YARN environment via JupyterHub/Lab where the SparkSession and context are automatically instantiated when the first cell is run.

How can I init the environment in this case?

jason-dai commented 2 years ago

We are working on the explicit support for this scenario in https://github.com/intel-analytics/analytics-zoo/pull/4339; for now, you may do something like:

init_orca_context(cluster_mode="spark-submit", ...)

See https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/orca/common.py#L161
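A minimal sketch of that suggestion, assuming analytics-zoo is importable in the notebook kernel and that the session launcher (Livy/sparkmagic in this thread) has already supplied the Spark configuration. The wrapper name `init_with_existing_session` is hypothetical; only `init_orca_context` and its `cluster_mode` argument come from the comment above.

```python
def init_with_existing_session():
    """Initialize Orca without letting it launch Spark itself."""
    # Imported lazily so this sketch can be loaded even where
    # analytics-zoo is not installed.
    from zoo.orca import init_orca_context

    # "spark-submit" mode leaves cluster setup to whatever started the
    # application; Orca is expected to pick up that configuration.
    sc = init_orca_context(cluster_mode="spark-submit")
    return sc
```

Calling `init_with_existing_session()` in the first notebook cell should return a SparkContext. If the Analytics Zoo jar is not on the JVM classpath, this call is likely where a 'JavaPackage object is not callable' error would surface.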

hkvision commented 2 years ago

We are working on this and expect to finish it very soon.

guidiandrea commented 2 years ago

Hello @hkvision @jason-dai

I tried what you suggested, but I'm getting a 'JavaPackage object is not callable' error:

[screenshot: bigdl java package]

What might be causing it?

Thanks

hkvision commented 2 years ago

Hi @guidiandrea

Since you already have a SparkSession, you need to manually upload the Analytics Zoo jar before the SparkSession is initialized. You can refer to our guide for Databricks and do something similar in your environment: https://analytics-zoo.readthedocs.io/en/latest/doc/UserGuide/databricks.html#installing-analytics-zoo-libraries

More specifically, see the following paragraph on that page:

Install Analytics Zoo python environment using prebuilt release Wheel package. Click Libraries > Install New > Upload > Python Whl. Download Analytics Zoo prebuilt Wheel here. Choose a wheel with timestamp for the same Spark version and platform as Databricks runtime. Download and drop it on Databricks.

Feel free to tell us if you encounter further issues :)

guidiandrea commented 2 years ago

Hello @hkvision, thanks for your reply.

I already installed analytics-zoo using pip in the virtual env that I'm shipping to my YARN application, and I'm loading BigDL through jars because I have Spark 2.3, so I can't install BigDL using pip (it would automatically pull in PySpark 2.4.6).

Should I build everything from source in order to avoid collisions? The Linux environment is CentOS-like.

hkvision commented 2 years ago

But pip install analytics-zoo will also install bigdl and pyspark 2.4.6, so how did you manage to pip install only analytics-zoo? If you are using Spark 2.3, I suppose you need to use spark-submit and specify the jars on the spark-submit command line (you don't need to modify anything in the init_orca_context code :)

guidiandrea commented 2 years ago

Yep, I needed to modify the dependencies, as Analytics Zoo's prebuilt wheel for Spark 2.3 was trying to install pyspark 2.4, but of course that makes no sense xP

hkvision commented 2 years ago

We have released some whls for Spark 2.3: https://sourceforge.net/projects/analytics-zoo/files/zoo-py/. Maybe you can give them a try? cc @Le-Zheng. Or you can use spark-submit directly to play it safe :)

guidiandrea commented 2 years ago

I followed the guide and downloaded the correct version of the prebuilt jar with dependencies, but now I'm getting the following error:

[screenshot: MicrosoftTeams-image (2)]

What might be causing it? Thanks

hkvision commented 2 years ago

It seems to be an issue with the cluster 0_0? Is it caused by adding the analytics-zoo jar? If so, can you provide more details (for example, the command you use to submit the jar)?

guidiandrea commented 2 years ago

@hkvision Yep, I have this problem when adding the analytics-zoo JAR into sparkmagic configurations (we run Jupyter with JupyterHub, so when we launch the first cell a yarn application is started and named as LivySession).

I added these configurations to the sparkmagic/config.json file:

"spark.driver.extraClassPath":"/analytics-zoo-bigdl_0.13.0-spark_2.3.1-0.12.0-20210908.203333-39-jar-with-dependencies.jar",
"spark.executor.extraClassPath":"/analytics-zoo-bigdl_0.13.0-spark_2.3.1-0.12.0-20210908.203333-39-jar-with-dependencies.jar",
"spark.jars":"/analytics-zoo-bigdl_0.13.0-spark_2.3.1-0.12.0-20210908.203333-39-jar-with-dependencies.jar",

I was able to use BigDL without Analytics-Zoo using these settings, so they're pretty much correct.
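For reference, the three settings above can be generated from a single jar path so they cannot drift apart. This is only a sketch: the helper name `zoo_jar_conf` is hypothetical, and where exactly these keys sit inside sparkmagic's config.json is not shown in this thread.

```python
import json


def zoo_jar_conf(jar_path):
    """Return the three Spark settings from this comment for one jar path."""
    return {
        "spark.driver.extraClassPath": jar_path,
        "spark.executor.extraClassPath": jar_path,
        "spark.jars": jar_path,
    }


if __name__ == "__main__":
    # Jar path copied as-is from the configuration quoted above.
    jar = "/analytics-zoo-bigdl_0.13.0-spark_2.3.1-0.12.0-20210908.203333-39-jar-with-dependencies.jar"
    print(json.dumps(zoo_jar_conf(jar), indent=2))
```

Building the fragment this way makes a typo in only one of the three values less likely, which matters because a mismatch between them would be hard to spot by eye.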

helenlly commented 2 years ago

@qiuxin2012 @Le-Zheng any comments?

qiuxin2012 commented 2 years ago

@guidiandrea Could you provide some information about your environment?

  1. Where and how is your Spark 2.3 installed?
  2. Do you have a conda environment, and is your Jupyter using the Python from conda?
  3. You mentioned Livy; is your Spark context created by Livy? Can you use a pure Python notebook?

qiuxin2012 commented 2 years ago

Could you check the size of your analytics-zoo-bigdl_0.13.0-spark_2.3.1-0.12.0-20210908.203333-39-jar-with-dependencies.jar: is it 217 MB or 403 MB? Also make sure the spark.jars value you set is correct.
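One way to run this check from Python is sketched below. The function name `inspect_jar` is hypothetical, and the class prefix `com/intel/analytics/zoo/` is an assumption about where the Analytics Zoo classes live inside the jar; adjust it if your build differs.

```python
import os
import zipfile


def inspect_jar(path, class_prefix="com/intel/analytics/zoo/"):
    """Return (size in MB, number of jar entries under class_prefix)."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    # A jar is a zip archive, so the standard zipfile module can list it.
    with zipfile.ZipFile(path) as jar:
        zoo_entries = sum(
            1 for name in jar.namelist() if name.startswith(class_prefix)
        )
    return size_mb, zoo_entries
```

Run it against the same jar path that is used in the spark.jars setting. A size far from what you expect, or zero entries under the prefix, would suggest that the wrong file was downloaded.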

guidiandrea commented 2 years ago

Hello, the size of the file is 403 MB. What do you mean by 'check that spark.jars is correct'?

qiuxin2012 commented 2 years ago

Open the Environment page in the Spark driver's web UI (mine is http://xin-dev.sh.intel.com:4040/environment/). Can you see the analytics-zoo jar under Classpath Entries?

qiuxin2012 commented 2 years ago

@guidiandrea I tried running init_orca_context with the latest BigDL when the SparkSession is instantiated by pyspark, and I got 'JavaPackage object is not callable' when --jars (your spark.jars) is not provided. See issue https://github.com/intel-analytics/BigDL/issues/3351 for more details.

In your error message I found a useful warning: the analytics-zoo jar is skipped.

[screenshot: warning from the error message]

So please open the Environment page in the Spark driver's web UI and check whether the analytics-zoo jar is listed under Classpath Entries.
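If the web UI is hard to reach from a notebook, a rough equivalent can be sketched in code. The helper `jar_settings` is hypothetical; it only filters the (key, value) pairs that PySpark's `SparkConf.getAll()` returns. Note that it reports what was configured, not whether the JVM actually loaded the jar, so the Classpath Entries table remains the more reliable check.

```python
def jar_settings(conf_pairs, jar_name):
    """Return the Spark settings whose value mentions jar_name."""
    return {key: value for key, value in conf_pairs if jar_name in value}


# Inside the notebook, assuming `spark` is the pre-created SparkSession:
#   jar_settings(spark.sparkContext.getConf().getAll(), "analytics-zoo")
```

An empty result would mean that none of the settings reached the driver. A non-empty result combined with the "skipped" warning would point to the jar path not being resolvable from where the driver runs.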