intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Please explain BigDL internals in relation to pySpark ['JavaPackage' object is not callable] #3123

Closed Adamage closed 3 years ago

Adamage commented 3 years ago

Hello everyone.

I am struggling to correctly configure BigDL on our Hadoop/Spark setup. Normally we use SparkML + Livy2 in a Jupyter Notebook to request drivers, executors, etc. from YARN.

As I understand it, when I am already inside a PySpark container, PySpark is already loaded; should I actually be using the one inside the BigDL "home" library directory?

Some more details of what I am doing:

Some conflicts are reported when I import the libraries in Jupyter cells, for example:

pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/zoo/util/engine.py:42: UserWarning: Find both SPARK_HOME and pyspark. You may need to check whether they match with each other. SPARK_HOME environment variable is set to: ., and pyspark is found in: /cdh/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4460.8174152/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py. If they are unmatched, you are recommended to use one source only to avoid conflict. For example, you can unset SPARK_HOME and use pyspark only.
warnings.warn(warning_msg)

The Livy setup is such that I provide a zipped venv stored in HDFS via --archives. The zipped venv contains bigdl, pyspark, and analytics-zoo.

The result is the popular error: 'JavaPackage' object is not callable.

'JavaPackage' object is not callable
Traceback (most recent call last):
  File "pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/bigdl/nn/layer.py", line 1241, in __init__
    super(Sequential, self).__init__(jvalue, bigdl_type)
  File "pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/bigdl/nn/layer.py", line 686, in __init__
    super(Container, self).__init__(jvalue, bigdl_type, *args)
  File "pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/bigdl/nn/layer.py", line 130, in __init__
    bigdl_type, self.jvm_class_constructor(), *args)
  File "pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/bigdl/util/common.py", line 592, in callBigDlFunc
    for jinvoker in JavaCreator.instance(bigdl_type, gateway).value:
  File "pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/bigdl/util/common.py", line 56, in instance
    cls._instance = cls(bigdl_type, *args)
  File "pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/lib/python3.7/site-packages/bigdl/util/common.py", line 96, in __init__
    self.value.append(getattr(jclass, "ofFloat")())
TypeError: 'JavaPackage' object is not callable

What am I doing wrong?

cyita commented 3 years ago

You can try exporting PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON, and passing the executor Python home via --conf spark.executorEnv.PYTHONHOME=...
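A sketch of what that might look like for a YARN submission, assuming the venv layout from the traceback above; every path here is illustrative and has to be adapted to where your archive actually lives and unpacks:

```
# Illustrative paths: the driver uses a local copy of the venv, while the
# executors resolve a relative path inside the zip unpacked from --archives.
export PYSPARK_DRIVER_PYTHON=/opt/pyenv-3.7.10-v6/bin/python
export PYSPARK_PYTHON=./pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6/bin/python

spark-submit \
  --master yarn \
  --archives hdfs:///path/to/pyenv-3.7.10-v6.zip \
  --conf spark.executorEnv.PYTHONHOME=./pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6 \
  your_script.py
```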

Adamage commented 3 years ago

@cyita Hello. But is that the issue? I am using the correct Python from my zipped environment: it has all the libraries and lets me import bigdl, zoo, tensorflow, keras, and pyspark; it's all there. (I have already set the env vars you mentioned.) But when I try to run init_engine(), it hits some kind of problem with JavaPackage.

Has this problem been identified? Is JavaCreator.instance(bigdl_type, gateway) looking for some Java classes that might not be visible on my CLASSPATH?
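For reference, the typical BigDL bootstrap looks roughly like the sketch below (a sketch, assuming the 0.x bigdl.util.common API; the app name is illustrative). Under Livy the SparkContext is created for you, so the BigDL conf and jar have to arrive through the session configuration instead:

```python
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine

# BigDL's settings must be on the SparkConf before the context exists,
# and init_engine() must run before any layer (e.g. Sequential) is built.
sc = SparkContext(appName="bigdl-smoke-test", conf=create_spark_conf())
init_engine()
```

If the BigDL classes are missing from the JVM classpath at this point, py4j resolves the class name to a plain JavaPackage, and calling it raises exactly the TypeError shown above.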

Adamage commented 3 years ago

OK, this is for later generations, whoever stumbles upon this JavaPackage thing. There is an easy solution: just make sure the BigDL uber jar with dependencies is visible to Spark.

The spark.jars property needs to have this jar appended at the end (see the sketch below).

I was able to use IBM Watson, Livy2, Spark, and BigDL: Spark launched executors and ran BigDL estimators.
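For anyone doing this from a sparkmagic notebook over Livy, the session config could be sketched as below; the jar name and HDFS paths are hypothetical and must match the BigDL uber jar you actually ship:

```
%%configure -f
{
  "archives": ["hdfs:///path/to/pyenv-3.7.10-v6.zip"],
  "conf": {
    "spark.jars": "hdfs:///path/to/bigdl-SPARK_2.4-0.13.0-jar-with-dependencies.jar",
    "spark.executorEnv.PYTHONHOME": "./pyenv-3.7.10-v6.zip/3.7.10/envs/pyenv-3.7.10-v6"
  }
}
```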

qiuxin2012 commented 3 years ago

Great work. Yes, we need the BigDL jar in spark.jars. https://bigdl-project.github.io/master/#PythonUserGuide/run-without-pip/#run-with-virtual-environment-in-yarn may help. 'JavaPackage' object is not callable occurs because Python couldn't find the Java class.
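For a plain spark-submit, a sketch along the lines of that guide (jar and script names are hypothetical); the extraClassPath entries put the jar on the driver and executor JVM classpaths:

```
# Hypothetical jar name and paths: the driver needs a locally visible copy,
# while executors pick up the jar that --jars distributes into their
# container working directory.
spark-submit \
  --master yarn \
  --jars /local/path/bigdl-SPARK_2.4-0.13.0-jar-with-dependencies.jar \
  --conf spark.driver.extraClassPath=/local/path/bigdl-SPARK_2.4-0.13.0-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=bigdl-SPARK_2.4-0.13.0-jar-with-dependencies.jar \
  your_script.py
```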

Ishitori commented 1 year ago

@Adamage, could you share your Livy setup for launching a BigDL job? I have trouble launching even the simplest example from an AWS EMR Notebook (https://github.com/intel-analytics/BigDL/issues/7764), and so far it fails. Maybe I can reuse your Livy and/or notebook configuration?