intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
6.6k stars 1.26k forks source link

SparkXShards of Pandas DataFrame with ndarray column can't be converted to SparkDataFrame with arrow #7498

Open hkvision opened 1 year ago

hkvision commented 1 year ago
import pandas as pd
def gen_pdf(x):
    # test_list = [['a','b'], ['AA','BB'], ["cc", "dd"]]
    # pdf = pd.DataFrame(test_list, columns=['col_A', 'col_B'])
    # fail
    test_list = [['a','b',np.random.rand(2, 3, 4)], ['AA','BB',np.random.rand(2, 3, 4)], ["cc", "dd", np.random.rand(2, 3, 4)]]
    # can work 
    # test_list = [['a','b',np.random.rand(2, 3, 4).tolist()], ['AA','BB',np.random.rand(2, 3, 4).tolist()], ["cc", "dd", np.random.rand(2, 3, 4).tolist()]]
    pdf = pd.DataFrame(test_list, columns=['col_A', 'col_B', 'col_C'])
    return pdf

rdd = sc.range(5, numSlices=5).map(gen_pdf)
xshards = SparkXShards(rdd)
df = xshards.to_spark_df()

gets the error

  File "/home/kai/BigDL/python/orca/src/bigdl/orca/data/shard.py", line 930, in <lambda>
    return self.rdd.map(lambda x: func(x)).first()
  File "/home/kai/BigDL/python/orca/src/bigdl/orca/data/shard.py", line 920, in func
    arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1492, in pyarrow.lib.Schema.from_pandas
  File "/home/kai/anaconda3/envs/py37-tf2/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 534, in dataframe_to_types
    type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values

converting numy.tolist() can succeed.

@sgwhat

hkvision commented 1 year ago

When converting SparkXShards of pandas df to Spark df, need to convert numpy columns (if any) to list. For better performance, we can refer to this PR: https://github.com/intel-analytics/BigDL/pull/5952 to use arrow for the conversion.