palantir / spark

Palantir Distribution of Apache Spark (Apache License 2.0)

[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import #745

Closed: johnhany97 closed this 3 years ago

johnhany97 commented 3 years ago

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

SPARK-34803 https://github.com/apache/spark/pull/31902

What changes were proposed in this pull request?

Pass along (via exception chaining) the raised ImportError when pandas or pyarrow fails to import. This helps the user identify whether pandas/pyarrow is actually missing from the environment or whether the import itself failed with a different ImportError.
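
A minimal sketch of the chaining pattern, approximating the patched check in pyspark/sql/pandas/utils.py (the exact message and version handling may differ from the real code):

def require_minimum_pyarrow_version():
    # Illustrative sketch, not the verbatim patch.
    minimum_pyarrow_version = "1.0.0"
    try:
        import pyarrow  # noqa: F401
        have_arrow = True
    except ImportError as error:
        have_arrow = False
        raised_error = error
    if not have_arrow:
        # "from raised_error" chains the original exception, so the
        # traceback shows the real root cause instead of only the
        # "not found" wrapper message.
        raise ImportError("PyArrow >= %s must be installed; however, "
                          "it was not found." % minimum_pyarrow_version) from raised_error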

Why are the changes needed?

This can already happen with pandas, for example: it can throw an ImportError on its initialisation path if dateutil does not satisfy a certain version requirement (https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438).
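
In other words, import pandas can raise an ImportError even when pandas itself is installed. A hypothetical illustration of such an initialisation-time check (not the actual pandas code):

# Hypothetical dependency check a package might run at import time.
from distutils.version import LooseVersion

import dateutil

if LooseVersion(dateutil.__version__) < LooseVersion("2.5.0"):
    raise ImportError("this package requires dateutil >= 2.5.0, found %s"
                      % dateutil.__version__)

Without chaining, a caller that catches this and re-raises a generic "pandas was not found" ImportError would hide the real problem.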

Does this PR introduce any user-facing change?

Yes. The traceback will now show the root cause of the exception when pandas or PyArrow fails to import.

How was this patch tested?

Manually tested.

from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x, "int")("id")).show()

Before:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.

After:

Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
    raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
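
The "direct cause" banner above comes from Python's explicit exception chaining: raise ... from err sets __cause__ on the new exception, and the interpreter then prints both tracebacks. A minimal standalone demonstration, independent of Spark:

try:
    import _no_such_module_  # guaranteed to fail
except ImportError as err:
    # Chaining with "from err" makes the interpreter print the original
    # ModuleNotFoundError first, then the "direct cause" banner, then
    # this wrapper exception.
    raise ImportError("wrapper: dependency could not be imported") from err
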
rshkv commented 3 years ago

Thank you for this. And thank you for contributing upstream.