prefix-dev / pixi

Package management made easy
https://pixi.sh
BSD 3-Clause "New" or "Revised" License

tests fail in pycharm on spark tests when importing an external lib #1869

Open lev112 opened 3 weeks ago

lev112 commented 3 weeks ago

Reproducible example

full example here: https://github.com/lev112/pycharm-pixi-spark-test

pixi.toml:

[project]
channels = ["conda-forge"]
description = "Add a short description here"
name = "pixi_example"
platforms = ["osx-arm64"]
version = "0.1.0"

[dependencies]
python = ">=3.10, <3.11"
pixi-pycharm = "*"

[pypi-dependencies]
pyspark = "==3.4.3"
pytest = "*"
xxhash = "*"

tests/test_spark.py:

import xxhash
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_xxhash(x: int) -> str:
    hashed = str(xxhash.xxh64_intdigest(str(x)))

    return hashed

def test_spark():
    spark = (
        SparkSession.builder.appName("spark test")
        .master("local[1]")
        .getOrCreate()
    )

    my_udf = udf(to_xxhash, returnType=StringType())

    spark.range(1,5).withColumn("id", my_udf("id")).show()
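When a worker-side import fails like the one reported below, it helps to confirm which interpreter and module search path are actually in play. The following is a hypothetical diagnostic helper, not part of the reproducer; `env_report` and the probed module names are illustrative:

```python
import importlib.util
import sys


def env_report(module_name: str) -> dict:
    """Report the running interpreter and whether a module is importable."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,   # which Python this process runs under
        "module": module_name,
        "importable": spec is not None,
        "origin": spec.origin if spec else None,  # file the module resolves to
    }
```

Calling this once on the driver and once inside a UDF would show whether Spark's workers picked up a different interpreter than the pixi environment's Python.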

pixi.lock:

version: 5
environments:
  default:
    channels:
    - url: https://conda.anaconda.org/conda-forge/
    indexes:
    - https://pypi.org/simple
    packages:
      osx-arm64:
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/bzip2-1.0.8-h99b78c6_7.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/ca-certificates-2024.7.4-hf0a4a13_0.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/libffi-3.4.2-h3422bc3_5.tar.bz2
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/libsqlite-3.46.0-hfb93653_0.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/libzlib-1.3.1-hfb2fe0b_1.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/ncurses-6.5-hb89a1cb_0.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/openssl-3.3.1-hfb2fe0b_2.conda
      - conda: https://conda.anaconda.org/conda-forge/noarch/pixi-pycharm-0.0.6-unix_1234567_0.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/python-3.10.14-h2469fbe_0_cpython.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/readline-8.2-h92ec313_1.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/tk-8.6.13-h5083fa2_1.conda
      - conda: https://conda.anaconda.org/conda-forge/noarch/tzdata-2024a-h0c530f3_0.conda
      - conda: https://conda.anaconda.org/conda-forge/osx-arm64/xz-5.2.6-h57fd34a_0.tar.bz2
      - pypi: https://files.pythonhosted.org/packages/02/cc/b7e31358aac6ed1ef2bb790a9746ac2c69bcb3c8588b41616914eb106eaf/exceptiongroup-1.2.2-py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/ef/a6/62565a6e1cf69e10f5727360368e451d4b7f58beeac6173dc9db836a5b46/iniconfig-2.0.0-py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/08/aa/cc0199a5f0ad350994d660967a8efb233fe0416e4639146c089643407ce6/packaging-24.1-py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/88/5f/e351af9a41f866ac3f1fac4ca0613908d9a41741cfcf2228f4ad853b697d/pluggy-1.5.0-py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/10/30/a58b32568f1623aaad7db22aa9eafc4c6c194b429ff35bdc55ca2726da47/py4j-0.10.9.7-py2.py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/6d/fe/3d8f6190536c4d3ed24540872c00f13ab9beb27b78dbae1703b5368838d4/pyspark-3.4.3.tar.gz
      - pypi: https://files.pythonhosted.org/packages/0f/f9/cf155cf32ca7d6fa3601bc4c5dd19086af4b320b706919d48a4c79081cf9/pytest-8.3.2-py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/97/75/10a9ebee3fd790d20926a90a2547f0bf78f371b2f13aa822c759680ca7b9/tomli-2.0.1-py3-none-any.whl
      - pypi: https://files.pythonhosted.org/packages/16/e6/be5aa49580cd064a18200ab78e29b88b1127e1a8c7955eb8ecf81f2626eb/xxhash-3.5.0-cp310-cp310-macosx_11_0_arm64.whl
packages:
- kind: conda
  name: bzip2
  version: 1.0.8
  build: h99b78c6_7
  build_number: 7
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/bzip2-1.0.8-h99b78c6_7.conda
  sha256: adfa71f158cbd872a36394c56c3568e6034aa55c623634b37a4836bd036e6b91
  md5: fc6948412dbbbe9a4c9ddbbcfe0a79ab
  depends:
  - __osx >=11.0
  license: bzip2-1.0.6
  license_family: BSD
  purls: []
  size: 122909
  timestamp: 1720974522888
- kind: conda
  name: ca-certificates
  version: 2024.7.4
  build: hf0a4a13_0
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/ca-certificates-2024.7.4-hf0a4a13_0.conda
  sha256: 33a61116dae7f369b6ce92a7f2a1ff361ae737c675a493b11feb5570b89e0e3b
  md5: 21f9a33e5fe996189e470c19c5354dbe
  license: ISC
  purls: []
  size: 154517
  timestamp: 1720077468981
- kind: pypi
  name: exceptiongroup
  version: 1.2.2
  url: https://files.pythonhosted.org/packages/02/cc/b7e31358aac6ed1ef2bb790a9746ac2c69bcb3c8588b41616914eb106eaf/exceptiongroup-1.2.2-py3-none-any.whl
  sha256: 3111b9d131c238bec2f8f516e123e14ba243563fb135d3fe885990585aa7795b
  requires_dist:
  - pytest>=6 ; extra == 'test'
  requires_python: '>=3.7'
- kind: pypi
  name: iniconfig
  version: 2.0.0
  url: https://files.pythonhosted.org/packages/ef/a6/62565a6e1cf69e10f5727360368e451d4b7f58beeac6173dc9db836a5b46/iniconfig-2.0.0-py3-none-any.whl
  sha256: b6a85871a79d2e3b22d2d1b94ac2824226a63c6b741c88f7ae975f18b6778374
  requires_python: '>=3.7'
- kind: conda
  name: libffi
  version: 3.4.2
  build: h3422bc3_5
  build_number: 5
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/libffi-3.4.2-h3422bc3_5.tar.bz2
  sha256: 41b3d13efb775e340e4dba549ab5c029611ea6918703096b2eaa9c015c0750ca
  md5: 086914b672be056eb70fd4285b6783b6
  license: MIT
  license_family: MIT
  purls: []
  size: 39020
  timestamp: 1636488587153
- kind: conda
  name: libsqlite
  version: 3.46.0
  build: hfb93653_0
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/libsqlite-3.46.0-hfb93653_0.conda
  sha256: 73048f9cb8647d3d3bfe6021c0b7d663e12cffbe9b4f31bd081e713b0a9ad8f9
  md5: 12300188028c9bc02da965128b91b517
  depends:
  - __osx >=11.0
  - libzlib >=1.2.13,<2.0a0
  license: Unlicense
  purls: []
  size: 830198
  timestamp: 1718050644825
- kind: conda
  name: libzlib
  version: 1.3.1
  build: hfb2fe0b_1
  build_number: 1
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/libzlib-1.3.1-hfb2fe0b_1.conda
  sha256: c34365dd37b0eab27b9693af32a1f7f284955517c2cc91f1b88a7ef4738ff03e
  md5: 636077128927cf79fd933276dc3aed47
  depends:
  - __osx >=11.0
  constrains:
  - zlib 1.3.1 *_1
  license: Zlib
  license_family: Other
  purls: []
  size: 46921
  timestamp: 1716874262512
- kind: conda
  name: ncurses
  version: '6.5'
  build: hb89a1cb_0
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/ncurses-6.5-hb89a1cb_0.conda
  sha256: 87d7cf716d9d930dab682cb57b3b8d3a61940b47d6703f3529a155c938a6990a
  md5: b13ad5724ac9ae98b6b4fd87e4500ba4
  license: X11 AND BSD-3-Clause
  purls: []
  size: 795131
  timestamp: 1715194898402
- kind: conda
  name: openssl
  version: 3.3.1
  build: hfb2fe0b_2
  build_number: 2
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/openssl-3.3.1-hfb2fe0b_2.conda
  sha256: dd7d988636f74473ebdfe15e05c5aabdb53a1d2a846c839d62289b0c37f81548
  md5: 9b551a504c1cc8f8b7b22c01814da8ba
  depends:
  - __osx >=11.0
  - ca-certificates
  constrains:
  - pyopenssl >=22.1
  license: Apache-2.0
  license_family: Apache
  purls: []
  size: 2899682
  timestamp: 1721194599446
- kind: pypi
  name: packaging
  version: '24.1'
  url: https://files.pythonhosted.org/packages/08/aa/cc0199a5f0ad350994d660967a8efb233fe0416e4639146c089643407ce6/packaging-24.1-py3-none-any.whl
  sha256: 5b8f2217dbdbd2f7f384c41c628544e6d52f2d0f53c6d0c3ea61aa5d1d7ff124
  requires_python: '>=3.8'
- kind: conda
  name: pixi-pycharm
  version: 0.0.6
  build: unix_1234567_0
  subdir: noarch
  noarch: generic
  url: https://conda.anaconda.org/conda-forge/noarch/pixi-pycharm-0.0.6-unix_1234567_0.conda
  sha256: 156dd789a87ea8b6d7c329efbc202d7f21564d5afabbad9c6ac14017fb47c641
  md5: c140222c6019ef59c509f4740663fa75
  depends:
  - __unix
  - python >=3.8
  license: BSD-3-Clause
  license_family: BSD
  purls: []
  size: 8833
  timestamp: 1719726313982
- kind: pypi
  name: pluggy
  version: 1.5.0
  url: https://files.pythonhosted.org/packages/88/5f/e351af9a41f866ac3f1fac4ca0613908d9a41741cfcf2228f4ad853b697d/pluggy-1.5.0-py3-none-any.whl
  sha256: 44e1ad92c8ca002de6377e165f3e0f1be63266ab4d554740532335b9d75ea669
  requires_dist:
  - pre-commit ; extra == 'dev'
  - tox ; extra == 'dev'
  - pytest ; extra == 'testing'
  - pytest-benchmark ; extra == 'testing'
  requires_python: '>=3.8'
- kind: pypi
  name: py4j
  version: 0.10.9.7
  url: https://files.pythonhosted.org/packages/10/30/a58b32568f1623aaad7db22aa9eafc4c6c194b429ff35bdc55ca2726da47/py4j-0.10.9.7-py2.py3-none-any.whl
  sha256: 85defdfd2b2376eb3abf5ca6474b51ab7e0de341c75a02f46dc9b5976f5a5c1b
- kind: pypi
  name: pyspark
  version: 3.4.3
  url: https://files.pythonhosted.org/packages/6d/fe/3d8f6190536c4d3ed24540872c00f13ab9beb27b78dbae1703b5368838d4/pyspark-3.4.3.tar.gz
  sha256: 8d7025fa274830cb6c3bd592228be3d9345cb3b8b1e324018c2aa6e75f48a208
  requires_dist:
  - py4j==0.10.9.7
  - pandas>=1.0.5 ; extra == 'connect'
  - pyarrow>=1.0.0 ; extra == 'connect'
  - grpcio>=1.48.1 ; extra == 'connect'
  - grpcio-status>=1.48.1 ; extra == 'connect'
  - googleapis-common-protos>=1.56.4 ; extra == 'connect'
  - numpy>=1.15 ; extra == 'connect'
  - numpy>=1.15 ; extra == 'ml'
  - numpy>=1.15 ; extra == 'mllib'
  - pandas>=1.0.5 ; extra == 'pandas-on-spark'
  - pyarrow>=1.0.0 ; extra == 'pandas-on-spark'
  - numpy>=1.15 ; extra == 'pandas-on-spark'
  - pandas>=1.0.5 ; extra == 'sql'
  - pyarrow>=1.0.0 ; extra == 'sql'
  - numpy>=1.15 ; extra == 'sql'
  requires_python: '>=3.7'
- kind: pypi
  name: pytest
  version: 8.3.2
  url: https://files.pythonhosted.org/packages/0f/f9/cf155cf32ca7d6fa3601bc4c5dd19086af4b320b706919d48a4c79081cf9/pytest-8.3.2-py3-none-any.whl
  sha256: 4ba08f9ae7dcf84ded419494d229b48d0903ea6407b030eaec46df5e6a73bba5
  requires_dist:
  - iniconfig
  - packaging
  - pluggy<2,>=1.5
  - exceptiongroup>=1.0.0rc8 ; python_version < '3.11'
  - tomli>=1 ; python_version < '3.11'
  - colorama ; sys_platform == 'win32'
  - argcomplete ; extra == 'dev'
  - attrs>=19.2 ; extra == 'dev'
  - hypothesis>=3.56 ; extra == 'dev'
  - mock ; extra == 'dev'
  - pygments>=2.7.2 ; extra == 'dev'
  - requests ; extra == 'dev'
  - setuptools ; extra == 'dev'
  - xmlschema ; extra == 'dev'
  requires_python: '>=3.8'
- kind: conda
  name: python
  version: 3.10.14
  build: h2469fbe_0_cpython
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/python-3.10.14-h2469fbe_0_cpython.conda
  sha256: 454d609fe25daedce9e886efcbfcadad103ed0362e7cb6d2bcddec90b1ecd3ee
  md5: 4ae999c8227c6d8c7623d32d51d25ea9
  depends:
  - bzip2 >=1.0.8,<2.0a0
  - libffi >=3.4,<4.0a0
  - libsqlite >=3.45.2,<4.0a0
  - libzlib >=1.2.13,<2.0.0a0
  - ncurses >=6.4.20240210,<7.0a0
  - openssl >=3.2.1,<4.0a0
  - readline >=8.2,<9.0a0
  - tk >=8.6.13,<8.7.0a0
  - tzdata
  - xz >=5.2.6,<6.0a0
  constrains:
  - python_abi 3.10.* *_cp310
  license: Python-2.0
  purls: []
  size: 12336005
  timestamp: 1710939659384
- kind: conda
  name: readline
  version: '8.2'
  build: h92ec313_1
  build_number: 1
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/readline-8.2-h92ec313_1.conda
  sha256: a1dfa679ac3f6007362386576a704ad2d0d7a02e98f5d0b115f207a2da63e884
  md5: 8cbb776a2f641b943d413b3e19df71f4
  depends:
  - ncurses >=6.3,<7.0a0
  license: GPL-3.0-only
  license_family: GPL
  purls: []
  size: 250351
  timestamp: 1679532511311
- kind: conda
  name: tk
  version: 8.6.13
  build: h5083fa2_1
  build_number: 1
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/tk-8.6.13-h5083fa2_1.conda
  sha256: 72457ad031b4c048e5891f3f6cb27a53cb479db68a52d965f796910e71a403a8
  md5: b50a57ba89c32b62428b71a875291c9b
  depends:
  - libzlib >=1.2.13,<2.0.0a0
  license: TCL
  license_family: BSD
  purls: []
  size: 3145523
  timestamp: 1699202432999
- kind: pypi
  name: tomli
  version: 2.0.1
  url: https://files.pythonhosted.org/packages/97/75/10a9ebee3fd790d20926a90a2547f0bf78f371b2f13aa822c759680ca7b9/tomli-2.0.1-py3-none-any.whl
  sha256: 939de3e7a6161af0c887ef91b7d41a53e7c5a1ca976325f429cb46ea9bc30ecc
  requires_python: '>=3.7'
- kind: conda
  name: tzdata
  version: 2024a
  build: h0c530f3_0
  subdir: noarch
  noarch: generic
  url: https://conda.anaconda.org/conda-forge/noarch/tzdata-2024a-h0c530f3_0.conda
  sha256: 7b2b69c54ec62a243eb6fba2391b5e443421608c3ae5dbff938ad33ca8db5122
  md5: 161081fc7cec0bfda0d86d7cb595f8d8
  license: LicenseRef-Public-Domain
  purls: []
  size: 119815
  timestamp: 1706886945727
- kind: pypi
  name: xxhash
  version: 3.5.0
  url: https://files.pythonhosted.org/packages/16/e6/be5aa49580cd064a18200ab78e29b88b1127e1a8c7955eb8ecf81f2626eb/xxhash-3.5.0-cp310-cp310-macosx_11_0_arm64.whl
  sha256: 3171f693dbc2cef6477054a665dc255d996646b4023fe56cb4db80e26f4cc520
  requires_python: '>=3.7'
- kind: conda
  name: xz
  version: 5.2.6
  build: h57fd34a_0
  subdir: osx-arm64
  url: https://conda.anaconda.org/conda-forge/osx-arm64/xz-5.2.6-h57fd34a_0.tar.bz2
  sha256: 59d78af0c3e071021cfe82dc40134c19dab8cdf804324b62940f5c8cd71803ec
  md5: 39c6b54e94014701dd157f4f576ed211
  license: LGPL-2.1 and GPL-2.0
  purls: []
  size: 235693
  timestamp: 1660346961024

Issue description

When running the test from the command line with:

pixi run pytest

the test passes.

But when running the same test from PyCharm, it fails with these logs:

24/08/21 15:54:24 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/lev/work/git/tmp/pixi_example/tests/test_spark.py", line 1, in <module>
    import xxhash
ModuleNotFoundError: No module named 'xxhash'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:561)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:94)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:75)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:514)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
24/08/21 15:54:24 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (192.168.0.12 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/lev/work/git/tmp/pixi_example/tests/test_spark.py", line 1, in <module>
    import xxhash
ModuleNotFoundError: No module named 'xxhash'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:561)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:94)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:75)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:514)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

24/08/21 15:54:24 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

test_spark.py:17 (test_spark)
def test_spark():
        spark = (
            SparkSession.builder.appName("spark test")
            .master("local[1]")
            .getOrCreate()
        )

        my_udf = udf(to_xxhash, returnType=StringType())

>       spark.range(1,5).withColumn("id", my_udf("id")).show()

test_spark.py:27: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../.pixi/envs/default/lib/python3.10/site-packages/pyspark/sql/dataframe.py:901: in show
    print(self._jdf.showString(n, 20, vertical))
../.pixi/envs/default/lib/python3.10/site-packages/py4j/java_gateway.py:1322: in __call__
    return_value = get_return_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = ('xro45', <py4j.clientserver.JavaClient object at 0x10621c100>, 'o44', 'showString')
kw = {}
converted = PythonException('\n  An exception was thrown from the Python worker. Please see the stack trace below.\nTraceback (mos...1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\t... 1 more\n')

    def deco(*a: Any, **kw: Any) -> Any:
        try:
            return f(*a, **kw)
        except Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.errors.exceptions.captured.PythonException: 
E                 An exception was thrown from the Python worker. Please see the stack trace below.
E               Traceback (most recent call last):
E                 File "/Users/lev/work/git/tmp/pixi_example/tests/test_spark.py", line 1, in <module>
E                   import xxhash
E               ModuleNotFoundError: No module named 'xxhash'

../.pixi/envs/default/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py:175: PythonException

============================== 1 failed in 3.71s ===============================

Process finished with exit code 1
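One workaround worth trying (an assumption on my part, not something confirmed in this thread) is to pin PySpark's worker interpreter to the interpreter running the tests, before the SparkSession is created. `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are environment variables PySpark honors; whether they fix the PyCharm case specifically is untested here:

```python
import os
import sys

# Point Spark's Python workers at the same interpreter running this process,
# so they resolve site-packages from the pixi environment rather than
# whatever interpreter PyCharm's test runner exposes. Hypothetical workaround,
# not verified against this reproducer.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

If it helps, these lines would need to run at the top of tests/test_spark.py, before any pyspark import creates the session.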

Expected behavior

The test should pass.

ruben-arts commented 3 weeks ago

It sounds like Spark is running some kind of logic to figure out the imports on its own. Could you make your example project public?
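For context on why the worker process needs its own import: Spark ships UDFs to workers via (cloud)pickle, and pickling a module-level function stores a reference to its defining module rather than the code itself, so the unpickling process must be able to import that module on its own. A stdlib sketch of the same principle (an illustration, not Spark's actual code path):

```python
import pickle
from os.path import join  # an ordinary module-level function

# Pickling stores the reference "os.path" + "join", not the function's
# bytecode; whichever process calls loads() must `import os.path` itself.
blob = pickle.dumps(join)
restored = pickle.loads(blob)
assert restored("a", "b") == join("a", "b")
```

In the failing run, the worker's interpreter apparently cannot import `xxhash` at all, which is consistent with it running outside the pixi environment.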

lev112 commented 3 weeks ago

Could you make your example project public?

done

ruben-arts commented 3 weeks ago

I can't reproduce this; it gives me this error:

>                   raise RuntimeError("Java gateway process exited before sending its port number")

It looks like it expects me to have Java installed. I've never run Spark. Could you make a more in-depth reproducer?

lev112 commented 3 weeks ago

I've updated the repo and added the JVM. Can you please check?