Interestingly, if the pebble import happens first, the failure doesn't occur. However, this ordering can't always be guaranteed if pebble is being used within library code that the Spark code depends on.
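For reference, a minimal sketch of the ordering described above (the temp function and the ProcessPool usage here are only illustrative, not the original demo code):
from pebble import ProcessPool  # importing pebble before pyspark avoided the failure in the reporter's environment
import pyspark

def temp():
    print("inside test function")

if __name__ == "__main__":
    with ProcessPool() as pool:
        future = pool.schedule(temp)
        future.result()
        print("Test success")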
I am afraid pyspark and Python multiprocessing don't go well together.
The reason is that Spark itself is a platform designed to distribute workload onto multiple nodes, and therefore it does not account for the use of native multiprocessing. Moreover, where would you expect your processes to be spawned? On the driver or on the workers themselves? How would you instruct Spark to do so?
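For comparison, the idiomatic approach is to let Spark itself distribute the work across its executors rather than spawning native processes. A minimal sketch (the square function and app name are just placeholders):
from pyspark.sql import SparkSession

def square(x):
    # runs on the Spark executors, distributed by Spark itself
    return x * x

spark = SparkSession.builder.appName("example").getOrCreate()
print(spark.sparkContext.parallelize(range(10)).map(square).collect())
spark.stop()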
To reinforce the point, this is the result with a standard multiprocessing Pool.
import pyspark
from multiprocessing import Pool

# The pool workers are forked here, at import time, before temp is defined
EXECUTOR = Pool()

def temp():
    print("inside test function")

print("About to call test function in new process")
future = EXECUTOR.apply_async(temp)
print("Test function started, waiting for results")
res = future.get()  # never returns; see the output below
print("Test success")
Running it results in a visible crash in the worker, followed by the process hanging exactly as with Pebble.
$ python test.py
About to call test function in new process
Test function started, waiting for results
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.9/multiprocessing/pool.py", line 114, in worker
task = get()
File "/usr/lib/python3.9/multiprocessing/queues.py", line 368, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'temp' on <module '__main__' from '/home/noxdafox/test.py'>
Closing this issue, please re-open it in case of further discussion.
Pebble appears not to work correctly when pyspark is imported. Note that when a hung Python process is interrupted, the following stacktrace is observed:
Repro:
1) In a clean environment:
2) Demo code: