ChristianRue opened 4 years ago
Hi guys, I have the same issue when running all the examples you provide here: https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb
Running them with Python 3.8 and this Pipfile:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
flake8 = "*"

[packages]
fastapi = "*"
uvicorn = "*"
pyyaml = "*"
pandas = "*"
psycopg2 = ">=2.8.4"
colorlog = "*"
shapely = "*"
tqdm = "*"
googlemaps = "*"
timezonefinder = "*"
python-levenshtein = "*"
boto3 = "*"
polyline = "*"
geopandas = "*"
scipy = "*"
sklearn = "*"
colour = "*"
folium = "*"
matplotlib = "*"
seaborn = "*"
googleads = "*"
holoviews = "*"
console = "*"
nox = "*"
pytest = "*"
flake8 = "*"
coverage = "*"
pytest-cov = "*"
celery = "*"
redis = "*"
python-multipart = "*"
xlrd = "*"
jupyterlab = "*"
nbconvert = "*"
ipywidgets = "*"
rtree = "*"
pandarallel = "*"

[requires]
python_version = "3.8"
I have the same problem here! Using pandarallel on Windows 10 and Jupyter notebook.
Getting the same issue on macOS in Python 3.8
Package Version
black 19.10b0
numpy 1.17.4
pandarallel 1.4.6
pandas 0.25.3
pip 20.0.2
Edit: I tried the same code in Python 3.7.3, and it works with no problem.
Getting the same issue: Windows 10 and Spyder
Having the same issue on Windows 10 with Jupyter Notebook
I managed to fix this, inspired by this
Take a look at this issue
But something behaves very strangely:
The global scope is no longer visible. For example, I have to import pandas again in order to use it in the function passed to apply; otherwise it throws NameError: name 'pd' is not defined.
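A minimal sketch of that workaround (the toy function and data are mine, not from the thread; it assumes pandarallel is initialized as usual):

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"a": [1, 2, 3]})

def double(x):
    # Re-import pandas here: under the "spawn" start method the worker
    # process does not inherit the parent's globals, so the module-level
    # `pd` is not visible and would raise NameError.
    import pandas as pd
    return pd.to_numeric(x) * 2

print(df["a"].parallel_apply(double))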
The other issue is unrelated to this one: I have to put my entire code under if __name__ == '__main__':, or I'll encounter issue 76. I'm not sure of the proper way to handle it.
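For illustration, the guard looks like this (a minimal sketch with toy data; under "spawn", each worker re-imports the main module, so unguarded top-level code would run again in every worker):

import pandas as pd
from pandarallel import pandarallel

def main():
    pandarallel.initialize()
    df = pd.DataFrame({"a": range(10)})
    print(df["a"].parallel_apply(lambda x: x ** 2))

if __name__ == "__main__":
    # Only the parent process executes this block; spawned workers skip it.
    main()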
Since I didn't solve it completely, and I don't have the time or energy to keep working on it, I won't submit a pull request for now.
Here is my patch:
diff --git a/pandarallel/pandarallel.py b/pandarallel/pandarallel.py
index b7783ea..c3e918d 100644
--- a/pandarallel/pandarallel.py
+++ b/pandarallel/pandarallel.py
@@ -64,96 +64,112 @@ def is_memory_fs_available():
return os.path.exists(MEMORY_FS_ROOT)
-def prepare_worker(use_memory_fs):
- def closure(function):
- def wrapper(worker_args):
- """This function runs on WORKERS.
-
- If Memory File System is used:
- 1. Load all pickled files (previously dumped by the MASTER) in the
- Memory File System
- 2. Undill the function to apply (for lambda functions)
- 3. Tell to the MASTER the input file has been read (so the MASTER can remove it
- from the memory
- 4. Apply the function
- 5. Pickle the result in the Memory File System (so the Master can read it)
- 6. Tell the master task is finished
-
- If Memory File System is not used, steps are the same except 1. and 5. which are
- skipped.
- """
- if use_memory_fs:
- (
- input_file_path,
- output_file_path,
- index,
- meta_args,
- queue,
- progress_bar,
- dilled_func,
- args,
- kwargs,
- ) = worker_args
-
- try:
- with open(input_file_path, "rb") as file:
- data = pickle.load(file)
- queue.put((INPUT_FILE_READ, index))
-
- result = function(
- data,
- index,
- meta_args,
- queue,
- progress_bar,
- dill.loads(dilled_func),
- *args,
- **kwargs
- )
+class prepare_worker_with_memory_fs:
+ def __init__(self, func):
+ self.func = func
+
+ def __call__(self, worker_args):
+ """This function runs on WORKERS.
+
+ If Memory File System is used:
+ 1. Load all pickled files (previously dumped by the MASTER) in the
+ Memory File System
+ 2. Undill the function to apply (for lambda functions)
+ 3. Tell to the MASTER the input file has been read (so the MASTER can remove it
+ from the memory
+ 4. Apply the function
+ 5. Pickle the result in the Memory File System (so the Master can read it)
+ 6. Tell the master task is finished
+
+ If Memory File System is not used, steps are the same except 1. and 5. which are
+ skipped.
+ """
+ (
+ input_file_path,
+ output_file_path,
+ index,
+ meta_args,
+ queue,
+ progress_bar,
+ dilled_func,
+ args,
+ kwargs,
+ ) = worker_args
- with open(output_file_path, "wb") as file:
- pickle.dump(result, file)
+ try:
+ with open(input_file_path, "rb") as file:
+ data = pickle.load(file)
+ queue.put((INPUT_FILE_READ, index))
- queue.put((VALUE, index))
+ result = self.func(
+ data,
+ index,
+ meta_args,
+ queue,
+ progress_bar,
+ dill.loads(dilled_func),
+ *args,
+ **kwargs
+ )
- except Exception:
- queue.put((ERROR, index))
- raise
- else:
- (
- data,
- index,
- meta_args,
- queue,
- progress_bar,
- dilled_func,
- args,
- kwargs,
- ) = worker_args
-
- try:
- result = function(
- data,
- index,
- meta_args,
- queue,
- progress_bar,
- dill.loads(dilled_func),
- *args,
- **kwargs
- )
- queue.put((VALUE, index))
+ with open(output_file_path, "wb") as file:
+ pickle.dump(result, file)
- return result
+ queue.put((VALUE, index))
- except Exception:
- queue.put((ERROR, index))
- raise
+ except Exception:
+ queue.put((ERROR, index))
+ raise
- return wrapper
+class prepare_worker_without_memory_fs:
+ def __init__(self, func):
+ self.func = func
- return closure
+ def __call__(self, worker_args):
+ """This function runs on WORKERS.
+
+ If Memory File System is used:
+ 1. Load all pickled files (previously dumped by the MASTER) in the
+ Memory File System
+ 2. Undill the function to apply (for lambda functions)
+ 3. Tell to the MASTER the input file has been read (so the MASTER can remove it
+ from the memory
+ 4. Apply the function
+ 5. Pickle the result in the Memory File System (so the Master can read it)
+ 6. Tell the master task is finished
+
+ If Memory File System is not used, steps are the same except 1. and 5. which are
+ skipped.
+ """
+ (
+ data,
+ index,
+ meta_args,
+ queue,
+ progress_bar,
+ dilled_func,
+ args,
+ kwargs,
+ ) = worker_args
+
+ try:
+ result = self.func(
+ data,
+ index,
+ meta_args,
+ queue,
+ progress_bar,
+ dill.loads(dilled_func),
+ *args,
+ **kwargs
+ )
+ queue.put((VALUE, index))
+
+ return result
+ except Exception:
+ queue.put((ERROR, index))
+ raise
def create_temp_files(nb_files):
"""Create temporary files in Memory File System."""
@@ -438,9 +454,14 @@ def parallelize(
nb_workers = len(chunk_lengths)
try:
- pool = Pool(
- nb_workers, worker_init, (prepare_worker(use_memory_fs)(worker),),
- )
+ if use_memory_fs:
+ pool = Pool(
+ nb_workers, worker_init, (prepare_worker_with_memory_fs(worker),),
+ )
+ else:
+ pool = Pool(
+ nb_workers, worker_init, (prepare_worker_without_memory_fs(worker),),
+ )
map_result = pool.map_async(global_worker, workers_args)
pool.close()
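If it helps to understand the patch: with the "spawn" start method, everything handed to a worker has to be picklable. pickle can serialize an instance of a module-level class (it stores the class's import path plus the instance state), but not a function defined inside another function, which is what the old prepare_worker returned. A standalone sketch of the difference (names are mine, not from pandarallel):

import pickle

def make_closure(tag):
    # Mirrors the old prepare_worker: a function built inside a function.
    def worker(x):
        return f"{tag}: {x}"
    return worker

class Worker:
    # Mirrors the patched classes: importable at module level, so pickle
    # can reconstruct it inside a spawned worker process.
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, x):
        return f"{self.tag}: {x}"

pickle.dumps(Worker("ok"))  # succeeds

try:
    pickle.dumps(make_closure("ko"))  # fails: local objects can't be pickled
except (AttributeError, pickle.PicklingError) as exc:
    print(exc)  # Can't pickle local object 'make_closure.<locals>.worker'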
I'm running into this as well. In case it's of any help, I think the issue may only occur when you are using the "spawn" start method of multiprocessing (i.e. multiprocessing.set_start_method("spawn")), which is the default under Windows and macOS (at least in Python 3.8 and newer, since "fork" is unreliable on macOS; see https://bugs.python.org/issue33725).
Would be great to see a fix for this implemented.
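For anyone triaging, the start method in effect can be checked with the standard library alone (a minimal sketch):

import multiprocessing

# "fork" on Linux, "spawn" on Windows, and "spawn" on macOS since Python 3.8.
print(multiprocessing.get_start_method())

# Forcing "spawn" on Linux should reproduce the platform-dependent failure;
# note that set_start_method() normally may only be called once per process.
multiprocessing.set_start_method("spawn", force=True)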
I can reproduce this on Linux:

Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0
Numpy: 1.26.4
import numpy as np
import pandas as pd
import pandarallel

# Force "spawn" (the Windows/macOS default) on Linux.
pandarallel.core.CONTEXT = pandarallel.core.multiprocessing.get_context('spawn')
pandarallel.pandarallel.initialize()

# 80 rows of random floats in 10 groups of 8 rows each.
df = pd.DataFrame(np.random.rand(240).reshape(80, 3), columns=list('abc'))
df['id'] = np.arange(80) % 10

# Each group's 8 values are stacked into a 2x8 frame; this call triggers the bug.
df.groupby('id')[['a']].parallel_apply(
    lambda x: pd.DataFrame(np.array([x.values.flatten()] * 2), columns=list('abcdefgh'))
)
This bug only appears for me when using the "spawn" start method.
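If the report above holds, pointing the same repro back at the default Linux context should make it pass; a sketch, reusing the pandarallel.core.CONTEXT knob from the snippet above:

import pandarallel

# Default start method on Linux; the report above says the failure only
# occurs under "spawn", so this configuration should succeed.
pandarallel.core.CONTEXT = pandarallel.core.multiprocessing.get_context('fork')
pandarallel.pandarallel.initialize()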
I would like to run pandarallel in a Jupyter notebook on AWS SageMaker. However, even in the most basic examples I get the following error message:
This was thrown when running