uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

Pool.imap runs indefinitely on a Windows machine #211

Open lucazav opened 3 years ago

lucazav commented 3 years ago

I'm trying to parallelize a row-wise pandas DataFrame.apply() call, as I described in this Stack Overflow question. Following albert's hint, I ran the following code in a conda environment with Python 3.9.1 (64-bit) on a Windows machine:

import pandas as pd
import time
from pathos.multiprocessing import Pool

def enrich_str(str):

    val1 = f'{str}_1'
    val2 = f'{str}_2'
    val3 = f'{str}_3'
    time.sleep(3)

    return val1, val2, val3

def enrich_row(row_tuple):
    passed_row = row_tuple[1]
    col_name = str(passed_row['colName'])
    my_string = str(passed_row[col_name])

    val1, val2, val3 = enrich_str(my_string)

    passed_row['enriched1'] = val1
    passed_row['enriched2'] = val2
    passed_row['enriched3'] = val3

    return passed_row

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5], 'colors': ['red', 'white', 'blue', 'orange', 'red']}, 
                  columns=['numbers', 'colors'])

df['colName'] = 'colors'

tic = time.perf_counter()
result = Pool(8).imap(enrich_row, df.iterrows(), chunksize=1)
enriched_df = pd.DataFrame(result)
toc = time.perf_counter()

print(f"{enriched_df.shape[0]} rows enriched in {toc - tic:0.4f} seconds")
print(enriched_df)

Unfortunately, it runs indefinitely on my machine, with all cores at 100%. Any hints?

mmckerns commented 3 years ago

@lucazav: I tested this on a Mac, and it works for me. So, I'm going to assume that it's a Windows issue. Python spawns processes on Windows differently than on other systems, and there are a few workarounds when things get stuck on Windows.

On Windows, you should generally use pathos.helpers.freeze_support(), which requires an if __name__ == '__main__': block of code. There's also multiprocess.set_start_method to change the character of the Pool, but I don't have a lot of experience with that function on Windows, so I'm not sure if it's as functional as it is on a Mac. I'd set that aside for now. Going back to freeze_support: if you find it throws an error once freeze_support is added, then the next natural step would be either to use set_start_method (to change how pools are created) or to use dill.settings['recurse'] = True (to change how objects are serialized).

Give freeze_support a try.
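
For reference, a minimal sketch of what the guarded layout could look like. On Windows, each worker process re-imports the main module, so a Pool created at module level gets created again in every child. Moving the pool creation under the if __name__ == '__main__': guard avoids that. The sketch leaves out pandas and shortens the sleep to stay small. The main() helper, the pool size of 4, and the fallback import from the standard library are assumptions made for illustration, not something taken from the thread.

```python
import time

try:
    from pathos.helpers import freeze_support
    from pathos.multiprocessing import Pool
except ImportError:
    # Assumption for this sketch only: fall back to the standard library
    # when pathos is not installed; the guard pattern is the same.
    from multiprocessing import Pool, freeze_support


def enrich_str(text):
    """Stand-in for the slow per-row work from the question."""
    time.sleep(0.1)
    return f'{text}_1', f'{text}_2', f'{text}_3'


def main():
    """Hypothetical helper: run enrich_str over a few values in a pool."""
    colors = ['red', 'white', 'blue', 'orange', 'red']
    # Create the pool only inside main(), so child processes that
    # re-import this module on Windows do not create pools of their own.
    pool = Pool(4)
    try:
        return list(pool.imap(enrich_str, colors, chunksize=1))
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    freeze_support()
    for row in main():
        print(row)
```

The same structure should carry over to the DataFrame version: keep enrich_str and enrich_row at module level, and move everything from the DataFrame construction through the final print calls into the guarded block.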