nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Weird return from parallel_apply() #111

Open conraddd opened 3 years ago

conraddd commented 3 years ago

It seems the function is applied to the same row more than once when the number of workers (2) is close to the number of rows (2).

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=2, use_memory_fs=False)

df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])
print(df)

def test(row):
    row["a"] -= 1
    print("hello world")
    return row

print(df.parallel_apply(test, axis=1))

Both the "hello world" print and the subtraction are performed twice per row?

     a    b
0  123  222
1  121  567
hello world
hello world
hello world
hello world
     a    b
0  121  222
1  119  567

BrannonKing commented 3 years ago

While pandarallel may have an indexing issue that causes a row to be processed twice, you should also note that modifying rows in place is not a thread-safe operation. Your example code is invalid because of that.
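A minimal sketch of a non-mutating variant, assuming plain pandas `apply` (pandarallel is not exercised here). The name `decrement_a` is hypothetical. The pandas documentation for `DataFrame.apply` notes that functions which mutate the passed object are not supported, so the function below copies the row before changing it:

```python
import pandas as pd

df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])

def decrement_a(row):
    # Copy first so the Series handed in by apply() is never modified.
    out = row.copy()
    out["a"] -= 1
    return out

result = df.apply(decrement_a, axis=1)
print(result)  # column "a" is decremented once per row
print(df)      # the original frame is left untouched
```

Because the function no longer has side effects on its input, calling it more than once on the same row would still return the same value. Whether this changes the behaviour reported above under `parallel_apply` has not been checked here.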

vmarar commented 2 years ago

You should do this instead:

def test(row):
    row["a"] -= 1
    print("hello world")
    return row["a"]

df['a'] = df.parallel_apply(test, axis=1)

shermansiu commented 5 months ago

Cannot reproduce.

Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0

Setup:

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=2, use_memory_fs=False)

The code

df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])
print(df)

def test(row):
    row["a"] -= 1
    print("hello world")
    return row

print(df.apply(test, axis=1))

returns

     a    b
0  123  222
1  121  567
hello world
hello world
     a    b
0  122  222
1  120  567

Whereas

df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])
print(df)

def test(row):
    row["a"] -= 1
    print("hello world")
    return row

print(df.parallel_apply(test, axis=1))

returns

     a    b
0  123  222
1  121  567
hello worldhello world

     a    b
0  122  222
1  120  567

Aside from the difference in how "hello world" is printed, the output is the same.
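The comparison can also be made programmatically rather than by eye. The sketch below rebuilds the frames from the outputs printed in this thread (they are typed in by hand, not re-run), and the variable names are hypothetical. In a live session the hand-typed frames would be replaced by the actual results of `df.apply(...)` and `df.parallel_apply(...)`.

```python
import pandas as pd

# Frames copied from the printed outputs in this thread (not re-run here).
serial = pd.DataFrame({"a": [122, 120], "b": [222, 567]})
retest_parallel = pd.DataFrame({"a": [122, 120], "b": [222, 567]})
reported_parallel = pd.DataFrame({"a": [121, 119], "b": [222, 567]})

# DataFrame.equals compares shape, values, and dtypes.
print(serial.equals(retest_parallel))    # True
print(serial.equals(reported_parallel))  # False
```

`pandas.testing.assert_frame_equal` is an alternative that raises with a description of the first difference instead of returning a boolean.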