Open conraddd opened 3 years ago
While Pandarallel may have some indexing issue causing a row to be hit twice, you should also note that modifying rows is not a thread-safe operation. Your example code is invalid because of that.
You should do this instead:
def test(row):
    row["a"] -= 1
    print("hello world")
    return row['a']
df['a'] = df.parallel_apply(test, axis=1)
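For comparison, here is a minimal sketch of a variant that does not modify the row at all, so calling it more than once on the same row cannot change the result. It uses plain pandas only, and `decrement_a` and `new_a` are hypothetical names. With pandarallel initialized, `df.apply` could presumably be swapped for `df.parallel_apply`.

```python
import pandas as pd

df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])

def decrement_a(row):
    # Compute the new value without writing to the row that was passed in.
    return row["a"] - 1

# apply() with axis=1 calls the function once per row and collects the
# returned scalars into a Series aligned with df's index.
new_a = df.apply(decrement_a, axis=1)

# Only the caller assigns the result back to the DataFrame.
df["a"] = new_a
print(df)
```

Since the function only reads from the row, the result does not depend on how many times a worker calls it.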
Cannot reproduce.
Python: 3.10.13
Pandarallel: 1.6.5
Pandas: 2.2.0
Setup:
import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=2, use_memory_fs=False)
The code
df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])
print(df)
def test(row):
    row["a"] -= 1
    print("hello world")
    return row
print(df.apply(test, axis=1))
returns
     a    b
0  123  222
1  121  567
hello world
hello world
     a    b
0  122  222
1  120  567
Whereas
df = pd.DataFrame([{"a": 123, "b": 222}, {"a": 121, "b": 567}])
print(df)
def test(row):
    row["a"] -= 1
    print("hello world")
    return row
print(df.parallel_apply(test, axis=1))
returns
     a    b
0  123  222
1  121  567
hello worldhello world
     a    b
0  122  222
1  120  567
Aside from the difference in how "hello world" is printed, the output is the same.
It seems the function is applied to the same row more than once when the number of workers (2) is close to the number of rows (2).
Are "hello world" and the subtraction performed twice per row?