nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs
https://nalepae.github.io/pandarallel
BSD 3-Clause "New" or "Revised" License

Memory and parallelism tuning #230

Open jamessmith123456 opened 1 year ago

jamessmith123456 commented 1 year ago

(1) Memory issues do not seem to be solvable when the amount of data is large. (2) If the parallelism is 20, will the original data be copied 20 times? (3) How should I balance memory and CPU usage to choose the optimal parameters?

nalepae commented 1 year ago

(1): Pandarallel basically doubles the amount of needed memory, as stated in the documentation:

pandarallel gets around this limitation by using all cores of your computer. But, in return, pandarallel need twice the memory that standard pandas operation would normally use.

(2): No, the original data will be copied only once, whatever the parallelism.

(3): There is no trade-off to tune between CPU and memory, since the data is copied only once regardless of the parallelism (cf. (2)).
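
As a rough illustration of point (1), the sketch below estimates the peak memory for a DataFrame by multiplying its in-memory footprint by a factor of 2, taken from the documentation statement quoted above. The helper name `estimated_peak_bytes` and the `factor` parameter are hypothetical and not part of pandarallel. The result is an approximation, not a measurement.

```python
import pandas as pd


def estimated_peak_bytes(df: pd.DataFrame, factor: float = 2.0) -> int:
    """Rough peak-memory estimate: DataFrame footprint times a factor.

    factor=2.0 is an assumption based on the "twice the memory"
    statement in the documentation; real usage may differ.
    """
    # deep=True also counts the contents of object (e.g. string) columns
    footprint = int(df.memory_usage(deep=True).sum())
    return int(footprint * factor)


# Small synthetic DataFrame, used only to demonstrate the helper
df = pd.DataFrame({"a": range(1_000), "b": [float(i) for i in range(1_000)]})
footprint = int(df.memory_usage(deep=True).sum())
print(f"footprint: {footprint} bytes, estimated peak: {estimated_peak_bytes(df)} bytes")
```

Comparing this estimate with the memory actually available on the machine gives a quick check of whether the operation is likely to fit.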

SysuJayce commented 1 year ago

Hi @nalepae, if the amount of data is quite large, how can we boost the preparation before apply()?

If I have 100 GB of data loaded in memory, I have to wait a long time before the apply starts.

nalepae commented 5 months ago

Pandaral·lel is looking for a maintainer! If you are interested, please open a GitHub issue.

shermansiu commented 2 months ago

@SysuJayce, what do you mean by "boosting the preparation"?

If you are memory-bound, I would suggest breaking up your dataframe into smaller shards and applying your function to each shard.
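
A minimal sketch of that sharding idea, assuming a row-wise function and a fixed number of rows per shard. The helper `apply_in_shards` and its `shard_rows` parameter are hypothetical names, not part of pandarallel. Plain `apply` is used so the example runs without pandarallel installed. After calling `pandarallel.initialize()`, `shard.parallel_apply(func, axis=1)` could be used in its place.

```python
import pandas as pd


def apply_in_shards(df: pd.DataFrame, func, shard_rows: int) -> pd.Series:
    """Apply func row-wise to df one shard at a time, then concatenate."""
    if shard_rows <= 0:
        raise ValueError("shard_rows must be positive")
    results = []
    for start in range(0, len(df), shard_rows):
        # Positional slice: only this shard is handed to apply at a time
        shard = df.iloc[start:start + shard_rows]
        # Swap in shard.parallel_apply(func, axis=1) when using pandarallel
        results.append(shard.apply(func, axis=1))
    if not results:
        # Empty input: nothing to concatenate
        return pd.Series(dtype="object")
    # Shards are processed in order, so the original row order is preserved
    return pd.concat(results)


df = pd.DataFrame({"x": range(10), "y": range(10, 20)})
result = apply_in_shards(df, lambda row: row["x"] + row["y"], shard_rows=3)
print(result.tolist())
```

With this approach, any extra copy made for the parallel step should cover only one shard rather than the whole dataframe, although the accumulated results still have to fit in memory.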

Do you have any other problems? If not, I would like to close this issue.