Memory and parallelism tuning

nalepae / pandarallel

A simple and efficient tool to parallelize Pandas operations on all available CPUs

https://nalepae.github.io/pandarallel

BSD 3-Clause "New" or "Revised" License


Memory and parallelism tuning #230

Open jamessmith123456 opened 1 year ago

jamessmith123456 commented 1 year ago

(1) It seems that memory issues cannot be solved when there is a large amount of data. (2) If the parallelism is 20, will the original data be copied 20 times? (3) How should I balance memory and CPU usage to set the optimal parameters?

nalepae commented 1 year ago

(1): Pandarallel basically doubles the amount of needed memory, as stated in the documentation:

pandarallel gets around this limitation by using all cores of your computer. But, in return, pandarallel need twice the memory that standard pandas operation would normally use.

(2): No, the original data will be copied only once, whatever the parallelism.

(3): There is no coordination relationship between CPU and memory: the data is copied once regardless of the number of workers (cf. (2)).
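
As a rough illustration of point (1), the sketch below estimates the peak memory from the DataFrame's own footprint. The helper name `estimated_peak_bytes` and the `factor` parameter are hypothetical; the factor of 2 is taken from the documentation quote above and has not been measured here.

```python
import pandas as pd


def estimated_peak_bytes(df: pd.DataFrame, factor: float = 2.0) -> int:
    """Rough peak-memory estimate for a parallel apply on `df`.

    `factor=2.0` mirrors the "twice the memory" statement quoted above;
    it is an assumption, not a measurement.
    """
    # deep=True also counts the contents of object (e.g. string) columns.
    base = int(df.memory_usage(deep=True).sum())
    return int(base * factor)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(1_000_000)})
    print(f"estimated peak: {estimated_peak_bytes(df) / 1e6:.1f} MB")
```

Comparing this number with the RAM that is actually free gives a quick check before starting a large job.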

SysuJayce commented 1 year ago

Hi @nalepae, if the amount of data is quite large, how can we boost the preparation before apply()?

If I have 100 GB of data read into memory, I have to wait a long time before the apply starts.

nalepae commented 5 months ago

Pandaral·lel is looking for a maintainer! If you are interested, please open a GitHub issue.

shermansiu commented 2 months ago

@SysuJayce, what do you mean by "boosting the preparation"?

If you are memory-bound, I would suggest breaking up your dataframe into smaller shards and applying your function to each shard.
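
A minimal sketch of that sharding idea, assuming a row-wise function: the helper name `apply_in_shards` and its `shard_rows` and `use_pandarallel` parameters are hypothetical, not part of pandarallel. By default it uses plain `DataFrame.apply` so it runs without pandarallel installed; with `use_pandarallel=True` it calls `parallel_apply`, which requires `pandarallel.initialize()` to have been called first.

```python
import pandas as pd


def apply_in_shards(df, func, shard_rows=100_000, use_pandarallel=False):
    """Apply `func` row-wise to `df`, one shard of rows at a time."""
    results = []
    for start in range(0, len(df), shard_rows):
        # Only this slice of rows is processed in the current iteration.
        shard = df.iloc[start:start + shard_rows]
        if use_pandarallel:
            # Requires `from pandarallel import pandarallel` and
            # `pandarallel.initialize()` to have been run beforehand.
            result = shard.parallel_apply(func, axis=1)
        else:
            result = shard.apply(func, axis=1)
        results.append(result)
    if not results:
        # Empty input: nothing to concatenate, fall back to a direct apply.
        return df.apply(func, axis=1)
    return pd.concat(results)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
    out = apply_in_shards(df, lambda row: row["a"] + row["b"], shard_rows=3)
    print(out.tolist())
```

The shard size trades memory for overhead: smaller shards should keep less data in flight at once, but each shard pays the per-call setup cost again, so the right value depends on the data and is best found by experiment.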

Do you have any other problems? If not, I would like to close this issue.