libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

RAM usage scales when more GPUs are used #120

Closed: chengxuz closed this issue 2 years ago

chengxuz commented 2 years ago

I notice that when my training process uses more GPUs, the RAM usage also increases, almost linearly with the number of GPUs. Specifically, 2-GPU training uses around 85 GB and 4-GPU training uses around 170 GB, while my dataset is 82 GB and I am using in_memory=True. Is this behavior expected?

One difference I see between my training process and the ffcv-imagenet way of training is how the processes are launched: I start my script with something like python -m torch.distributed.launch --nproc_per_node=4, while the ffcv-imagenet script is run as a normal Python script and calls ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True) internally. Could this be the reason?
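For reference, a minimal single-node sketch of the two launch styles being compared; the main_worker function, script name, and port number are hypothetical, not code from either script:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def main_worker(rank: int, world_size: int):
    # Both launch styles end up here, with one process per GPU.
    os.environ.setdefault("MASTER_ADDR", "localhost")  # already set by torch.distributed.launch
    os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port for the spawn case
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # ... build the FFCV Loader and train here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()

    if "LOCAL_RANK" in os.environ:
        # Style 1: python -m torch.distributed.launch --nproc_per_node=4 train.py
        # Recent PyTorch versions set LOCAL_RANK for each spawned process
        # (older versions pass a --local_rank argument instead).
        main_worker(int(os.environ["LOCAL_RANK"]), world_size)
    else:
        # Style 2: run python train.py directly and spawn the workers ourselves,
        # as ffcv-imagenet does with ch.multiprocessing.spawn(...).
        mp.spawn(main_worker, args=(world_size,), nprocs=world_size, join=True)
```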

GuillaumeLeclerc commented 2 years ago

Hello @chengxuz .

This is a known bug that has been fixed. The fix will be part of the next release, but you can already use it by switching to the v0.0.4 branch.

chengxuz commented 2 years ago

Hi @GuillaumeLeclerc ,

All my tests were actually performed on the v0.0.4 branch (which I used to fix the GPU memory bug), so the fix does not seem to work for me.

GuillaumeLeclerc commented 2 years ago

Thanks for your answer. Are you using our imagenet example or did you write your own? Two things:

Keep us updated!

chengxuz commented 2 years ago

There must be some misunderstanding here. I think you are talking about an earlier bug I reported here.

But here I am talking about computer memory (RAM, am I using the wrong term?), not GPU memory. I don't understand why you mention cuda set_device here; to my mind it is not related to RAM usage at all. It should instead be related to how the memory for loading the dataset is allocated (the in_memory choice).

GuillaumeLeclerc commented 2 years ago

Oh sorry, I indeed misunderstood. I am not sure what you mean by in_memory, but this behavior is expected with os_cache=False: each process (one per GPU) caches the data it needs.

If you use os_cache=False, make sure you use quasi-random ordering; otherwise you will end up with a copy of the dataset in each process's memory.
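For illustration, a minimal sketch of a per-process loader set up this way; the dataset path, batch size, and worker count are placeholder values, not taken from this issue:

```python
from ffcv.loader import Loader, OrderOption

loader = Loader(
    "/path/to/dataset.beton",        # hypothetical dataset path
    batch_size=128,
    num_workers=8,
    os_cache=False,                  # FFCV keeps its own cache inside each process
    order=OrderOption.QUASI_RANDOM,  # each process only caches the samples it will read
    distributed=True,                # shard samples across the DDP processes
)
```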

If you use os_cache=True, do you divide your batch size when you use multiple GPUs? The memory used by FFCV is proportional to the amount of data being prepared: two GPUs each with a batch size of 128 will use twice as much RAM as a single GPU with a batch size of 128.
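A sketch of that bookkeeping, with placeholder values and assuming the process group is already initialized:

```python
import torch.distributed as dist
from ffcv.loader import Loader, OrderOption

# Keep the *global* batch size fixed by giving each of the N processes a 1/N
# share, so the amount of data FFCV keeps in flight does not grow with the
# number of GPUs.
global_batch_size = 512
world_size = dist.get_world_size()                    # e.g. 4 when training on 4 GPUs
per_gpu_batch_size = global_batch_size // world_size  # 128 per process

loader = Loader(
    "/path/to/dataset.beton",       # hypothetical dataset path
    batch_size=per_gpu_batch_size,  # per-process batch size, not the global one
    num_workers=8,
    os_cache=True,                  # rely on the OS page cache, which is shared across processes
    order=OrderOption.RANDOM,
    distributed=True,
)
```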

GuillaumeLeclerc commented 2 years ago

Closing due to inactivity