xinghaow99 closed this issue 3 days ago.
I'm not sure if I fully understand your issue, but would it help if you run this only on one process? I.e.:

```python
if accelerator.is_main_process:  # is_main_process is a property, not a method
    # your code
```
@BenjaminBossan Thank you for your reply.
I tried:

```python
if accelerator.is_main_process:
    batch_outputs = batch_outputs.to('cpu')
    for j in batch_outputs:
        layer_outputs.append(j)
```

But the whole process hangs, probably because the `layer_outputs` list (which should be a list of tensors on CPU) is empty except on the main process.
Hmm, not sure why it would hang in that case. Hopefully someone else has a good idea here.
I managed to solve my problem with native `torch.distributed` operations. Thanks for the help!
It hangs because we need to declare the catch-up point: the non-main processes skip the block and race ahead, so they need an explicit barrier to wait at. E.g.:

```python
if accelerator.is_main_process:
    batch_outputs = batch_outputs.to('cpu')
    for j in batch_outputs:
        layer_outputs.append(j)

# barrier: all processes block here until the main process catches up
accelerator.wait_for_everyone()
```
Would love to know your solution though @xinghaow99, and I can translate it over to what should work in Accelerate :)
Hi @muellerzr.

I used `torch.distributed.scatter_object_list()` instead of `accelerator.split_between_processes()`, and I refactored my script to avoid using `gather()`. I'm new to Accelerate and not sure whether the problem is with Accelerate itself.

I suspect the problem arises when using `gather()`: since it syncs the object across all devices, the memory consumption on each device is multiplied by the number of processes (8x if there are 8 processes). After calling `.cpu()`, these copies on different devices each take up separate space in CPU memory, so they occupy 8x the space they should. Maybe this can be avoided by using `torch.distributed.gather_object()`? (I'm not sure.)

I hope this is informative to you.
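Roughly, the pattern I ended up with looks like the sketch below (illustrative names, assuming the process group is already initialized by `accelerate launch`/`torchrun`; not my exact script):

```python
import torch
import torch.distributed as dist

def run_layer(layer, all_batches):
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Scatter: rank 0 splits the work into one shard per process.
    # scatter_object_list writes the received shard into a 1-element list.
    shard = [None]
    shards = [all_batches[i::world_size] for i in range(world_size)] if rank == 0 else None
    dist.scatter_object_list(shard, shards, src=0)

    # Each rank computes on its own shard and keeps the results on CPU.
    outputs = []
    with torch.no_grad():
        for batch in shard[0]:
            outputs.append(layer(batch.to(rank)).cpu())

    # Gather the CPU results onto rank 0 only, instead of
    # all-gathering tensors onto every device.
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(outputs, gathered, dst=0)

    if rank == 0:
        # Flatten the per-rank lists into one list of CPU tensors.
        return [t for part in gathered for t in part]
    return None
```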
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Hi! I'm doing multi-GPU inference of a specific layer of a model and trying to save its inputs and outputs. To save GPU RAM, I only move the tensors to the GPUs when computing. Here is my code:
The issue is that when `batch_outputs = batch_outputs.to('cpu')` happens, it takes `accelerator.num_processes` times more memory than expected.

Expected behavior
I want to save the gathered inference outputs into the `layer_outputs` list only once. The memory usage should be the same as for `layer_inputs`.
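In other words, something like the sketch below is the behavior I'm after (illustrative names, not my actual script): every process participates in the collective gather, but only the main process keeps a CPU copy, so CPU memory stays at 1x:

```python
# each process computes its shard of the batch on its own GPU
batch_outputs = layer(batch_inputs)

# gather() is a collective op, so every process must call it;
# it returns the outputs of all processes concatenated on each device
gathered = accelerator.gather(batch_outputs)

# only the main process copies to CPU and stores the result,
# so the CPU copy exists exactly once
if accelerator.is_main_process:
    for j in gathered.cpu():
        layer_outputs.append(j)

# barrier so the other processes wait for the main process
accelerator.wait_for_everyone()
```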