huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

AzureML MPI Jobs and Accelerate - CUDA Out of Memory #1836

Closed. FrsECM closed this issue 11 months ago.

FrsECM commented 12 months ago

System Info

- Accelerate version: 0.21.0
- Platform: Ubuntu 20.04.6 LTS (Focal Fossa)
- Python version: Python 3.8.17
- PyTorch version (GPU?): 2.0-cuda11.7 (True)

Reproduction

Steps to reproduce the behavior:
1 - Create a working training job without accelerate
2 - Run it in AzureML with a command job (SDK v2.0)
3 - Modify the training job so that it is compatible with accelerate
4 - Edit the job's YAML file to run it with MPI (as far as I know, it is the only compatible way to launch an accelerate job in AzureML)

At the beginning of the job, I create my dataset and prepare everything to run in multi-GPU:

# train_nlp_model.py
# After model and dataset instantiation
get_gpu_memory('Before Prepare')
with os.popen(cmd="nvidia-smi") as cmd:
    print(cmd.read())
model, optimizer, train_loader, val_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, val_loader, scheduler, device_placement=device_placements
)
get_gpu_memory('After Prepare')
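
Note: `get_gpu_memory` is just a small helper whose implementation is not shown here; a minimal sketch of what such a helper could look like, using only `torch.cuda` memory statistics, would be:

```python
import torch

def get_gpu_memory(tag: str) -> None:
    # Hypothetical helper: print allocated/reserved CUDA memory per visible device.
    for i in range(torch.cuda.device_count()):
        allocated_mb = torch.cuda.memory_allocated(i) / 1024**2
        reserved_mb = torch.cuda.memory_reserved(i) / 1024**2
        print(f"[{tag}] cuda:{i} allocated={allocated_mb:.0f}MB reserved={reserved_mb:.0f}MB")
```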

If I run it on a single node without MPI:

jobs:
  train_nlp_model:
    code: ../../../ # Relative from yaml file in order to get the root directory.
    command: >-
      python jobs/training/train_nlp_model/train_nlp_model.py
...

It runs correctly and consumes only 714 MB of GPU memory.

Now if I add parameters in order to distribute the training with MPI:

```yaml
jobs:
  train_nlp_model:
    resources:
      instance_count: 2
      shm_size: 16g
    distribution:
      type: mpi # Necessary to work with accelerate. https://github.com/huggingface/accelerate#launching-multi-cpu-run-using-mpi
      process_count_per_instance: 1
    code: ../../../ # Relative from yaml file in order to get the root directory.
    command: >-
      python jobs/training/train_nlp_model/train_nlp_model.py
...
```

As you can see, I face an OutOfMemory error when I prepare my model/dataloader/etc.

Of course, every single parameter in both jobs is the same, except MPI.

Do you know what the issue could be? Is it possible to run an accelerate job with torchrun?

Expected behavior

The job should run the same way on multiple nodes and on a single node in AzureML.

Thanks, Regards
sgugger commented 12 months ago

You can run your job with torchrun which is what Accelerate does behind the scenes. To investigate what went wrong, we need you to provide your accelerate config and the way you are launching your script.
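
As a quick sanity check (a minimal sketch, not specific to AzureML), printing the resolved Accelerator state at startup shows which distributed backend, how many processes and which device each rank actually picked up:

```python
from accelerate import Accelerator

accelerator = Accelerator()
# AcceleratorState summary: distributed type, number of processes, mixed precision, ...
print(accelerator.state)
print(f"process {accelerator.process_index}/{accelerator.num_processes} on {accelerator.device}")
```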

FrsECM commented 12 months ago

I use the default configuration; I don't use any specific configuration for the moment. I'm just getting started with accelerate. Is it required to create a specific one?

With accelerate config I only saw AWS and local machine options, so I preferred to set nothing.
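
If a config file turns out to be needed, one non-interactive option (a sketch, assuming the defaults are an acceptable starting point) is `accelerate.utils.write_basic_config`, which writes a basic default config without going through the `accelerate config` questionnaire:

```python
from accelerate.utils import write_basic_config

# Writes a default config file to the accelerate cache location,
# skipping the interactive `accelerate config` prompts.
write_basic_config(mixed_precision="no")
```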

The job runs on a cluster. I'll do another try using the PyTorch distributed configuration in AzureML.

Behind the scenes, I think AzureML is running mpirun. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2

I do the same, except that I configure my job with a YAML file following this schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

I know that when using PyTorch, AzureML sets the distributed environment variables on each machine, and so on; that's why I'll try torchrun.
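
For debugging, a small sketch (assuming the standard torch.distributed variables) that prints what the launcher actually set on each process:

```python
import os

# Standard torch.distributed / torchrun environment variables; AzureML's
# `distribution: pytorch` mode is expected to populate these on every process.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```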

FrsECM commented 12 months ago

It seems that changing the YAML file fixed the first problem!

jobs:
  train_nlp_model:
    resources:
      instance_count: 2
      shm_size: 16g
    distribution:
      type: pytorch # Should work https://github.com/huggingface/accelerate/issues/1836#issuecomment-1674451105
      process_count_per_instance: 1
    code: ../../../ # Relative from yaml file in order to get the root directory.
    command: >-
      python jobs/training/train_nlp_model/train_nlp_model.py
...

But it generates another error:

It seems that when (and only when) we are in multi-GPU mode, there is an error caused by an in-place operation:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 50]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
3d61bf228f9c4ef587f2aecd4d30901d000003:15:15 [0] NCCL INFO comm 0x14c98dc95010 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE

I tried running the job locally on a small GPU with anomaly detection enabled in order to debug... no anomaly is detected...

Here is the piece of code involved:

with torch.autograd.set_detect_anomaly(True):
    with accelerator.autocast():
        optimizer.zero_grad()
        token_1, token_2 = batch
        # We compute the embeddings of both tokenized inputs
        emb_anchor = torch.mean(model(**token_1).last_hidden_state, dim=1)
        emb_pos = torch.mean(model(**token_2).last_hidden_state, dim=1)
        if norm_embedding:
            emb_anchor = emb_anchor / emb_anchor.norm(p=2, dim=1)[..., None]
            emb_pos = emb_pos / emb_pos.norm(p=2, dim=1)[..., None]
        contrastive_loss = loss(emb_anchor, emb_pos)
        accelerator.backward(contrastive_loss)
        optimizer.step()

Locally:

Destination Directory - tmp/SBERT_CONTRASTIVE
Epoch 1/10 ----------------------
Training - SBERT:   0%|                                          | 2/21695 [00:12<37:24:47,  6.21s/it, ContrastiveLoss=1.03]
FrsECM commented 12 months ago

It seems that the first problem is fixed by moving to pytorch instead of mpi. But the second one is trickier. I have a Siamese network, and it seems that I am hitting known issues from the transformers library.

I have no idea how to fix it, and it's very hard to debug because it is specific to multi-GPU setups.

FrsECM commented 11 months ago

It seems I've found a solution for the second issue. I needed to encapsulate the two forward passes and the loss into a single nn.Module, and initialise the Accelerator with find_unused_parameters=True (see the code below).

It seems that in multi-GPU mode, PyTorch (DDP) is less tolerant of unused model parameters, which is my case because I only use the last_hidden_state of the BERT model.

Here is an example of what this looks like:

# Siamese model encapsulating the BERT encoder and the loss.
# We create a single module (important for multi-GPU / DDP):
# https://github.com/pytorch/pytorch/issues/62474#issuecomment-918891965
from typing import Tuple

import torch
from torch import nn
from transformers import BertModel

from accelerate import Accelerator, DistributedDataParallelKwargs

# ContrastiveRollingLoss is the project's own loss class (defined elsewhere).


class SiameseBert(nn.Module):
    def __init__(self, bert_model: BertModel, loss: 'ContrastiveRollingLoss', norm_embedding: bool = True):
        nn.Module.__init__(self)
        self._bert: BertModel = bert_model
        self.norm_embedding: bool = norm_embedding
        self._loss = loss

    def forward(self, token_1, token_2) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # Mean-pool the last hidden states to get one embedding per input.
        emb_1 = torch.mean(self._bert(**token_1).last_hidden_state, dim=1)
        emb_2 = torch.mean(self._bert(**token_2).last_hidden_state, dim=1)
        if self.norm_embedding:
            emb_1 = emb_1 / emb_1.norm(p=2, dim=1)[..., None]
            emb_2 = emb_2 / emb_2.norm(p=2, dim=1)[..., None]
        # Computing the loss inside forward keeps the whole graph in one DDP module.
        loss = self._loss(emb_1, emb_2)
        return emb_1, emb_2, loss


model = SiameseBert(bert_model, contrastive_loss, norm_embedding)

# Accelerate initialisation with kwargs: find_unused_parameters=True is needed
# because some BERT parameters (e.g. the pooler) never receive gradients.
accelerator = Accelerator(
    mixed_precision='fp16',
    kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)],
)
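
To double-check which parameters really are unused (and therefore require find_unused_parameters=True), a small sketch that runs a single forward/backward locally (without DDP, assuming token_1/token_2 are one tokenized batch as in the training loop above) and lists parameters that never received a gradient:

```python
# After one local forward/backward pass, parameters whose .grad is still None
# never took part in the loss (e.g. BERT's pooler when only last_hidden_state is used).
emb_1, emb_2, l = model(token_1, token_2)
l.backward()
unused = [name for name, p in model.named_parameters()
          if p.requires_grad and p.grad is None]
print(f"{len(unused)} unused parameters, e.g. {unused[:5]}")
```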