You can run your job with torchrun, which is what Accelerate does behind the scenes. To investigate what went wrong, we need you to provide your accelerate config and the way you are launching your script.
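For example, a launch across two machines with one process each could look roughly like this (a sketch only; the script path and the rank/address values are placeholders to adapt to your cluster):

# Run on every node, with NODE_RANK and MASTER_ADDR set per machine (placeholders):
torchrun --nnodes=2 --nproc_per_node=1 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=29500 \
    jobs/training/train_nlp_model/train_nlp_model.py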
I use the default configuration; I don't use any specific configuration for the moment. I'm just getting started with accelerate. Is it required to create a specific one?
With accelerate config I only saw AWS and local machine, so I preferred setting nothing.
The job is run on a cluster. I'll do another try in order to use the pytorch distributed configuration in AzureML.
Behind the scenes, I think AzureML is running mpirun.
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2
I use the same, except that I configure my job with a yaml file following this schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
I know that when using pytorch, AzureML sets environment variables for each machine and so on; that's why I'll try torchrun.
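As a quick sanity check (a minimal sketch, assuming AzureML populates the usual torch.distributed variables as described in the doc above), each process can log what it sees before Accelerate is initialised:

import os

# Variables torch.distributed and torchrun conventionally rely on;
# print whatever the launcher actually set on this process.
for name in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "NODE_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{name}={os.environ.get(name, '<unset>')}")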
It seems that changing the yaml file fixed the first problem!
jobs:
  train_nlp_model:
    resources:
      instance_count: 2
      shm_size: 16g
    distribution:
      type: pytorch # Should work https://github.com/huggingface/accelerate/issues/1836#issuecomment-1674451105
      process_count_per_instance: 1
    code: ../../../ # Relative from yaml file in order to get the root directory.
    command: >-
      python jobs/training/train_nlp_model/train_nlp_model.py
      ...
But it generates another one:
It seems that when (and only when) we are in MultiGPU, there is an error because of an inplace operation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 50]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
3d61bf228f9c4ef587f2aecd4d30901d000003:15:15 [0] NCCL INFO comm 0x14c98dc95010 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
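For reference, this class of error generally corresponds to a pattern like the following (a toy example, unrelated to the actual model, only to show what the message means):

import torch

x = torch.ones(3, requires_grad=True)
y = torch.exp(x)    # exp saves its output tensor for the backward pass
y += 1              # in-place modification of that saved tensor
y.sum().backward()  # raises: "one of the variables needed for gradient computation has been modified by an inplace operation"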
I tried running the job locally on a small GPU to debug with the anomaly-detection context enabled... No anomaly is detected...
Here is the piece of code that is involved:
with torch.autograd.set_detect_anomaly(True):
    with accelerator.autocast():
        optimizer.zero_grad()
        token_1, token_2 = batch
        # We compute embeddings
        emb_anchor = torch.mean(model(**token_1).last_hidden_state, dim=1)
        emb_pos = torch.mean(model(**token_2).last_hidden_state, dim=1)
        if norm_embedding:
            emb_anchor = emb_anchor / emb_anchor.norm(p=2, dim=1)[..., None]
            emb_pos = emb_pos / emb_pos.norm(p=2, dim=1)[..., None]
        contrastive_loss = loss(emb_anchor, emb_pos)
        accelerator.backward(contrastive_loss)
        optimizer.step()
Locally:
Destination Directory - tmp/SBERT_CONTRASTIVE
Epoch 1/10 ----------------------
Training - SBERT: 0%| | 2/21695 [00:12<37:24:47, 6.21s/it, ContrastiveLoss=1.03]
It seems that the first problem is fixed by moving to pytorch instead of mpi.
But the second one is trickier.
I have a siamese network and it seems that I face these issues from the transformers library:
I have no idea how to fix it, and it's very hard to debug as it is something specific to multi-GPU implementations.
It seems I've found a solution for the second issue... I need to wrap the two forward passes and the loss in a single nn.Module and pass find_unused_parameters=True through DistributedDataParallelKwargs.
It seems that in multiGPU, pytorch is less tolerant of unused model parameters, which is my case because I only use the last_hidden_state of the bert model.
Here is an example of what it does:
# Siamese model encapsulating the backbone and the loss.
# We create a unique module (important for MultiGPU)
# https://github.com/pytorch/pytorch/issues/62474#issuecomment-918891965
class SiameseBert(nn.Module):
    def __init__(self, bert_model: BertModel, loss: ContrastiveRollingLoss, norm_embedding: bool = True):
        nn.Module.__init__(self)
        self._bert: BertModel = bert_model
        self.norm_embedding: bool = norm_embedding
        self._loss = loss

    def forward(self, token_1, token_2) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        emb_1 = torch.mean(self._bert(**token_1).last_hidden_state, dim=1)
        emb_2 = torch.mean(self._bert(**token_2).last_hidden_state, dim=1)
        if self.norm_embedding:
            emb_1 = emb_1 / emb_1.norm(p=2, dim=1)[..., None]
            emb_2 = emb_2 / emb_2.norm(p=2, dim=1)[..., None]
        loss = self._loss(emb_1, emb_2)
        return emb_1, emb_2, loss

model = SiameseBert(bert_model, contrastive_loss, norm_embedding)

# Accelerate initialisation with kwargs
accelerator = Accelerator(
    mixed_precision='fp16',
    kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)],
)
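For completeness, a rough sketch of how the wrapped module can then be driven (train_dataloader and optimizer here are placeholders, not my exact code):

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for batch in train_dataloader:
    optimizer.zero_grad()
    token_1, token_2 = batch
    # Both bert passes and the loss now happen inside a single DDP forward call.
    emb_1, emb_2, contrastive_loss = model(token_1, token_2)
    accelerator.backward(contrastive_loss)
    optimizer.step()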
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Steps to reproduce the behavior:
1. Create a running training job without accelerate.
2. Run it in AzureML with a commandJob (sdk v2.0).
3. Modify the training job in order to be compatible with accelerate.
4. Edit the job's yaml file in order to run it with MPI (it's the only compatible way to launch an accelerate job in AzureML as far as I know).
At the beginning of the job, I create my dataset and prepare things to run in MultiGPU:
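As a rough sketch of what that preparation step typically looks like with Accelerate (train_dataset, batch_size, collate_fn, model and optimizer are placeholders, not my real code):

from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
# prepare() moves everything to the right device and, in multi-GPU runs,
# wraps the model in DDP and shards the dataloader across processes.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)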
If I run it on a single node without MPI:
It runs correctly and consumes only 714 MB of GPU memory.