Hello @HebaGamalElDin, please provide a minimal reproducible example for us to deep dive into and help you.
Hello @pacman100, I'm fine-tuning a transformer model from the Hugging Face Hub. Below is the training function that uses the accelerator in SageMaker training jobs.
```python
def train(context: Context, num_epochs):
    model = context.model
    model = accelerator.prepare(model)
    optimizer = AdamW(model.parameters(), lr=1e-3)
    num_training_steps = num_epochs * len(context.train_dataloader)
    lr_scheduler = get_scheduler(
        "linear", optimizer=optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
    )
    optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        optimizer, context.train_dataloader, context.val_dataloader, lr_scheduler
    )
    losses = []
    min_cer = 1.0
    min_train_loss = 1.0
    for epoch in range(num_epochs):
        model.train()
        for j, batch in enumerate(train_dataloader):
            inputs: torch.Tensor = batch["input"]  # .to(accelerator.device)
            labels: torch.Tensor = batch["label_tensor"]  # .to(accelerator.device)
            outputs = model(pixel_values=inputs, labels=labels)
            # print(outputs)
            loss = outputs.loss
            accelerator.backward(loss)
            # loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            losses.append(loss)
            accelerator.print(f"Epoch {epoch}-------Batch---{j}-----Loss---{loss}")

        model.eval()
        for i, batch in enumerate(eval_dataloader):
            inputs: torch.Tensor = batch["input"]  # .to(accelerator.device)
            with torch.no_grad():
                predictions = accelerator.unwrap_model(model).generate(inputs)
            generated_ids = accelerator.gather(predictions).cpu().numpy()
            print(f"Generated IDs: {generated_ids}")
            labels = accelerator.gather(batch["label_tensor"]).cpu().numpy()
            generated_text = context.processor.batch_decode(generated_ids, skip_special_tokens=True)
            labels_text = context.processor.batch_decode(labels, skip_special_tokens=True)
            predictions, labels = postprocess_text(generated_text, labels_text)
            cer_metric.add_batch(predictions=predictions, references=labels)
            wer_metric.add_batch(predictions=predictions, references=labels)
            print(f"Predictions: {predictions}-----------Labels: {labels}")
        cer = cer_metric.compute()
        wer = wer_metric.compute()
        accelerator.print(f"Average CER: {cer}------ Average WER: {wer}")
```
The Python estimator is as follows:
```python
from sagemaker.pytorch import PyTorch
import sagemaker

role = sagemaker.get_execution_role()

pt_estimator = PyTorch(
    base_job_name="transformer-ocr-training",
    source_dir="source",
    entry_point="Train.py",
    role=role,
    py_version="py38",
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04",
    # framework_version="1.12.0",
    instance_count=1,
    instance_type="ml.p3.16xlarge",
    # distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
)

pt_estimator.fit("s3://handwritten-ocr-training")
```
The problem happens exactly when I generate on the evaluation set: it always returns empty tensors. What am I missing here?
The number of processes is 8 GPUs, so Accelerate has access to all of them, yet it's not generating anything during validation; all decoded strings are empty. I'd appreciate your help!
Hello @HebaGamalElDin, you are not using the 🤗 Accelerate integration of AWS SageMaker correctly. To help you and others going forward, I have spent time creating this repo https://github.com/pacman100/accelerate-aws-sagemaker which details how to correctly use AWS SageMaker with 🤗 Accelerate. It works correctly with generation (`model.generate`). Please go through the README and files in the above repo and let us know if you still have issues.
Hello @pacman100, thank you for the warm help.
I have one question, please. What I didn't get is how to configure Accelerate inside the training job, meaning where to run the command `accelerate config --config_file accelerate_config.yaml`? Should the accelerate_config.yaml file replace the Python SDK PyTorch estimator in my case?
Hello, you don't have to use any SageMaker estimator (the PyTorch estimator in your case), as Accelerate internally uses the Hugging Face SageMaker estimator https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/launch.py#L776 along with all the necessary env variables to handle SageMaker DDP.
Just create the accelerate config with the command `accelerate config` on any virtual machine/local machine/SageMaker notebook on which you have the AWS CLI installed and AWS credentials set up. After that, when you run `accelerate launch`, it will internally use the HF estimator to create the training job on AWS SageMaker. I am running `accelerate config` and `accelerate launch` on a local machine with AWS credentials set up.
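(Editor's note: for concreteness, the workflow described above looks roughly like the sketch below; the file name `accelerate_config.yaml` is taken from the question above, and the canonical steps and the SageMaker-specific fields the prompts fill in (IAM role, instance type, region, image, etc.) are in the README of the linked repo.)

```
# Run once on the machine that has the AWS CLI and credentials configured;
# answer the interactive prompts to produce the SageMaker config file.
accelerate config --config_file accelerate_config.yaml

# Launch the training job; Accelerate builds the SageMaker estimator internally.
accelerate launch --config_file accelerate_config.yaml Train.py
```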
@pacman100 Okay, I got that, thank you. One more question, please. I'm encountering an issue when testing: most validation batches are entirely empty, while some others are okay. This problem doesn't happen when training on 1 GPU. What could be the problem here? HINT: I'm logging the length of the text predictions coming from model.generate() for each batch, and the majority is zero, as shown in the screenshot below.
System Info

Information

Tasks
- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction

SageMaker multi-GPU distributed data training: during evaluation, `model.generate` always returns empty tensors.

Expected behavior