huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Does accelerate's Distributed Training: Data Parallelism feature work on AWS SageMaker yet? #706

Closed HebaGamalElDin closed 1 year ago

HebaGamalElDin commented 2 years ago

System Info

pytorch: 1.10.2
python: 3.8

Information

Tasks

Reproduction

SageMaker multi-GPU distributed data training: during model.generate it always returns empty tensors.

Expected behavior

I'm trying to run distributed training in a SageMaker training job, but inference is not working properly. I found this listed as future work in the Hugging Face documentation, so I'm wondering if that's why it's not working yet on SageMaker multi-GPU.

Thanks
pacman100 commented 2 years ago

Hello @HebaGamalElDin, please provide a minimal reproducible example so we can dig in and help you.

HebaGamalElDin commented 2 years ago

Hello @pacman100, I'm fine-tuning a transformer model from the Hugging Face Hub. Below is the training function that uses the accelerator in SageMaker training jobs.

import torch
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import get_scheduler

# Context, postprocess_text, cer_metric and wer_metric are defined elsewhere in the script.
accelerator = Accelerator()

def train(context: Context, num_epochs):
    model = context.model
    model = accelerator.prepare(model)
    optimizer = AdamW(model.parameters(), lr=1e-3)

    num_training_steps = num_epochs * len(context.train_dataloader)
    lr_scheduler = get_scheduler(
        "linear", optimizer=optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
    )
    optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        optimizer, context.train_dataloader, context.val_dataloader, lr_scheduler
    )

    losses = []
    min_cer = 1.0
    min_train_loss = 1.0
    for epoch in range(num_epochs):
        model.train()
        for j, batch in enumerate(train_dataloader):
            inputs: torch.Tensor = batch["input"]#.to(accelerator.device)
            labels: torch.Tensor = batch["label_tensor"]#.to(accelerator.device)

            outputs = model(pixel_values=inputs, labels=labels)
            #print(outputs)
            loss = outputs.loss
            accelerator.backward(loss)
            #loss.backward()

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            losses.append(loss.detach())  # detach so stored losses don't keep the autograd graph alive
            accelerator.print(f"Epoch {epoch}-------Batch---{j}-----Loss---{loss}")

        model.eval()
        for i, batch in enumerate(eval_dataloader):
            inputs: torch.Tensor = batch["input"]#.to(accelerator.device)
            with torch.no_grad():
                predictions = accelerator.unwrap_model(model).generate(inputs)

                generated_ids = accelerator.gather(predictions).cpu().numpy()
                print(f"Generated IDs: {generated_ids}")
                labels = accelerator.gather(batch["label_tensor"]).cpu().numpy()

                generated_text = context.processor.batch_decode(generated_ids, skip_special_tokens=True)
                labels_text = context.processor.batch_decode(labels, skip_special_tokens=True)

                predictions, labels = postprocess_text(generated_text, labels_text)

                cer_metric.add_batch(predictions=predictions, references=labels)
                wer_metric.add_batch(predictions=predictions, references=labels)
                print(f"Predictions: {predictions}-----------Labels: {labels}")
        cer = cer_metric.compute()
        wer = wer_metric.compute()

        accelerator.print(f"Average CER: {cer}------ Average WER: {wer}")

The SageMaker PyTorch estimator is as follows:

from sagemaker.pytorch import PyTorch
import sagemaker

role = sagemaker.get_execution_role()
pt_estimator = PyTorch(
    base_job_name="transformer-ocr-training",
    source_dir="source",
    entry_point="Train.py",
    role=role,
    py_version="py38",
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04",
    # framework_version="1.12.0",
    instance_count=1,
    instance_type="ml.p3.16xlarge",
    # distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

pt_estimator.fit("s3://handwritten-ocr-training")

It is exactly when I generate for the evaluation set that it always returns empty tensors. What am I missing here?

HebaGamalElDin commented 2 years ago

The number of processes is 8 GPUs, so accelerate has access to all of them. However, it's not generating during validation: all decoded strings are empty. I'd appreciate your help!

pacman100 commented 2 years ago

Hello @HebaGamalElDin, you are not using the 🤗 Accelerate integration of AWS SageMaker correctly. To help you and others going forward, I have spent time creating this repo, https://github.com/pacman100/accelerate-aws-sagemaker, which details how to correctly use AWS SageMaker with 🤗 Accelerate. It works correctly with generation (model.generate). Please go through the README and files in the above repo and let us know if you still have issues.

HebaGamalElDin commented 2 years ago

Hello @pacman100, thank you for the warm help. I have one question, please. What I didn't get is how to configure accelerate inside the training job: where do I run the command accelerate config --config_file accelerate_config.yaml? And should the accelerate_config.yaml file replace the Python SDK PyTorch estimator in my case?

pacman100 commented 2 years ago

Hello, you don't have to use any SageMaker estimator (the PyTorch estimator in your case), as Accelerate internally uses the Hugging Face SageMaker Estimator (https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/launch.py#L776) along with all the necessary environment variables to handle SageMaker DDP.

Just create the accelerate config with the command accelerate config on any virtual machine, local machine, or SageMaker notebook on which you have the AWS CLI installed and AWS credentials set up. After that, when you run accelerate launch, it will internally use the HF estimator to create the training job on AWS SageMaker. I am running accelerate config and accelerate launch on a local machine with AWS credentials set up.
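For a concrete picture, the whole workflow is just the two commands below (file and script names are illustrative; Train.py is your entry point):

# One-time setup: answer the prompts, choosing "AWS (Amazon SageMaker)"
# as the compute environment.
accelerate config --config_file accelerate_config.yaml

# This builds the Hugging Face estimator internally and creates the
# SageMaker training job; no PyTorch estimator needed in your code.
accelerate launch --config_file accelerate_config.yaml Train.py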

HebaGamalElDin commented 2 years ago

@pacman100 Okay, I got that, thank you. One more question, please. I'm encountering an issue when testing: most validation batches are entirely empty, while some others are okay. This problem doesn't happen when training on 1 GPU. What could be the problem here? HINT: I'm logging the length of the text predictions coming from model.generate() for each batch, and the majority is zero, as shown in the screenshot below.

[screenshot]