huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

finetuning without trainer seems not data parallel when using deepspeed? #16125

Closed · wenlai-lavine closed this issue 2 years ago

wenlai-lavine commented 2 years ago

Hi, I am trying to use DeepSpeed to fine-tune a model, but it seems the data are not parallelized across GPUs when running with DeepSpeed.

I have written a toy script to reproduce the issue, using 100 sentences with batch_size=4, so the dataloader has 25 batches on one GPU. When I try multiple GPUs, the dataloader size is still 25, which means the loop still runs 25 times per process. Shouldn't the data be split across GPUs? For example, with 5 GPUs here, shouldn't each process only need to loop 5 times instead of 25?

I've only recently started using DeepSpeed and am not very familiar with it yet, so sorry for the basic question. I hope someone can give me some suggestions. @stas00

The script to reproduce is as follows:

import numpy as np
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.optim import Adam
import deepspeed
from transformers.deepspeed import HfDeepSpeedConfig
from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer, DataCollatorForSeq2Seq

ds_config = {
    "gradient_accumulation_steps": 1,
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 4,
    "wall_clock_breakdown": True,
    "steps_per_print": 2000,

    "fp16": {
        "enabled": False,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

   "zero_optimization": {
       "stage": 2,
       "allgather_partitions": True,
       "allgather_bucket_size": 500000000,
       "overlap_comm": True,
       "reduce_scatter": True,
       "reduce_bucket_size": 500000000,
       "contiguous_gradients": False,
       "cpu_offload": True
   },

   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 3e-5,
       "warmup_num_steps": 500
     }
   }
}

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="ro")

raw_datasets = load_dataset('wmt16', 'ro-en')
source_lang = 'en'
target_lang = 'ro'

def preprocess_function(examples):
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, padding=True, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, padding=True, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_datasets = raw_datasets['train'].select(range(100))
train_dataset = train_datasets.map(
                preprocess_function,
                batched=True,
                remove_columns=raw_datasets["train"].column_names,
                desc="Running tokenizer on train dataset",
            )

label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
)

train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator, batch_size=4)  # plain DataLoader: not sharded across ranks
dschf = HfDeepSpeedConfig(ds_config)
deepspeed.init_distributed()

model_engine, optimizer, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config_params=ds_config)

all_loss = []

for batch_step, batch in enumerate(train_dataloader):
    print(str(batch_step))
    batch['input_ids'] = batch['input_ids'].cuda()
    batch['attention_mask'] = batch['attention_mask'].cuda()
    batch['labels'] = batch['labels'].cuda()
    outputs = model_engine(**batch)
    loss = outputs.loss
    print(loss)
    all_loss.append(loss.item())  # store a python float so np.mean works below
    model_engine.backward(loss)
    model_engine.step()

print('total: ' + str(np.mean(all_loss)))

stas00 commented 2 years ago

Hi @lavine-lmu,

Your question needs to be asked on the https://github.com/microsoft/DeepSpeed side, since you're not using the HF/DS integration but writing your own training loop.

Also, you don't need HfDeepSpeedConfig unless you use ZeRO-3. That's the only case where its functionality is needed: it tells from_pretrained to load the model directly onto the GPUs.
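
For completeness, here is a minimal sketch of the ZeRO-3 case (assuming a hypothetical ds_config_zero3 dict like the config above but with "stage": 3). The HfDeepSpeedConfig object has to be created before from_pretrained and kept alive so the weights are loaded directly onto the GPUs:

import deepspeed
from transformers import M2M100ForConditionalGeneration
from transformers.deepspeed import HfDeepSpeedConfig

# ds_config_zero3: same config as above, but with "zero_optimization": {"stage": 3, ...}
dschf = HfDeepSpeedConfig(ds_config_zero3)  # must exist (and stay referenced) before from_pretrained
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config_zero3,
)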

To give you a hint: your dataloader is unaware of DDP, so you either need to use a DeepSpeed dataloader or code it properly for DDP yourself. You can see how this is done in the HF Trainer here:

https://github.com/huggingface/transformers/blob/8f3ea7a1e1a85e80210b3d4423b674d9a61016ed/src/transformers/trainer.py#L677-L684
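
To illustrate, a minimal sketch of a DDP-aware dataloader using torch's DistributedSampler (reusing train_dataset and data_collator from your script) would be:

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each rank only sees its own shard: with 100 examples, batch_size=4 and 5 GPUs,
# len(train_dataloader) becomes 5 per process instead of 25.
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
    shuffle=True,
)
train_dataloader = DataLoader(
    train_dataset,
    sampler=train_sampler,    # replaces shuffle=True
    collate_fn=data_collator,
    batch_size=4,
)

Note that you would normally call train_sampler.set_epoch(epoch) at the start of each epoch so the shuffling order differs between epochs.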

wenlai-lavine commented 2 years ago

@stas00 Thanks, the problem was solved after I switched to the DeepSpeed dataloader.
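
For anyone finding this later, a minimal sketch of that approach: let deepspeed.initialize build the sharded dataloader by passing training_data and collate_fn (reusing model, train_dataset, data_collator and ds_config from the snippet above):

import numpy as np
import deepspeed

# When given training_data, deepspeed.initialize returns a dataloader that is
# already sharded across ranks, so each process only iterates over its own batches.
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=train_dataset,
    collate_fn=data_collator,
    config_params=ds_config,
)

all_loss = []
for batch_step, batch in enumerate(train_dataloader):
    # move the batch to this rank's device instead of a bare .cuda()
    batch = {k: v.to(model_engine.device) for k, v in batch.items()}
    loss = model_engine(**batch).loss
    all_loss.append(loss.item())
    model_engine.backward(loss)
    model_engine.step()

print('total: ' + str(np.mean(all_loss)))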