Not sure why you feel it's strange; it's difficult to conclude anything without seeing any code. However, for BERT-Large with our internal data, we are able to fit batch size 64 per GPU for seqlen 128 on a 32GB V100. https://github.com/microsoft/DeepSpeedExamples/blob/1447dd2c0224234e42024c508078fab006f3209b/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json#L3 So your BERT-base batch size of 128 seems totally reasonable to me.
Hello @conglongli, Thank you for your response to my question.
I thought it was strange since I couldn't find any experimental results with that kind of large batch size. I was trying to reproduce 'Progressive Layer Dropping' with DeepSpeed, but I found that a batch size of 16 per GPU was used to implement it with BERT-base (seqlen 128), since a higher batch size (e.g., 64) raised OOM.
I believe the internal dataset used to train BERT-large is the same one used to train BERT-base with Progressive Layer Dropping (Wikipedia + BookCorpus).
Given that, I am not sure why the BERT-base setup raised OOM with batch size > 16 while the BERT-large setup could fit batch size 64 per GPU, even though the same dataset was used and BERT-large is the larger model.
I am wondering if there is something different in the setup that I didn't notice. Could you give me some advice on this issue?
Also, I have attached some configuration files to help explain my setup. This is my DeepSpeed configuration file for training BERT-base on a 32GB V100.
{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 128,
  "steps_per_print": 1,
  "prescale_gradients": true,
  "gradient_predivide_factor": 8,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4,
      "weight_decay": 0.01,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "tensorboard": {
    "enabled": true,
    "output_path": "/home/sungho/DeepSpeedExamples/data/tensorboard/single/base",
    "job_name": "base-200epoch_scaled"
  }
}
And this is my configuration of BERT-base.
{
  "name": "bing_bert_base_adam_seq",
  "bert_token_file": "bert-base-uncased",
  "bert_model_file": "bert-base-uncased",
  "bert_model_config": {
    "vocab_size_or_config_json_file": 119547,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "initializer_range": 0.02
  },
  "data": {
    "flags": {
      "pretrain_dataset": true,
      "pretrain_type": "wiki_bc"
    },
    "mixed_seq_datasets": {
      "128": {
        "pretrain_dataset": "/home/sungho/DeepSpeedExamples/data/wikicorpus_en/train"
      },
      "512": {
        "pretrain_dataset": "data/512"
      }
    }
  },
  "mixed_seq_training": {
    "128": {
      "num_epochs": 200,
      "warmup_proportion": 0.02,
      "learning_rate": 1e-4,
      "num_workers": 0,
      "async_worker": true,
      "decay_rate": 0.99,
      "decay_step": 1000,
      "total_training_steps": 200000
    }
  },
  "validation": {
    "path": "/home/sungho/DeepSpeedExamples/data/wikicorpus_en/test"
  }
}
@tjdgh0715 I saw in the PLD tutorial that train_micro_batch_size_per_gpu 16 is used, but I don't think that means a batch size larger than 16 will definitely OOM on a 32GB V100. I confirmed with the PLD author @minjiaz: "I can't remember why I used batch size 16 per gpu, but that indeed does not indicate that a larger per gpu batch size is not possible."
Regarding your script, if you want to reproduce PLD, make sure to use the same global batch size ("train_batch_size"), since this affects convergence. The micro batch size per GPU ("train_micro_batch_size_per_gpu") can be any number that fits on your GPU, and as long as you can reproduce similar results (e.g., GLUE scores), there is nothing to worry about regarding why you don't get OOM at a larger micro batch size, or whether the code is correct.
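For illustration only (the numbers below are hypothetical, assuming a single GPU, and are not taken from the actual PLD run): DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so a 4k global batch with a micro batch of 16 could be expressed as:
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 256
}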
Thanks for answering the questions, @conglongli.
Hi Sungho,
What Conglong said makes sense. It is the global effective batch size, not the per-GPU batch size, that matters for model quality and convergence. As long as you keep the global effective batch size the same (e.g., 4k), you can use a larger per-GPU batch size if it does not cause OOM. Maybe some part of the CUDA runtime/PyTorch has been optimized such that you can now run with a larger per_gpu_batch_size? I would suggest first training with the same effective batch size and validating the GLUE results, to make sure you get similar downstream task accuracy using the NVIDIA data.
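Concretely (again a hypothetical single-GPU sketch, not values from the paper): the same 4k effective batch can be kept while raising the micro batch size, by lowering the accumulation steps accordingly:
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 64,
  "gradient_accumulation_steps": 64
}
Both decompositions perform one optimizer step per 4096 samples, so convergence should be comparable; only memory usage and throughput differ.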
Thank you for the detailed responses, @conglongli and @minjiaz. With your answers, I feel like I'm getting the hang of it now. Yes, the global effective batch size (e.g., 4k) is the dominant factor for convergence. If I can set the global batch size as used in PLD and get similar downstream task accuracy (GLUE), then a larger per-GPU batch size without OOM should not be a problem (since the feasible batch size also depends on the optimizer, e.g., LAMB, and is largely empirical). So the first step in thoroughly reproducing PLD seems to be training with the same global batch size (4k) and validating the downstream task accuracy. I agree with your suggestion.
Your answers helped me a lot in scrutinizing my training setup. I'm closing the issue now. Thank you for your support!
Hello DeepSpeed team, I am trying to reproduce BERT training with the NVIDIA dataset, since I couldn't get Microsoft's original dataset.
While checking how far the batch size could scale during pre-training, I found something that seemed strange.
I didn't use any gradient accumulation, but there was no OOM even with a per-GPU batch size of 256, which seems very odd to me.
I used the BERT-base model with the Adam optimizer, sequence length 128, and learning rate 1e-4, training on a single V100 GPU.
I suspect there is something critical that I'm missing, but I'm not sure what it is.
Do you have any advice on this issue? Alternatively, is there a way I can get the original pre-processed Wikipedia dataset that was used for pre-training BERT with DeepSpeed?
Thank you for your support!