Forgot to attach my ds_config.json:
```json
{
  "train_batch_size": 96,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0002,
      "betas": [0.5, 0.999],
      "eps": 1e-8
    }
  },
  "steps_per_print": 10,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_gather_16bit_weights_on_model_save": true,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 2e8,
    "stage3_prefetch_bucket_size": 2e7,
    "stage3_param_persistence_threshold": 1e6
  }
}
```
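For context, a config like this can be handed to deepspeed.initialize() either as a file path or as a dict. Below is a minimal, self-contained sketch of that wiring; the nn.Linear is just a placeholder, not my actual model:

```python
# Minimal sketch of wiring a DeepSpeed JSON config into deepspeed.initialize().
# The nn.Linear below is a placeholder, not the actual model from this issue.
import json

import deepspeed
import torch.nn as nn

with open("ds_config.json") as f:
    ds_config = json.load(f)

model = nn.Linear(1024, 1024)  # placeholder module

# deepspeed.initialize builds the ZeRO-3 engine and the Adam optimizer
# described in the config; no separate torch optimizer is created here.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```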
The reason for writing such a customized model is that I cannot use two deepspeed.initialize() calls to initialize the two models separately. A very similar issue can be found here: https://github.com/microsoft/DeepSpeed/issues/3472#issuecomment-1574202568, and I tried to implement one of the suggestions there.
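To make that concrete, what I tried is roughly the following. This is a simplified sketch, not my exact code: the toy speculator, the placeholder base model, and freezing the base model are all assumptions for illustration.

```python
# Sketch of the single-engine workaround: wrap both models in one nn.Module
# so that only one deepspeed.initialize() call is needed. The toy classes
# below are stand-ins for the real 123B model and speculator.py.
import deepspeed
import torch.nn as nn


class ToySpeculator(nn.Module):
    """Stand-in for the real speculator: one small head per extra future token."""

    def __init__(self, hidden: int, vocab: int, n_predict: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(n_predict))

    def forward(self, hidden_states):
        # One set of logits per future position (n+2, n+3, ...).
        return [head(hidden_states) for head in self.heads]


class CombinedModel(nn.Module):
    """Wrap the base LM and the trainable speculator in a single module
    so one deepspeed.initialize() call covers both."""

    def __init__(self, base_lm: nn.Module, speculator: nn.Module):
        super().__init__()
        self.base_lm = base_lm
        self.speculator = speculator
        for p in self.base_lm.parameters():  # assumption: the base LM stays frozen
            p.requires_grad = False

    def forward(self, inputs):
        hidden_states = self.base_lm(inputs)  # stand-in; the real model returns hidden states
        return self.speculator(hidden_states)


base_lm = nn.Linear(4096, 4096)          # placeholder for the real base model
speculator = ToySpeculator(4096, 32000)  # placeholder for speculator.py
combined = CombinedModel(base_lm, speculator)

engine, _, _, _ = deepspeed.initialize(
    model=combined,
    model_parameters=[p for p in combined.parameters() if p.requires_grad],
    config="ds_config.json",
)
```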
**Describe the bug**
In my implementation, I combine a large language model with a speculator model; my goal is to train the speculator to better predict the n+2, n+3, ... tokens. I have read the DeepSpeed docs, and I understand that it supports any custom model built on top of nn.Module, but I hit a CUDA OOM error when initializing this custom model with deepspeed.initialize().
**To Reproduce**
Here is my main code:
The speculator.py is adapted from https://github.com/foundation-model-stack/fms-extras/blob/main/fms_extras/models/speculator.py.
My command is
The error is (from one rank):
**Expected behavior**
I found other code suggesting that a full Mistral-Large-2407 (a 123B model) can be trained on 12 nodes of 8x80GB GPUs with DeepSpeed ZeRO-3, and there was still plenty of free GPU memory. So when I add this speculator, which is a very small custom model on top, I would not expect a CUDA OOM error.
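As a rough back-of-envelope check (my own estimate, not measured numbers): with ZeRO-3 sharding over 96 GPUs, the parameter and gradient shards of a 123B bf16 model are only a few GB per GPU, which is why I expect plenty of headroom for a small speculator:

```python
# Rough back-of-envelope for ZeRO-3 sharding of a 123B-parameter model across
# 12 nodes x 8 GPUs. These are my own assumptions, not measured numbers;
# activations, communication buckets and temporary all-gathers are not counted.
n_params = 123e9
world_size = 12 * 8  # 96 ranks

bf16_bytes = 2
per_gpu_params = n_params * bf16_bytes / world_size  # ~2.6 GB of bf16 weights per GPU
per_gpu_grads = n_params * bf16_bytes / world_size   # ~2.6 GB of bf16 gradients per GPU

# fp32 master weights + two Adam moments (4 + 4 + 4 bytes per param) sit in
# host RAM because of "offload_optimizer": {"device": "cpu"}.
cpu_optim_per_rank = n_params * 12 / world_size      # ~15 GB of CPU memory per rank

print(f"bf16 params per GPU: {per_gpu_params / 1e9:.1f} GB")
print(f"bf16 grads per GPU:  {per_gpu_grads / 1e9:.1f} GB")
print(f"Adam states per rank (CPU): {cpu_optim_per_rank / 1e9:.1f} GB")
```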
**ds_report output**