hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[FEATURE]: Comparison of DeepSpeed without offload vs. ColossalAI Gemini on llama2_13b #4719

Open wangbluo opened 1 year ago

wangbluo commented 1 year ago

Describe the feature

Dear developers, a few days ago I tested the performance of your llama2 gemini-auto setup against DeepSpeed ZeRO-3 + offload. Today I tested the case where neither Gemini nor DeepSpeed uses offload, and I would like to share the results with you.

The test scripts are pretrain.py on the ColossalAI side and the Trainer training framework with DeepSpeed on the other; the DeepSpeed version is 0.10.0.

The model is llama2 13b with a sequence length of 2048; the per-device batch size is 8 on both sides; gradient accumulation for DeepSpeed is set to 1, using ZeRO-3.

For timing, I record a timestamp at the start of each step and another at the end of backward, and take the difference as the step time.
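
For reference, this is roughly how I take the measurement; a minimal sketch in plain PyTorch (the model, batch, and optimizer names are placeholders, not the actual scripts' training loops):

import time

import torch

def timed_step(model, batch, optimizer):
    # time one step from the start of forward to the end of backward
    torch.cuda.synchronize()            # make sure pending kernels are finished
    start = time.time()

    loss = model(**batch).loss          # forward
    loss.backward()                     # backward

    torch.cuda.synchronize()            # wait for the backward kernels to finish
    step_time = time.time() - start

    optimizer.step()                    # the optimizer update falls outside the timed window
    optimizer.zero_grad()
    return step_time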

With ColossalAI Gemini, pretrain.py gives a step_time of 13.5 s and uses 52834 MB of GPU memory, while benchmark.py gives a step time of about 9.8 s and 39998 MB (PS: I am still looking into the difference between the two scripts; I do not yet understand why the step time and memory usage differ this much).

DeepSpeed's step time is 24-25 s with 51438 MB of GPU memory. If DeepSpeed enables offload, the step_time is about 27 s with 26445 MB.

In other words, ColossalAI Gemini beats DeepSpeed in both step_time and GPU memory usage, whether offload is enabled or not.

The README at https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/llama2 only compares DeepSpeed ZeRO-3 + offload against ColossalAI, without a comparison where neither side uses offload, so I wanted to confirm this case with you.

I understand it is normal for DeepSpeed ZeRO-3 + offload to lose to ColossalAI, since their offload is not automatic, but I am not sure about the comparison where neither side uses offload. Have you tested that case? A configuration that does not OOM could be chosen for it.
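
For reference, here is a rough sketch of the kind of no-offload Gemini setup I have in mind, based on my reading of the llama2 example; the exact GeminiPlugin arguments (placement_policy, precision) and the launch call are assumptions, not the example's verbatim code:

import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

# launched with: torchrun --nproc_per_node=8 train.py
colossalai.launch_from_torch(config={})

model = torch.nn.Linear(1024, 1024)      # stand-in for the llama2 13b model
optimizer = HybridAdam(model.parameters(), lr=3e-4)

# "static" keeps parameters and optimizer states on GPU (no CPU offload);
# "auto" would instead let Gemini move tensors between CPU and GPU dynamically.
plugin = GeminiPlugin(placement_policy="static", precision="bf16")
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)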

If these measurements are sound, I think the results are very impressive.

The ds_config used is:

{
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 0,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": "auto",
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "last_batch_iteration": -1,
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
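
For completeness, a config like this is typically handed to the HuggingFace Trainer through TrainingArguments; a minimal sketch (the output_dir and file name are placeholders, and the actual launch script may differ):

from transformers import Trainer, TrainingArguments

# the "auto" fields in ds_config.json are resolved by the Trainer from these arguments
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    fp16=True,                      # matches the fp16 "auto" block above
    deepspeed="ds_config.json",     # path to the config shown above
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()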

FrankLeeeee commented 1 year ago

Hi, thank you for running these tests. The team member who works on this benchmark will get back to you as soon as possible :)


wangbluo commented 1 year ago

@FrankLeeeee @flybird11111 I would also like to share another scenario. As everyone knows, ColossalAI has long been committed to memory management and does it very well, which is very useful for individual enthusiasts and companies with limited budgets. But for companies that are not short of machines, the first choice for training is to use no offload at all: if memory runs out, they simply add more GPUs, the faster the better. For the sake of training quality and time, a batch size of 1024 is typically used; if that does not fit, they add machines or use gradient accumulation, but offload is not a preferred option.

If my tests above are sound, I would also suggest adding a comparison for this no-offload scenario to the README.


FrankLeeeee commented 1 year ago

> @FrankLeeeee @flybird11111 I would also like to share another scenario. As everyone knows, ColossalAI has long been committed to memory management and does it very well, which is very useful for individual enthusiasts and companies with limited budgets. But for companies that are not short of machines, the first choice for training is to use no offload at all: if memory runs out, they simply add more GPUs, the faster the better. For the sake of training quality and time, a batch size of 1024 is typically used; if that does not fit, they add machines or use gradient accumulation, but offload is not a preferred option.
>
> If my tests above are sound, I would also suggest adding a comparison for this no-offload scenario to the README.

Sounds awesome! Thanks for the suggestion.


ver217 commented 1 year ago

On our cluster (512x A100 40G), for llama2-70B, DeepSpeed without offload did not run successfully (even with flash attn 2 and gradient checkpointing enabled), so we only compared ColossalAI without offload (the fastest setting that does not OOM) against DeepSpeed with optimizer offload (the fastest setting that does not OOM).
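
As a side note, this is roughly how flash attention 2 and gradient checkpointing are enabled with HuggingFace transformers; a minimal sketch (requires a recent transformers release and the flash-attn package; the model id is illustrative and this is not the benchmark code itself):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # needs flash-attn installed
)
model.gradient_checkpointing_enable()          # recompute activations to save memory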


wangbluo commented 1 year ago

> On our cluster (512x A100 40G), for llama2-70B, DeepSpeed without offload did not run successfully (even with flash attn 2 and gradient checkpointing enabled), so we only compared ColossalAI without offload (the fastest setting that does not OOM) against DeepSpeed with optimizer offload (the fastest setting that does not OOM).

OK, thank you very much for your reply; I noted the OOM issue you mentioned. To explain why I tested both DeepSpeed and ColossalAI without offload: in our training scenario we usually train across multiple 8x A100 80G nodes, with a batch size of 1024 per run. When OOM occurs, we generally add machines rather than enable offload, and I believe other companies that are not short of machines most likely do the same; after all, enabling offload is bound to worsen the framework's step_time. So in our training scenario, we avoid offload whenever possible.
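
As an aside, the batch-size arithmetic behind this kind of setup looks roughly like this (the node count below is illustrative, not our actual configuration):

# global batch = micro_batch_per_gpu * gpus_per_node * nodes * grad_accum_steps
micro_batch_per_gpu = 8
gpus_per_node = 8
nodes = 2                        # illustrative
target_global_batch = 1024

grad_accum_steps = target_global_batch // (micro_batch_per_gpu * gpus_per_node * nodes)
print(grad_accum_steps)          # 8: accumulate 8 micro-steps, or add nodes to shrink it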

So it is very important for us to look at each framework's best-case (SOTA) performance and compare on that basis; we hope to reach a more general conclusion.

A comparison where one side has offload enabled and the other does not is unfair either way, and it cannot reveal the best-case performance we care about. That said, it is also great to see that ColossalAI has a clear advantage in both GPU memory and step_time when neither side enables offload.


flybird11111 commented 1 year ago

> On our cluster (512x A100 40G), for llama2-70B, DeepSpeed without offload did not run successfully (even with flash attn 2 and gradient checkpointing enabled), so we only compared ColossalAI without offload (the fastest setting that does not OOM) against DeepSpeed with optimizer offload (the fastest setting that does not OOM).
>
> OK, thank you very much for your reply; I noted the OOM issue you mentioned. To explain why I tested both DeepSpeed and ColossalAI without offload: in our training scenario we usually train across multiple 8x A100 80G nodes, with a batch size of 1024 per run. When OOM occurs, we generally add machines rather than enable offload, and I believe other companies that are not short of machines most likely do the same; after all, enabling offload is bound to worsen the framework's step_time. So in our training scenario, we avoid offload whenever possible.
>
> So it is very important for us to look at each framework's best-case (SOTA) performance and compare on that basis; we hope to reach a more general conclusion.
>
> A comparison where one side has offload enabled and the other does not is unfair either way, and it cannot reveal the best-case performance we care about. That said, it is also great to see that ColossalAI has a clear advantage in both GPU memory and step_time when neither side enables offload.

OK, thanks for the suggestion~
