hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0
38.63k stars 4.33k forks

[BUG]: Incremental pre-training with llama2 fails #4578

Open zryowen123 opened 1 year ago

zryowen123 commented 1 year ago

🐛 Describe the bug

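I am using the code in examples/language/llama2 to pre-train llama2-70b. Running benchmark.py directly through gemini.sh works. However, I want to do incremental (continued) pre-training starting from an already trained model. The training arguments are identical to those in gemini.sh; the only change is the following code, which loads the existing model:

    with init_ctx:
        # model = LlamaForCausalLM(config)    (original line)
        model = LlamaForCausalLM.from_pretrained(args.model_path)

Running gemini.sh then fails with an OOM error:

    outputs = model(**batch)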

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 247, in forward
    outputs = self.module(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 405, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 89, in forward
    return self.weight * hidden_states.to(input_dtype)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/colo_parameter.py", line 63, in __torch_function__
    new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 82, in pre_op
    ColoParamOpHookManager._trigger_pre_forward(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 63, in _trigger_pre_forward
    hook.pre_forward(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_hook.py", line 47, in pre_forward
    self.pre_op(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_hook.py", line 35, in pre_op
    self._chunk_manager.access_chunk(chunk)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/manager.py", line 110, in access_chunk
    self.__add_accessed_chunk(chunk)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/manager.py", line 246, in __add_accessed_chunk
    chunk.access_chunk()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/chunk.py", line 359, in access_chunk
    self.__gather()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/chunk.py", line 480, in __gather
    dist.all_gather(gather_list, self.cuda_shard, self.torch_pg)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 3; 79.33 GiB total capacity; 73.44 GiB already allocated; 469.81 MiB free; 77.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Could the developers please take a look at what is going on here?

Environment

No response

Fridge003 commented 1 year ago

Hello, GeminiPlugin does not currently support loading a model with from_pretrained. Please keep the line model = LlamaForCausalLM(config) unchanged. After booster.boost has wrapped the model on line 219, use booster.load_model(model, model_checkpoint_path) to load the model from the checkpoint. We will support from_pretrained soon, so please watch for our updates.
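
The ordering described above (wrap the model first, then load weights into the wrapped model) can be sketched as a small helper. This is only an illustration, not code from the example script: the helper name boost_then_load is hypothetical, and it assumes the booster object exposes boost and load_model as mentioned in the reply, with boost returning a 5-tuple of (model, optimizer, criterion, dataloader, lr_scheduler).

```python
def boost_then_load(booster, model, optimizer, dataloader, checkpoint_path):
    """Wrap the model with booster.boost, then load weights into the wrapped model.

    Hypothetical helper: `booster` is assumed to behave like ColossalAI's Booster.
    """
    # Step 1: let the plugin (e.g. Gemini) wrap the freshly constructed model.
    # boost is assumed to return (model, optimizer, criterion, dataloader, lr_scheduler).
    model, optimizer, _, dataloader, _ = booster.boost(
        model, optimizer, dataloader=dataloader
    )
    # Step 2: only now load the checkpoint, so the weights go into the wrapped model.
    booster.load_model(model, checkpoint_path)
    return model, optimizer, dataloader
```

In the training script this would be called once, after the model has been built from config, with checkpoint_path pointing at the saved checkpoint folder.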

zryowen123 commented 1 year ago

OK, thank you!

huxiaosheng123 commented 1 year ago

You can refer to: https://github.com/huxiaosheng123/open-llama2

lulia0228 commented 1 year ago

Does this load checkpoints in HF (Hugging Face) format, or some other format? Are the checkpoints saved here also in HF format?

Fridge003 commented 1 year ago

Hello, our checkpoint folder has the same format as Hugging Face's. There is currently a bug, though, so the model has to be loaded with booster.load_model. The bug will be fixed within the next couple of days, after which from_pretrained can load a model directly from our checkpoint folder.

lulia0228 commented 1 year ago

Great!

wangbluo commented 1 year ago

Hello, I have run into the same problem over the last few days. Has this bug been fixed?

flybird11111 commented 1 year ago

Hi, when loading the model with from_pretrained, please disable lazyinit.

wangbluo commented 1 year ago

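Hello, I downloaded the latest code, but I am not sure I followed your steps correctly, so I would appreciate it if you could take a look.

I ran https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/llama2/pretrain.py and changed the model so that it is loaded from a Hugging Face pre-trained model path, but it always runs out of memory (OOM).

This is how I modified it:

image

During the forward pass, memory usage suddenly spikes to 100% and then the run crashes.

e4214ae15fdcc977236fd44bdc4b449f1011782f

I am quite puzzled, because as I understand it, loading a pre-trained model and random initialization should amount to the same thing. Looking forward to your reply.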

Some logs that may help:

Loading checkpoint shards: 100%|██████████| 3/3 [00:15<00:00, 5.21s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /data0/pkg/Huggingface/CodeLlama-13b-Instruct-hf and are newly initialized: ['model.layers.33.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.39.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.32.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.34.self_attn.rotary_emb.inv_freq', 'model.layers.36.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.37.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.38.self_attn.rotary_emb.inv_freq', 'model.layers.35.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Booster init max CUDA memory: 6990.28 MB
Booster init max CPU memory: 37040.80 MB
Epoch 0: 0%| | 0/116315 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/workfile/ColossalAI-main/examples/language/llama2/pretrain.py", line 281, in <module>
    main()
  File "/workspace/workfile/ColossalAI-main/examples/language/llama2/pretrain.py", line 256, in main
    outputs = model(**batch)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/colossalai-0.3.2-py3.9.egg/colossalai/zero/gemini/gemini_ddp.py", line 246, in forward
    outputs = self.module(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 330, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
  File "/usr/local/lib/python3.9/site-packages/colossalai-0.3.2-py3.9.egg/colossalai/tensor/colo_tensor.py", line 90, in __torch_function__
    ret = func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 3; 79.20 GiB total capacity; 77.09 GiB already allocated; 127.56 MiB free; 77.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

flybird11111 commented 1 year ago

How many billion parameters (B) does your model have?

wangbluo commented 1 year ago

Hello, I am using Llama 13B with the batch size set to 1. I have run your pretrain.py as-is and it works; with that script the 13B llama2 model can run with a batch size of up to 64.

flybird11111 commented 1 year ago

pretrain.py builds the model from a config and uses lazyinit, which reduces memory usage. However, lazyinit is currently incompatible with Hugging Face's from_pretrained method.
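
One way to act on this is to pick the initialization context depending on how the model is created. The sketch below is an assumption-laden illustration rather than the script's actual code: choose_init_ctx is a hypothetical helper, and lazy_ctx_factory stands for whatever callable builds the lazy-init context in the training script.

```python
from contextlib import nullcontext


def choose_init_ctx(load_pretrained, lazy_ctx_factory):
    """Return a no-op context for from_pretrained, a lazy-init context otherwise.

    Hypothetical helper: `lazy_ctx_factory` is any zero-argument callable that
    builds the lazy-init context used when the model is created from a config.
    """
    if load_pretrained:
        # from_pretrained is reported above as incompatible with lazyinit,
        # so the model is created under a context that does nothing.
        return nullcontext()
    # Building from a config: keep lazy initialization to save memory.
    return lazy_ctx_factory()
```

The script would then use `with choose_init_ctx(...):` around model creation. Note that without lazyinit each process materializes the full model, so host and device memory usage at start-up is higher.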

wangbluo commented 1 year ago

Hello, as the code above shows, I have already commented out `with init_ctx:`, so my model is not initialized through lazyinit. Do you mean that not using lazyinit is what causes the OOM, because no GPU memory is saved?

Alternatively, do you have a script that works with a pre-trained model?

flybird11111 commented 1 year ago

This is the pre-training script. Do you want to do fine-tuning?

wangbluo commented 1 year ago

Hello, what I mean is being able to load a pre-trained model. In your pre-training script, isn't the model created with `with init_ctx: model = LlamaForCausalLM(config)`? We would like to use the pre-trained model downloaded from Hugging Face directly. As far as I can tell, that is the only thing missing from the script.

wangbluo commented 1 year ago

I saw your reply saying that a pre-trained model downloaded from Hugging Face can now be loaded, like this:

    model = LlamaForCausalLM.from_pretrained('/data0/pkg/Huggingface/CodeLlama-13b-Instruct-hf', cache_dir=None, torch_dtype=torch.float16)

rather than model = LlamaForCausalLM(config). We want to fine-tune directly on top of the Hugging Face pre-trained model. With model = LlamaForCausalLM(config), wouldn't training start from scratch?

Fridge003 commented 1 year ago

Hello, from_pretrained should work without major problems now. If you still hit OOM, I suggest trying the parameters in gemini_auto.sh, because gemini_auto.sh automatically monitors GPU memory usage to prevent OOM.

If you keep using gemini.sh, you can pass the two parameters offload_optim_frac and offload_param_frac when initializing GeminiPlugin on line 156 of pretrain.py, and increase their values. These two parameters offload part of the model parameters and optimizer states to the CPU before the forward pass, which avoids OOM on the GPU.

If it still runs out of memory, you can wait for PR #4673 (https://github.com/hpcaitech/ColossalAI/pull/4673), which we expect to land in the next couple of days. That PR will provide a Llama fine-tuning script.
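
As a rough sketch of the offload suggestion, the helper below only assembles keyword arguments; it does not create the plugin. The helper name and the fraction values in the usage note are hypothetical, and the placement_policy="static" entry is an assumption (the reply names only offload_optim_frac and offload_param_frac), so check it against the GeminiPlugin version you have installed.

```python
def gemini_offload_kwargs(offload_optim_frac, offload_param_frac):
    """Build keyword arguments for GeminiPlugin with CPU offloading enabled.

    Hypothetical helper. Each fraction is the share (0.0 to 1.0) of optimizer
    states or model parameters to keep on the CPU instead of the GPU.
    """
    fractions = {
        "offload_optim_frac": offload_optim_frac,
        "offload_param_frac": offload_param_frac,
    }
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    # Assumption: the fractions are used together with the static placement policy.
    return {"placement_policy": "static", **fractions}
```

A call such as `GeminiPlugin(**gemini_offload_kwargs(1.0, 0.5), ...)` would then replace the plugin construction in pretrain.py. Larger fractions trade GPU memory for extra CPU memory and host-device transfer time.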

wangbluo commented 1 year ago

Hello, thank you for the suggestion. With it, the run indeed no longer hits OOM. We look forward to compatibility with lazyinit in the future.

Note: we currently use DeepSpeed for our training script. The 13B llama model runs with a batch size of 64 plus gradient accumulation = 8. That comes to 200s per iteration, or 25s after dividing by 8.

With Gemini, using your pre-training script directly with the same configuration, one iteration takes about 15-16s, which is considerably faster than DeepSpeed. The only problem is that in our tests we still cannot load the pre-trained model; trying booster.load_model produces very strange logs, like this:

image

If this issue and gradient accumulation are resolved, I think Gemini could fully replace DeepSpeed. We are really looking forward to your follow-up updates.
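
The timing comparison in the note above works out as follows. This is just the arithmetic from the comment, under the assumption that one Gemini iteration does the same amount of work as one DeepSpeed micro-step (batch size 64, no accumulation).

```python
# Figures quoted in the comment above.
deepspeed_iter_s = 200.0      # one DeepSpeed iteration, including accumulation
grad_accum_steps = 8          # gradient accumulation steps per iteration
gemini_step_s = (15.0, 16.0)  # reported range for one Gemini iteration

# Time per micro-step on DeepSpeed: 200 / 8 = 25 seconds.
deepspeed_micro_step_s = deepspeed_iter_s / grad_accum_steps

# Relative speed of Gemini for the fast and slow end of the reported range.
speedups = tuple(deepspeed_micro_step_s / t for t in gemini_step_s)

print(f"DeepSpeed micro-step: {deepspeed_micro_step_s:.1f}s")
print(f"Gemini speedup: {min(speedups):.2f}x to {max(speedups):.2f}x")
```

On these numbers Gemini is roughly 1.6 times faster per step, which matches the commenter's impression that it is considerably faster.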

FrankLeeeee commented 1 year ago

Thank you for the suggestions. The team will work through these step by step in the near future. If you run into any problems, please open an issue :)

Fridge003 commented 11 months ago

@wangbluo Hello, Gemini's support for gradient accumulation is now complete. For usage, see docs/source/en/features/gradient_accumulation_with_booster.md (English) or docs/source/zh-Hans/features/gradient_accumulation_with_booster.md (Chinese). Our online tutorials will also be updated in a few days.
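
For readers unfamiliar with the technique itself, here is a framework-free toy sketch of gradient accumulation. It is not the ColossalAI API (see the documents referenced above for that); the functions and the one-parameter model are made up for illustration. It shows that averaging the gradients of equally sized micro-batches gives the same gradient as one large batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps):
    """Accumulate micro-batch gradients, scaling each by 1 / accum_steps."""
    if len(xs) % accum_steps != 0:
        raise ValueError("batch size must be divisible by accum_steps")
    micro = len(xs) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        part = slice(i * micro, (i + 1) * micro)
        # Each micro-batch contributes its mean gradient divided by accum_steps,
        # so the sum equals the mean gradient over the full batch.
        total += grad_mse(w, xs[part], ys[part]) / accum_steps
    return total
```

The optimizer step is taken once, after all micro-batches have contributed, which is why a batch of 64 with 8 accumulation steps behaves like an effective batch of 512.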