hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0
38.82k stars 4.35k forks

[chat]: bugs of Coati's train_prompts.py #4023

Open CWHer opened 1 year ago

CWHer commented 1 year ago

🐛 Describe the bug

Description

Some combinations of arguments cause train_prompts.py to fail.
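
To make the failing cases easier to track, the model/strategy pairs can be enumerated programmatically. This is only a sketch: the lists below contain just the names mentioned in this thread (the real script may accept more), and `all_combinations` is a hypothetical helper, not part of Coati.

```python
from itertools import product

# Names taken from the labels used in this thread (e.g. "gpt2-ddp").
# Assumption: the actual script may support additional models or strategies.
MODELS = ["gpt2", "opt", "llama", "roberta"]
STRATEGIES = ["naive", "ddp", "colossalai_gemini", "colossalai_zero2"]


def all_combinations():
    """Return a label such as 'gpt2-ddp' for every model/strategy pair."""
    return [f"{model}-{strategy}" for model, strategy in product(MODELS, STRATEGIES)]


if __name__ == "__main__":
    for label in all_combinations():
        print(label)
```

Each label corresponds to one run of train_prompts.py with that model and strategy, so the list can double as a checklist.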

Details

Environment

ver217 commented 1 year ago

I think the first issue, about gpt2-ddp and llama, can be resolved by replacing our forked transformers with the latest official transformers.

CWHer commented 1 year ago

  • Error of modified train_prompts.py

    The combinations are:

    • [x] gpt2-colossalai_gemini, opt-colossalai_gemini, llama-colossalai_gemini, roberta-colossalai_gemini: RuntimeError: CUDA error: invalid argument

Fixed by adding the following assert to ColossalAI/applications/Chat/coati/trainer/ppo.py.

if isinstance(strategy, ColossalAIStrategy):
    from colossalai.booster.plugin import GeminiPlugin
    assert not (isinstance(strategy.plugin, GeminiPlugin) and offload_inference_models), \
        "GeminiPlugin is not compatible with manual model.to('cpu')"
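
The behaviour of this guard can be checked in isolation. The sketch below uses stand-in classes instead of the real `ColossalAIStrategy` and `GeminiPlugin` so that it runs without ColossalAI installed; `OtherPlugin` and `check_offload_compat` are hypothetical names introduced only for illustration.

```python
class GeminiPlugin:
    """Stand-in for colossalai.booster.plugin.GeminiPlugin (assumption)."""


class OtherPlugin:
    """Hypothetical non-Gemini plugin, used only for this illustration."""


class ColossalAIStrategy:
    """Stand-in for Coati's strategy class; only the `plugin` attribute matters here."""

    def __init__(self, plugin):
        self.plugin = plugin


def check_offload_compat(strategy, offload_inference_models):
    # Same condition as the assert above, wrapped in a function for testing.
    if isinstance(strategy, ColossalAIStrategy):
        assert not (isinstance(strategy.plugin, GeminiPlugin) and offload_inference_models), \
            "GeminiPlugin is not compatible with manual model.to('cpu')"


if __name__ == "__main__":
    check_offload_compat(ColossalAIStrategy(OtherPlugin()), True)    # passes
    check_offload_compat(ColossalAIStrategy(GeminiPlugin()), False)  # passes
    try:
        check_offload_compat(ColossalAIStrategy(GeminiPlugin()), True)
    except AssertionError as err:
        print("rejected:", err)
```

Note that `assert` statements are skipped when Python runs with `-O`, so an explicit `raise ValueError(...)` may be preferable if the check must always be enforced.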

CWHer commented 1 year ago

> I think the first issue, about gpt2-ddp and llama, can be resolved by replacing our forked transformers with the latest official transformers.

The gpt2-ddp error persists even with the official transformers library (4.31.0.dev0).

CWHer commented 1 year ago

As for the LLAMA errors, they are caused by an incorrect value of args.pretrain.

https://github.com/hpcaitech/ColossalAI/blob/31dc302017ff491a36088dd27ed4c76e11d5b5b7/applications/Chat/examples/train_prompts.py#L126-L127

I believe setting args.pretrain to a valid path solves this problem.
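
A check like the following could surface a bad path before the tokenizer is loaded. It is a minimal sketch under the assumption that the checkpoint is a local directory (a model id from a hub would need different handling), and `check_pretrain_path` is a hypothetical helper that does not exist in train_prompts.py.

```python
import os


def check_pretrain_path(pretrain):
    """Fail early if `pretrain` is not a usable local checkpoint directory."""
    # Assumption: the value comes from args.pretrain and may be None or empty.
    if not pretrain:
        raise ValueError("args.pretrain must be set for this model")
    # Assumption: the checkpoint lives in a local directory.
    if not os.path.isdir(pretrain):
        raise FileNotFoundError(f"pretrain path does not exist: {pretrain!r}")
    return pretrain
```

Calling it right after argument parsing, for example `check_pretrain_path(args.pretrain)`, turns a confusing downstream failure into an explicit message about the path.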

CWHer commented 1 year ago

  • [x] roberta-naive, roberta-ddp, roberta-colossalai_gemini, roberta-colossalai_zero2: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Resolved by removing roberta support.

CWHer commented 1 year ago

Errors of LLAMA are fixed by removing the following code snippet.

https://github.com/hpcaitech/ColossalAI/blob/edd75a59eada232a7d093b070e4ec7bfd81f31c3/applications/Chat/examples/train_prompts.py#L132-L135