Open CWHer opened 1 year ago
I think the first issue about gpt2-ddp and llama can be resolved by replacing our forked transformers with the latest official transformers.
Error of modified `train_prompts.py`

The combinations are:
- [x] `gpt2-colossalai_gemini`, `opt-colossalai_gemini`, `llama-colossalai_gemini`, `roberta-colossalai_gemini`
`RuntimeError: CUDA error: invalid argument`

Fixed by adding the following assert to `ColossalAI/applications/Chat/coati/trainer/ppo.py`:
```python
if isinstance(strategy, ColossalAIStrategy):
    from colossalai.booster.plugin import GeminiPlugin
    # Gemini manages parameter placement itself, so manually offloading the
    # inference models with model.to('cpu') is not supported.
    assert not (isinstance(strategy.plugin, GeminiPlugin) and offload_inference_models), \
        "GeminiPlugin is not compatible with manual model.to('cpu')"
```
> I think the first issue about gpt2-ddp and llama can be resolved by replacing our forked transformers with the latest official transformers.
The error of `gpt2-ddp` remains even with the official transformers lib (4.31.0.dev0).
As for the LLaMA errors, they are caused by incorrect values of `args.pretrain`; setting a proper path should solve the problem.
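As a hedged illustration (the helper below is not the actual train_prompts.py code), transformers treats an `args.pretrain` value that is not a local directory as a Hugging Face Hub repo id, which is where the "Repository Not Found" error for `tokenizer.model` comes from:

```python
import os
from transformers import LlamaTokenizer  # requires a transformers version with LLaMA support

def load_llama_tokenizer(pretrain: str) -> LlamaTokenizer:
    """Illustrative helper: sanity-check args.pretrain before loading the tokenizer."""
    if os.path.isdir(pretrain):
        # A local converted checkpoint must contain the sentencepiece file tokenizer.model.
        if not os.path.exists(os.path.join(pretrain, "tokenizer.model")):
            raise FileNotFoundError(
                f"{pretrain} contains no tokenizer.model; point args.pretrain at a "
                "directory produced by the LLaMA-to-HF conversion script"
            )
        return LlamaTokenizer.from_pretrained(pretrain)
    # Any other value is treated as a Hugging Face Hub repo id; an invalid id fails with
    # "Repository Not Found for url: https://huggingface.co/<id>/resolve/main/tokenizer.model".
    return LlamaTokenizer.from_pretrained(pretrain)
```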
- [x] `roberta-naive`, `roberta-ddp`, `roberta-colossalai_gemini`, `roberta-colossalai_zero2`
`CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`
Remove roberta support.
Errors of LLaMA are fixed by removing the following code snippet.
🐛 Describe the bug
Description
Some combinations of arguments lead to errors in `train_prompts.py`.

Details

Error of `train_prompts.py`

These errors can be reproduced by modifying `test_ci.sh` in `ColossalAI/applications/Chat/examples`. The combinations are:
`gpt2-ddp`
Earlier reported by #3421.
`RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.`
(A generic minimal reproduction of this error class is sketched after this list.)
`llama-naive`, `llama-ddp`, `llama-colossalai_gemini`, `llama-colossalai_zero2`
`Repository Not Found for url: https://huggingface.co/{...}/resolve/main/tokenizer.model.`
`roberta-naive`, `roberta-ddp`, `roberta-colossalai_gemini`, `roberta-colossalai_zero2`
`CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`
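As referenced above for `gpt2-ddp`, a generic minimal reproduction of this class of autograd error (unrelated to the actual gpt2/DDP code path) looks like this:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2
z = (y * x).sum()  # autograd saves y to compute dz/dx during backward
y += 1             # in-place update changes the saved tensor's version
z.backward()       # RuntimeError: one of the variables needed for gradient
                   # computation has been modified by an inplace operation
```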
Error of modified `train_prompts.py`

These errors can be reproduced through the following script. The combinations are:
`gpt2-colossalai_gemini`, `opt-colossalai_gemini`, `llama-colossalai_gemini`, `roberta-colossalai_gemini`
`RuntimeError: CUDA error: invalid argument`
Environment

PyTorch: 1.13.1
Colossal-AI: commit b3ab7fbabf
Transformers: commit 61f79b2986