[chat]: bugs of Coati's train_prompts.py

CWHer commented 1 year ago

🐛 Describe the bug

Description

Some combinations of arguments cause train_prompts.py to fail.

Details

Error of train_prompts.py

These errors can be reproduced by modifying test_ci.sh in ColossalAI/applications/Chat/examples.

The failing combinations are:
- [ ] gpt2-ddp
Reported earlier in #3421.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.
- [x] llama-naive llama-ddp llama-colossalai_gemini llama-colossalai_zero2
```
# FIXME: this causes the error
tokenizer = LlamaTokenizer.from_pretrained(args.pretrain)
```
Repository Not Found for url: https://huggingface.co/{...}/resolve/main/tokenizer.model.
- [x] roberta-naive roberta-ddp roberta-colossalai_gemini roberta-colossalai_zero2
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Error of modified train_prompts.py

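These errors can be reproduced with the following Python script (reproduce_error.py) and the shell script after it, which launches the Python script once per model.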

import argparse

from coati.models.bloom import BLOOMActor
from coati.models.gpt import GPTActor
from coati.models.llama import LlamaActor
from coati.models.opt import OPTActor
from coati.models.roberta import RoBERTaActor
from coati.trainer.strategies import ColossalAIStrategy

from colossalai.nn.optimizer import HybridAdam

def main(args):
  initializer_dict = {
      'gpt': lambda: GPTActor(),
      'bloom': lambda: BLOOMActor(),
      'opt': lambda: OPTActor(),
      'llama': lambda: LlamaActor(),
      'roberta': lambda: RoBERTaActor(),
  }
  initializer = initializer_dict[args.model]
  strategy = ColossalAIStrategy(stage=3, placement_policy='cuda', initial_scale=2**5)

  with strategy.model_init_context():
      # configure model
      actor = initializer()

  # configure optimizer
  actor_optim = HybridAdam(actor.parameters(), lr=1e-7)

  (actor, actor_optim) = strategy.prepare((actor, actor_optim))

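  try:
      # FIXME: this causes the error
      actor.to("cpu")
      print(f"[SUCCESS]: {strategy.unwrap_model(actor).__class__.__name__}")
  except RuntimeError as e:
      # report the error without re-raising, so the launcher loop (which uses set -e)
      # can continue to the next model
      print(f"[ERROR]: {strategy.unwrap_model(actor).__class__.__name__}: {e}")
      # raise e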

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--model', type=str, default='gpt',
                      choices=['gpt', 'bloom', 'opt', 'llama', 'roberta'])
  args = parser.parse_args()
  main(args)

set -xe

set_n_least_used_CUDA_VISIBLE_DEVICES() {
    local n=${1:-"9999"}
    echo "GPU Memory Usage:"
    local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
        tail -n +2 |
        nl -v 0 |
        tee /dev/tty |
        sort -g -k 2 |
        awk '{print $1}' |
        head -n $n)
    export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
    echo "Now CUDA_VISIBLE_DEVICES is set to:"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
}

set_n_least_used_CUDA_VISIBLE_DEVICES 4

export CUDA_LAUNCH_BLOCKING=1
for model in 'gpt' 'bloom' 'opt' 'llama' 'roberta'; do
    torchrun --standalone --nproc_per_node=4 reproduce_error.py --model $model
done
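
The GPU selection in set_n_least_used_CUDA_VISIBLE_DEVICES can be hard to follow as a pipeline. The sketch below restates the same idea in Python: skip the CSV header, number the rows from 0, sort by used memory, and keep the first n ids. It takes the nvidia-smi output as a string instead of running the command, the function names are made up for this illustration, and tie-breaking between equally used GPUs may differ from the `sort` call in the shell version.

```python
from typing import List


def least_used_gpu_ids(csv_text: str, n: int) -> List[int]:
    """Return the ids of the n GPUs with the lowest used memory.

    csv_text is assumed to look like the output of
    `nvidia-smi --query-gpu=memory.used --format=csv`:
    one header line, then one "<number> MiB" line per GPU.
    """
    rows = csv_text.strip().splitlines()[1:]  # drop the header line
    usage = []
    for gpu_id, row in enumerate(rows):  # ids start at 0, like `nl -v 0`
        used_mib = float(row.split()[0])  # "1000 MiB" -> 1000.0
        usage.append((used_mib, gpu_id))
    usage.sort()  # lowest usage first; ties fall back to the lower id
    return [gpu_id for _, gpu_id in usage[:n]]


def cuda_visible_devices(csv_text: str, n: int) -> str:
    """Join the selected ids with commas, as CUDA_VISIBLE_DEVICES expects."""
    return ",".join(str(i) for i in least_used_gpu_ids(csv_text, n))


sample = """memory.used [MiB]
1000 MiB
10 MiB
500 MiB
0 MiB
"""
print(cuda_visible_devices(sample, 2))
```

With the sample above, GPUs 3 and 1 have the lowest usage, so they are the two that get selected.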

The failing combinations are:

- [x] gpt2-colossalai_gemini opt-colossalai_gemini llama-colossalai_gemini roberta-colossalai_gemini

RuntimeError: CUDA error: invalid argument

Environment

PyTorch: 1.13.1
Colossal-AI: commit b3ab7fbabf
Transformers: commit 61f79b2986

ver217 commented 1 year ago

I think the first issue, about gpt2-ddp and llama, can be resolved by replacing our forked transformers with the latest official transformers.

CWHer commented 1 year ago

Error of modified train_prompts.py

The combinations are,

[x] gpt2-colossalai_gemini opt-colossalai_gemini llama-colossalai_gemini roberta-colossalai_gemini RuntimeError: CUDA error: invalid argument

Fixed by adding the following assert to ColossalAI/applications/Chat/coati/trainer/ppo.py.

if isinstance(strategy, ColossalAIStrategy):
    from colossalai.booster.plugin import GeminiPlugin
    assert not (isinstance(strategy.plugin, GeminiPlugin) and offload_inference_models), \
        "GeminiPlugin is not compatible with manual model.to('cpu')"
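
The effect of this guard can be tried without a GPU. The sketch below is only an illustration: the three classes are hypothetical stand-ins, not the real coati or colossalai classes, and check_offload_supported is a helper name invented here to wrap the same condition.

```python
class GeminiPlugin:
    """Hypothetical stand-in for colossalai.booster.plugin.GeminiPlugin."""


class OtherPlugin:
    """Hypothetical stand-in for any non-Gemini plugin."""


class ColossalAIStrategy:
    """Hypothetical stand-in that only carries a plugin attribute."""

    def __init__(self, plugin):
        self.plugin = plugin


def check_offload_supported(strategy, offload_inference_models: bool) -> None:
    # Same condition as the assert above: reject the one combination
    # (Gemini plugin plus manual offloading of inference models) that fails.
    if isinstance(strategy, ColossalAIStrategy):
        assert not (isinstance(strategy.plugin, GeminiPlugin) and offload_inference_models), \
            "GeminiPlugin is not compatible with manual model.to('cpu')"


# Allowed: Gemini without offloading, or offloading with another plugin.
check_offload_supported(ColossalAIStrategy(GeminiPlugin()), offload_inference_models=False)
check_offload_supported(ColossalAIStrategy(OtherPlugin()), offload_inference_models=True)

# Rejected: Gemini together with offloading.
try:
    check_offload_supported(ColossalAIStrategy(GeminiPlugin()), offload_inference_models=True)
    rejected = False
except AssertionError as err:
    rejected = True
    print(f"rejected: {err}")
```

The point of the guard is to fail early with a readable message instead of reaching the CUDA error later in training.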

CWHer commented 1 year ago

I think the first issue, about gpt2-ddp and llama, can be resolved by replacing our forked transformers with the latest official transformers.

The gpt2-ddp error persists even with the official transformers library (4.31.0.dev0).

CWHer commented 1 year ago

The LLaMA errors are caused by incorrect values of args.pretrain.

https://github.com/hpcaitech/ColossalAI/blob/31dc302017ff491a36088dd27ed4c76e11d5b5b7/applications/Chat/examples/train_prompts.py#L126-L127
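
To see how a given args.pretrain value is likely to be interpreted, a small check like the one below may help. It is a sketch under one assumption: that a from_pretrained-style loader treats an existing local directory as a checkpoint folder and anything else as a Hugging Face Hub repository id, which would match the "Repository Not Found" message quoted earlier. The function name describe_pretrain_arg is hypothetical and not part of coati or transformers.

```python
import os


def describe_pretrain_arg(pretrain):
    """Classify a --pretrain value before it is passed to a loader.

    Assumption: an existing directory is loaded locally, and any other
    non-empty string is looked up as a Hub repository id.
    """
    if pretrain is None or not str(pretrain).strip():
        return "missing"
    if os.path.isdir(pretrain):
        return "local directory"
    return "hub repo id"


# The current working directory exists, so it counts as a local directory.
print(describe_pretrain_arg(os.getcwd()))
# A path that does not exist on disk would be treated as a Hub repo id.
print(describe_pretrain_arg("no/such/dir/for/this/check"))
```

Printing this classification next to the tokenizer call makes it easier to tell a mistyped local path from a genuinely missing Hub repository.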

~~I believe setting a proper path can solve this problem.~~

CWHer commented 1 year ago

[x] roberta-naive roberta-ddp roberta-colossalai_gemini roberta-colossalai_zero2 CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

Remove roberta support.

CWHer commented 1 year ago

The LLaMA errors are fixed by removing the following code snippet.

https://github.com/hpcaitech/ColossalAI/blob/edd75a59eada232a7d093b070e4ec7bfd81f31c3/applications/Chat/examples/train_prompts.py#L132-L135
