nebuly-ai / optimate

A collection of libraries to optimise AI model performance
https://www.nebuly.com/
Apache License 2.0

[Chatllama] Supervised Finetune on llama-7B #244

Open · TonyZhanghm opened this issue 1 year ago

TonyZhanghm commented 1 year ago

Hi! I downloaded the SHP dataset and was trying to run the actor training. I ran into several issues here with vanilla python, torchrun, and deepspeed.

TonyZhanghm commented 1 year ago

For python artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, I got ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set, and the same error for the other ENV variables (WORLD_SIZE, MASTER_ADDR). Why is setup_model_parallel() called at all if the training runs on a single GPU?
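For context, setup_model_parallel() initializes a torch.distributed process group through the env:// rendezvous, so even a single-GPU run needs RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT defined. A minimal sketch of a single-process workaround (the default values below are assumptions for a one-GPU run, not part of chatllama):

```python
import os
import torch.distributed as dist

# Hypothetical defaults for a one-process run so the env:// rendezvous succeeds
# (values are assumptions, not taken from chatllama).
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# The real script uses the "nccl" backend; "gloo" is used here only so the
# sketch also runs on a CPU-only machine.
dist.init_process_group(backend="gloo", init_method="env://")
print(dist.get_rank(), dist.get_world_size())  # -> 0 1
dist.destroy_process_group()
```

torchrun sets the same variables automatically, which is presumably why the torchrun attempt below gets past this point.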

TonyZhanghm commented 1 year ago

Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which sets up the ENV variables, but I got a nan training loss:

[screenshot: training log with nan loss]
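One generic way to narrow down where the nan first appears (a sketch with placeholder names, not chatllama code; model, batch, targets and loss_fn are hypothetical):

```python
import torch

# Raise an error on the backward op that first produces nan/inf gradients.
torch.autograd.set_detect_anomaly(True)

def check_finite(name: str, tensor: torch.Tensor) -> None:
    # Flag the first non-finite tensor (logits, loss, gradients, ...).
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains nan/inf values")

# Inside the training loop (placeholders, shown commented out):
# logits = model(batch)            # hypothetical forward pass
# check_finite("logits", logits)
# loss = loss_fn(logits, targets)  # hypothetical loss
# check_finite("loss", loss)
```

With mixed-precision training a nan loss can also come from fp16 overflow rather than a code bug, so it is worth checking whether the very first loss value is already nan or only becomes nan after some steps.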
TonyZhanghm commented 1 year ago

I also tried deepspeed artifacts/main.py artifacts/config/config_new.yaml --type ACTOR but got the assertion error below. Since the llama-7B checkpoint is not sharded, does that mean world_size can only be 1?

Traceback (most recent call last):
  File "artifacts/main.py", line 54, in <module>
    actor_trainer = ActorTrainer(config.actor)
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/rlhf/actor.py", line 292, in __init__
    self.model = ActorModel(config)
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/rlhf/actor.py", line 54, in __init__
    self.model, self.tokenizer = load_model(
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/llama_model.py", line 598, in load_model
    checkpoint, params = load_checkpoints(ckpt_dir, local_rank, world_size)
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/llama_model.py", line 576, in load_checkpoints
    assert world_size == len(checkpoints), (
AssertionError: Loading a checkpoint for MP=1 but world size is 8
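For reference, load_checkpoints counts the *.pth shards in the checkpoint directory and asserts that the count equals the process world size, so the single-shard 7B release can only be loaded with world size 1 unless the checkpoint is resharded. A simplified sketch of that check (the function below is an illustration, not the library code):

```python
from pathlib import Path

def check_checkpoint_world_size(ckpt_dir: str, world_size: int) -> None:
    # LLaMA checkpoints are sharded as consolidated.00.pth, consolidated.01.pth, ...
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    # Each model-parallel rank loads exactly one shard, hence the assertion.
    assert world_size == len(checkpoints), (
        f"Loading a checkpoint for MP={len(checkpoints)} "
        f"but world size is {world_size}"
    )

# The 7B release ships one shard, so launching deepspeed across 8 GPUs
# (world size 8) trips the assertion; world size has to be 1 here.
# check_checkpoint_world_size("/persist/hzhang/llama_ckpt/7B/", world_size=1)
```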
diegofiori commented 1 year ago

Hello @TonyZhanghm, thank you very much for reaching out. I'll investigate the errors you are getting. Could you please share with us the config.yaml file you are currently using?

cmnfriend commented 1 year ago

I also ran into the issue that the training loss stayed nan... I'm really looking forward to your solution ;)

young-chao commented 1 year ago

> Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which sets up the ENV variables, but I got a nan training loss

May I ask your GPU memory specification? I tested on an A10 and ran into a CUDA out-of-memory problem.
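For comparing setups, the free and total device memory can be printed directly with standard PyTorch calls (generic code, not part of chatllama):

```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"free: {free / 2**30:.1f} GiB / total: {total / 2**30:.1f} GiB")
    # Per-allocator statistics can help spot where memory spikes during training.
    print(torch.cuda.memory_summary(abbreviated=True))
```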

TonyZhanghm commented 1 year ago

@diegofiori Here's the config. I didn't change much beyond filling in the weight paths and the downloaded data.

---
trainer_config:
  # learning rates
  actor_lr: 0.00001
  critic_lr: 0.00001
  # PPO Hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  # path to examples to be sampled (training dataset) see rlhf_dataset.json
  examples_path: "./SHP_datasets/rlhf_training_data.json"
  # number of episodes and generation performed for each episode
  # in the train() method
  num_episodes: 100
  max_timesteps: 32
  # number of timesteps after which the learn() method is called 
  # (to update the weights)
  update_timesteps: 32
  # number of example sampled at each timestep
  num_examples: 32
  # batch and epochs for the training
  batch_size: 1
  epochs: 1
  # number of learning steps (i.e. learn()) after which a checkpoint is saved
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"

actor_config:
  model: "llama-7B"
  model_path: "/persist/hzhang/llama_ckpt/7B/"
  checkpoint_folder: "./models"
  tokenizer_folder: "/persist/hzhang/llama_ckpt/tokenizer.model"
  train_dataset_path: "./SHP_datasets/actor_training_data.json"
  validation_dataset_path: null
  # froze model embedding during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion) it depends on
  # the model used.
  max_sequence_length: 1024
  # max tokens generated by the actor (completion only)
  max_tokens: 512
  # temperature for the actor
  temperature: 0.9
  batch_size: 1
  # number iteration after print
  iteration_per_print: 100
  lr: 0.0001
  epochs: 32
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "/persist/hzhang/nebullvm/apps/accelerate/chatllama/artifacts/config/ds_config.json"
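A quick pre-flight check along these lines can catch the path and world-size mismatches discussed above before training starts (a hedged sketch; the keys match the config file above, but the helper itself is made up):

```python
import os
from pathlib import Path

import yaml  # pip install pyyaml

def preflight(config_path: str, world_size: int) -> None:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    actor = cfg["actor_config"]

    # Paths referenced by the actor config must exist before training starts.
    for key in ("model_path", "tokenizer_folder", "train_dataset_path"):
        assert os.path.exists(actor[key]), f"{key} not found: {actor[key]}"

    # The unsharded llama-7B release ships a single consolidated.00.pth,
    # so the planned world size must match the shard count.
    shards = list(Path(actor["model_path"]).glob("*.pth"))
    assert len(shards) == world_size, (
        f"{len(shards)} checkpoint shard(s) but world size {world_size}"
    )

# preflight("artifacts/config/config_new.yaml", world_size=1)  # example usage
```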
TonyZhanghm commented 1 year ago

> May I ask your GPU memory specification? I tested on an A10 and ran into a CUDA out-of-memory problem.

I was on an A100 80GB, with the default batch size of 1.

PierpaoloSorbellini commented 1 year ago

Hi @TonyZhanghm, thanks for your input. We are debugging all the issues you reported, and a more stable version will be out soon. We are currently working on supporting all models, LLaMA + HF.

Ageliss commented 1 year ago

I have the same questions: [screenshot]