hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: ColossalChat performance reproduction FAILED #5066

Open wzj423 opened 10 months ago

wzj423 commented 10 months ago

🐛 Describe the bug

I was trying to reproduce the benchmark results on https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/README.md which says:

DeepSpeedChat performance comes from its blog on 2023 April 12, ColossalChat performance can be reproduced on an AWS p4d.24xlarge node with 8 A100-40G GPUs with the following command: torchrun --standalone --nproc_per_node 8 benchmark_opt_lora_dummy.py --num_collect_steps 1 --use_kernels --strategy colossalai_zero2 --experience_batch_size 64 --train_batch_size 32

After cloning your latest main branch, I ran the benchmark with the same command on 8 A800-80G GPUs, and the script simply skips the training process after the inference step has finished.

Here's what I've run:

srun torchrun --standalone --nproc_per_node 8 benchmark_opt_lora_dummy.py \
    --num_collect_steps 1 \
    --use_kernels \
    --strategy colossalai_zero2 \
    --experience_batch_size 64 \
    --train_batch_size 32 \
    --model 1.3b \
    &> $LOG_FILE

I added some extra logging at coati/trainer/ppo.py line 253:

            if isinstance(self.dataloader.sampler, DistributedSampler):
                self.dataloader.sampler.set_epoch(update_step)
            pbar = tqdm(self.dataloader, desc=f"Train epoch [{update_step + 1}]", disable=not is_rank_0())
            print(f"On rank {get_rank()} dataloader len={len(self.dataloader)}\n\tdata_buffer len={len(self.data_buffer)}")
            for experience in pbar:
                self._on_learn_batch_start()
                experience.to_device(self.device)
                self._training_step(experience)
                self._on_learn_batch_end(experience)

The log shows that there is not enough data to form a single batch in the dataloader, so the training process is not executed at all.

Collect steps:   0%|          | 0/1 [00:00<?, ?it/s]

Collect steps: 100%|██████████| 1/1 [00:10<00:00, 10.02s/it]
Collect steps: 100%|██████████| 1/1 [00:10<00:00, 10.02s/it]
Update steps:   0%|          | 0/1 [00:00<?, ?it/s]

On rank 0 dataloader len=0                          
    data_buffer len=64

Train epoch [1]: 0it [00:00, ?it/s]
Train epoch [1]: 0it [00:00, ?it/s]

In another test, I set train_batch_size to 8 and len(dataloader) became 1 in each episode, but 16 still does not work.

So, my questions are:

  1. Where is the possible bug?
  2. What is the proper way to reproduce the benchmark?
  3. What do experience_batch_size and train_batch_size mean, and how should I configure them?

Environment

No response

CWHer commented 10 months ago

Thanks for reporting this issue.

The following explains how this fails and what these parameters mean (problems 1 & 3).

for episode in range(num_episodes):

    # Collect phase: each collect step generates experience_batch_size samples
    # of experience and appends them to self.data_buffer.
    for collect_step in tqdm.trange(num_collect_steps):
        self._collect_phase(collect_step)

    # Update phase: data_buffer is sharded across processes and batched with
    # train_batch_size, so with 8 processes a 64-sample buffer leaves only
    # 8 samples per process, fewer than train_batch_size=32, hence
    # len(dataloader) == 0 and the update loop body never runs.
    if not self.sample_buffer:
        self.dataloader = self.strategy.setup_dataloader(self.data_buffer, self.dataloader_pin_memory)
    for update_step in range(num_update_steps):
        self._update_phase(update_step)

    self.data_buffer.clear()
The above code illustrates a simplified structure of the PPO trainer.
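To make this concrete, here is a rough back-of-the-envelope check (illustrative only, not code from the repo), assuming setup_dataloader shards data_buffer across processes and drops incomplete batches:

# Illustrative arithmetic only; mirrors the numbers reported above.
world_size = 8        # --nproc_per_node 8
buffer_size = 64      # experience_batch_size * num_collect_steps = 64 * 1

def batches_per_process(train_batch_size: int) -> int:
    samples_per_process = buffer_size // world_size   # 8 samples per GPU
    return samples_per_process // train_batch_size    # incomplete batches dropped

print(batches_per_process(32))  # 0 -> "Train epoch" loop never runs
print(batches_per_process(8))   # 1 -> matches the reported len(dataloader) == 1
print(batches_per_process(16))  # 0 -> consistent with 16 not working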

As for problem 2, I believe the document is out of date after some major revisions of ColossalChat. It requires further investigation regarding the DeepSpeedChat experiment config, and it may take some time for us to update it (or you could help us align it; you are welcome to contribute to ColossalAI).

wzj423 commented 10 months ago

OK, so train_batch_size is the local batch size, but experience_batch_size is somehow the global batch size, and experience_batch_size * num_collect_steps is the number of samples to be generated globally? If that is true, does it mean that I should always keep experience_batch_size * num_collect_steps == train_batch_size * nproc_per_node * num_update_steps (with sample_buffer == True) when training?

Besides, in this case (colossalai_zero or colossalai_gemini), by logging the tensor size each model takes, I have noticed that in the generation and inference steps (they are in the experience_maker class), each process (GPU) receives a tensor of size [64, seq_len] (i.e. [experience_batch_size, seq_len]), while in the PPOTrainer each process receives a tensor of size [train_batch_size, seq_len]. Does the experience-making stage use a different type of parallelization? I mean, with DDP (or ZeRO/Gemini) your data is split across GPUs, but the experience-making stage behaves like MP instead of DP.

Thanks for your reply!

CWHer commented 10 months ago

If that is true, does it mean that I should always keep experience_batch_size * num_collect_steps == train_batch_size * nproc_per_node * num_update_steps (with sample_buffer == True) when training?

Actually, it can work as long as experience_batch_size * num_collect_steps >= train_batch_size * nproc_per_node. num_update_steps can be set freely in case you want to use each sample multiple times (this is valid in RL and is known as the sample-to-insert ratio or replay ratio).
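As a rough illustration of that constraint and of the replay ratio (not repo code, just the arithmetic spelled out):

def replay_ratio(experience_batch_size, num_collect_steps,
                 train_batch_size, nproc_per_node, num_update_steps):
    collected = experience_batch_size * num_collect_steps             # samples generated per episode
    consumed = train_batch_size * nproc_per_node * num_update_steps   # samples needed for the updates
    return consumed / collected

# README command: 64 samples collected but 256 needed for even one update step,
# so no full batch can be formed and training is skipped.
print(replay_ratio(64, 1, 32, 8, 1))  # 4.0
# With --train_batch_size 8 each collected sample is used exactly once.
print(replay_ratio(64, 1, 8, 8, 1))   # 1.0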

Besides, in this case (colossalai_zero or colossalai_gemini), by logging the tensor size each model takes, I have noticed that in the generation and inference steps (they are in the experience_maker class), each process (GPU) receives a tensor of size [64, seq_len] (i.e. [experience_batch_size, seq_len]), while in the PPOTrainer each process receives a tensor of size [train_batch_size, seq_len].

Does the experience-making stage use a different type of parallelization? I mean, with DDP (or ZeRO/Gemini) your data is split across GPUs, but the experience-making stage behaves like MP instead of DP.

Yes, that's true. I examined the code and found that generation does indeed do the same thing (with the same data) on each process, which I believe is redundant. In other words, it simply keeps the entire model on each process and generates from the same [64, seq_len] data (as if there were no parallelization at all, neither MP nor DP).

This may be a legacy bug (design) in the code. As this part of the code is being refactored, there may not be a hotfix for this issue, and it may take some time before the revisions are merged.

For now, here is a possible solution (not tested).

--- a/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py
+++ b/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py
@@ -141,10 +141,13 @@ def main(args):
     tokenizer.padding_side = "left"

     (actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))
-
     random_prompts = torch.randint(tokenizer.vocab_size, (1000, 256), device=torch.cuda.current_device())
-    dataloader = DataLoader(
-        random_prompts, batch_size=args.experience_batch_size, shuffle=True, collate_fn=preprocess_batch
+    dataloader = strategy.plugin.prepare_dataloader(
+        random_prompts,
+        batch_size=args.experience_batch_size,
+        shuffle=True,
+        drop_last=True,
+        collate_fn=preprocess_batch,
     )

First, set up the dataloader using prepare_dataloader instead of manually setting the seed. This sets a different seed for each process (i.e., splits the prompt data across processes).

for episode in range(num_episodes):

    for collect_step in tqdm.trange(num_collect_steps):
        self._collect_phase(collect_step)
    # HERE: use torch.distributed.all_gather
    if not self.sample_buffer:
        self.dataloader = self.strategy.setup_dataloader(self.data_buffer, self.dataloader_pin_memory)
    for update_step in range(num_update_steps):
        self._update_phase(update_step)

    self.data_buffer.clear()

Second, after generation is done, invoke torch.distributed.all_gather to gather the sharded experience data from all processes.
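A minimal sketch of what that gather could look like (untested; the helper name and the data_buffer access in the usage comment are placeholders, and tensors inside each experience may need to be moved to CPU before gathering):

import torch.distributed as dist

def gather_experience(local_items):
    """Collect each rank's locally generated experience so every rank
    holds the full buffer before the update dataloader is built."""
    gathered = [None] * dist.get_world_size()
    # all_gather_object handles arbitrary picklable objects, unlike
    # all_gather, which requires same-shape tensors on every rank.
    dist.all_gather_object(gathered, local_items)
    return [item for rank_items in gathered for item in rank_items]

# e.g., at the "HERE" marker above (attribute name is illustrative):
# self.data_buffer.items = gather_experience(self.data_buffer.items)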