Open wzj423 opened 10 months ago
Thanks for reporting this issue.
The following description explains how this fails and what the meanings of these parameters are (problems 1 & 3).
```python
for episode in range(num_episodes):
    for collect_step in tqdm.trange(num_collect_steps):
        self._collect_phase(collect_step)
    if not self.sample_buffer:
        self.dataloader = self.strategy.setup_dataloader(self.data_buffer, self.dataloader_pin_memory)
    for update_step in range(num_update_steps):
        self._update_phase(update_step)
    self.data_buffer.clear()
```
The above code illustrates a simplified structure of the PPO Trainer. In the collect phase, `experience_batch_size * num_collect_steps` data (samples) will be generated, while in the update phase, things get a little more complicated. If `sample_buffer` is `True`, `num_update_steps` steps (i.e., one forward + one backward + one optimizer step) will be performed. If `sample_buffer` is `False`, `num_update_steps` epochs (i.e., passes through the entire dataset, each containing several steps) will be performed instead. `train_batch_size` indicates the batch size used in `self._update_phase()`, but it is also related to the `--strategy` parameter. In this case (ZeRO strategy), `train_batch_size` represents the local batch size, and `train_batch_size * nproc_per_node` is the global batch size. That is why `train_batch_size=8` yields `len(dataloader)=1` (the global batch size is 64).
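To make that arithmetic concrete, here is a small worked example. The GPU count and `train_batch_size` match the run discussed above; `experience_batch_size` and `num_collect_steps` are hypothetical values chosen only for illustration:

```python
# Hypothetical sanity check of the batch-size arithmetic described above.
nproc_per_node = 8            # number of GPUs / processes
train_batch_size = 8          # local (per-process) batch size
experience_batch_size = 32    # assumed value, for illustration only
num_collect_steps = 2         # assumed value, for illustration only

global_batch_size = train_batch_size * nproc_per_node          # 8 * 8 = 64
samples_collected = experience_batch_size * num_collect_steps  # 32 * 2 = 64

# With a global batch size of 64, a buffer of 64 samples yields exactly one batch:
len_dataloader = samples_collected // global_batch_size        # 64 // 64 = 1
print(global_batch_size, samples_collected, len_dataloader)    # 64 64 1
```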
As for problem 2, I believe the document is out of date after some major revisions of Colossal Chat. It requires further investigation regarding the DeepSpeedChat experiment config, and it may take some time for us to update it (or maybe you can help us align it; you are welcome to contribute to ColossalAI).
Ok, so `train_batch_size` is the local batch size, but `experience_batch_size` is somehow the global batch size, and `experience_batch_size * num_collect_steps` is the number of samples to be generated globally? If that is true, does it mean that I should always keep `experience_batch_size * num_collect_steps == train_batch_size * nproc_per_node * num_update_steps` (with `sample_buffer == True`) when training?
Besides, in this case (`colossalai_zero` or `colossalai_gemini`), by logging the tensor size each model takes, I have noticed that in the `generation` and `inference` steps (they are in the experience_maker class), each process (GPU) receives a tensor of size `[64, seq_len]` (i.e., `[experience_batch_size, seq_len]`). And in the PPOTrainer, each process receives a tensor of size `[train_batch_size, seq_len]`.
Does the experience-making stage use some different type of parallelization? I mean, with DDP (or ZeRO/Gemini) your data would be split across GPUs, but the experience-making stage behaves like MP instead of DP.
Thanks for your reply!
> If that is true, does it mean that I should always keep `experience_batch_size * num_collect_steps == train_batch_size * nproc_per_node * num_update_steps` (with `sample_buffer == True`) when training?
Actually, it can work as long as `experience_batch_size * num_collect_steps >= train_batch_size * nproc_per_node`. `num_update_steps` can be set freely in case you want to use each data sample multiple times (this is valid in RL and is known as the sample-to-insert ratio or replay ratio).
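As a quick illustration of that constraint, here is a hypothetical helper (not part of the Coati API) that fails fast when an episode cannot fill even one global batch:

```python
# Hypothetical config check; the parameter names mirror the CLI flags discussed above.
def check_ppo_config(experience_batch_size: int, num_collect_steps: int,
                     train_batch_size: int, nproc_per_node: int) -> None:
    collected = experience_batch_size * num_collect_steps
    global_batch = train_batch_size * nproc_per_node
    if collected < global_batch:
        raise ValueError(
            f"Collected {collected} samples per episode, but one update step needs "
            f"a global batch of {global_batch}; the dataloader would be empty."
        )

check_ppo_config(experience_batch_size=32, num_collect_steps=2,
                 train_batch_size=8, nproc_per_node=8)   # OK: 64 >= 64
```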
> Besides, in this case (`colossalai_zero` or `colossalai_gemini`), by logging the tensor size each model takes, I have noticed that in the `generation` and `inference` steps (they are in the experience_maker class), each process (GPU) receives a tensor of size `[64, seq_len]` (i.e., `[experience_batch_size, seq_len]`). And in the PPOTrainer, each process receives a tensor of size `[train_batch_size, seq_len]`.
> Does the experience-making stage use some different type of parallelization? I mean, with DDP (or ZeRO/Gemini) your data would be split across GPUs, but the experience-making stage behaves like MP instead of DP.
Yes, that's true. I examined the code and found that `generation` does the same thing (processes the same data) on every process, and I believe that is redundant. In other words, it just keeps the entire model on each process and generates from the same `[64, seq_len]` data (as if there were no parallelization, neither MP nor DP).
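For reference, a small (untested) diagnostic along these lines can confirm whether every rank really holds the same prompts; `prompts` is a placeholder name for the tensor fed to `generation`, and `torch.distributed` is assumed to be initialized by the strategy:

```python
# Hypothetical diagnostic: check whether every rank received identical prompts.
import torch
import torch.distributed as dist

def prompts_identical_across_ranks(prompts: torch.Tensor) -> bool:
    reference = prompts.clone()
    dist.broadcast(reference, src=0)            # every rank gets rank 0's copy
    same_as_rank0 = torch.equal(prompts, reference)
    result = torch.tensor([int(same_as_rank0)], device=prompts.device)
    dist.all_reduce(result, op=dist.ReduceOp.MIN)
    return bool(result.item())                  # True only if all ranks matched rank 0
```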
This may be a legacy bug (design flaw) in the code. As this part of the code is being refactored, there may not be a hotfix for this issue, and it may take some time before the revisions are merged.
For now, I post a possible solution (not tested) here.
```diff
--- a/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py
+++ b/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py
@@ -141,10 +141,13 @@ def main(args):
     tokenizer.padding_side = "left"
     (actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))
-
     random_prompts = torch.randint(tokenizer.vocab_size, (1000, 256), device=torch.cuda.current_device())
-    dataloader = DataLoader(
-        random_prompts, batch_size=args.experience_batch_size, shuffle=True, collate_fn=preprocess_batch
+    dataloader = strategy.plugin.prepare_dataloader(
+        random_prompts,
+        batch_size=args.experience_batch_size,
+        shuffle=True,
+        drop_last=True,
+        collate_fn=preprocess_batch,
     )
```
First, set up the dataloader using `prepare_dataloader` instead of manually setting the seed. This will set a different seed for each process (i.e., split the data across processes).
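The effect is similar to wrapping the dataset in a `DistributedSampler`, which gives each rank a disjoint shard. A minimal standalone sketch using plain PyTorch (not the Coati strategy API, whose exact behavior may differ):

```python
# Minimal illustration of per-rank data sharding; assumes torch.distributed is initialized.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000))
sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)  # shards by rank
dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)
# Each rank now iterates over roughly 1000 / world_size samples instead of all 1000.
```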
```python
for episode in range(num_episodes):
    for collect_step in tqdm.trange(num_collect_steps):
        self._collect_phase(collect_step)
        # HERE: use torch.distributed.all_gather
    if not self.sample_buffer:
        self.dataloader = self.strategy.setup_dataloader(self.data_buffer, self.dataloader_pin_memory)
    for update_step in range(num_update_steps):
        self._update_phase(update_step)
    self.data_buffer.clear()
```
Second, after `generation` is done, invoke `torch.distributed.all_gather` to gather the sharded data onto all processes.
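A minimal sketch of that gather step (not the actual Coati API; it assumes `torch.distributed` is already initialized and that the local experience tensors are padded to the same shape on every rank):

```python
# Hypothetical helper: gather per-rank experience tensors onto every process.
import torch
import torch.distributed as dist

def gather_experiences(local_batch: torch.Tensor) -> torch.Tensor:
    """Return the concatenation of `local_batch` from all ranks, on every rank."""
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(local_batch) for _ in range(world_size)]
    dist.all_gather(gathered, local_batch)   # every rank receives all shards
    return torch.cat(gathered, dim=0)        # shape [world_size * local_bs, seq_len]
```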
🐛 Describe the bug
I was trying to reproduce the benchmark results on https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/README.md, which says:
Cloning your latest main branch, I ran the benchmark with the same command on 8 A800-80G GPUs, and the script simply skips the training process after the inference step has finished.
Here's what I've run:
I have added more logging output at coati/trainer/ppo.py line 253:
The log shows that there is not enough data to make a single batch in the dataloader, so the training process is not executed at all.
In another test, I set `train_batch_size` to 8 and `len(dataloader)` is now 1 in each episode, but 16 does not work.
So, my questions are: what are the meanings of `experience_batch_size` and `train_batch_size`, and how should I configure them?
Environment
No response