huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

More than 10 times slowdown between version 0.26.1 and version 0.31.0, EDIT: It was a data loading issue with Hugging Face Datasets #2890

Closed · marhlder closed this 6 days ago

marhlder commented 1 week ago

System Info

Accelerate version 0.26.1 and version 0.31.0, Python 3.10.4, torch 2.3.1, numpy 1.26.4

Launching with

notebook_launcher(self.run_method, [self.configuration], num_processes=torch.cuda.device_count())

from a plain Python file, i.e. without actually running in a notebook.
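
For context, a minimal, self-contained sketch of that launch pattern (with a hypothetical run_method and a placeholder configuration dict standing in for our actual entry point and settings) would look something like this:

    # Minimal sketch of launching multi-GPU training from a plain Python file.
    # run_method and the configuration dict are placeholders for illustration.
    import torch
    from accelerate import notebook_launcher

    def run_method(configuration):
        # Build the Accelerator, model and dataloaders, then run the training loop.
        ...

    if __name__ == "__main__":
        notebook_launcher(
            run_method,
            args=({"mixed_precision": "bf16"},),  # placeholder configuration
            num_processes=torch.cuda.device_count(),
        )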

Reproduction

We are training a rather large T5 model (1.3 billion parameters) with MoE / Switch mechanisms on a 16 x A100 GPU machine in GCP. The model works with long input sequences (3072) and shorter output sequences (192). We tried to update our Accelerate dependency from 0.26.1 to the latest version (0.31.0), but we experienced a huge increase in training time. The model still seemed to learn well without any issues. The main motivation for updating was this issue: https://github.com/huggingface/accelerate/issues/1050, which didn't seem to be fixed anyway.

The Accelerator object is configured like this:

        ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=find_unused_parameters)
        self.project_configuration = ProjectConfiguration(automatic_checkpoint_naming=True,
                                                          project_dir=self.output_dir, total_limit=save_total_limit,
                                                          )
        # plugin = GradientAccumulationPlugin(
        #       num_steps=gradient_accumulation_steps,
        #       sync_each_batch=False
        # )

        self.accelerator = Accelerator(
            gradient_accumulation_steps=gradient_accumulation_steps,
            #gradient_accumulation_plugin=plugin,
            project_dir=self.output_dir,
            kwargs_handlers=[ddp_kwargs],
            project_config=self.project_configuration,
            mixed_precision=mixed_precision_mode,
            # dispatch_batches=False,
            use_seedable_sampler=True,
            #dataloader_config=DataLoaderConfiguration(use_seedable_sampler=True)
        )
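
For reference, newer Accelerate releases group the dataloader-related arguments (including use_seedable_sampler) into DataLoaderConfiguration, which is what the commented-out dataloader_config line above corresponds to. A minimal sketch of that form (the mixed-precision mode here is just a placeholder):

    # Minimal sketch of the DataLoaderConfiguration form of the same setup.
    # gradient_accumulation_steps matches the report (16); the mixed-precision
    # mode is a placeholder since the report does not state the exact value.
    from accelerate import Accelerator
    from accelerate.utils import DataLoaderConfiguration

    dataloader_config = DataLoaderConfiguration(use_seedable_sampler=True)
    accelerator = Accelerator(
        gradient_accumulation_steps=16,
        mixed_precision="bf16",
        dataloader_config=dataloader_config,
    )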

The model object is compiled before being sent through the Accelerate prepare method:

self.model.model = torch.compile(self.model.model)

Accelerate prepare is called like this:

self.model.model, lr_scheduler, optimizer, train_dataset_loader, test_dataset_loader = self.accelerator.prepare(
            self.model.model,
            lr_scheduler,
            optimizer,
            train_dataset_loader,
            test_dataset_loader,
        )

Main training loop looks something like this:

    with self.accelerator.autocast():
        batch_idx = 0
        for batch in train_dataset_loader:
            with self.accelerator.accumulate(self.model):
                output = self.model(batch)
                if isinstance(output, ModelOutput):
                    loss = output.loss
                else:
                    loss = output[0]

                self.accelerator.backward(loss)

                if self.accelerator.sync_gradients and self.max_norm:
                    self.accelerator.clip_grad_norm_(self.model.parameters(), self.max_norm)

                optimizer_step()
                optimizer.zero_grad()

            logger.info(f"State saved epoch {epoch}, step {global_step}, batch_idx {batch_idx}")
            if batch_idx == 0:
                logging.info(f"Process {self.accelerator.process_index}, input ids {batch['input_ids'].tolist()}")

            if not self.accelerator.optimizer_step_was_skipped:
                batch_idx = batch_idx + 1
            else:
                if self.accelerator.is_main_process:
                    print("Skipped optimizer update due to mixed precision gradient")

The optimizer_step() function is defined like this:

    def optimizer_step():
        optimizer.step()
        lr_scheduler.step()

Logs before updating:

2024-06-20 19:38:14.676
workerpool0-0
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 1
2024-06-20 19:38:15.675
workerpool0-0
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 2
2024-06-20 19:38:16.676
workerpool0-0
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 3
2024-06-20 19:38:17.675
workerpool0-0
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 4
2024-06-20 19:38:18.676
workerpool0-0
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 5

Logs after updating:

2024-06-24 22:15:56.820 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 1
2024-06-24 22:16:04.819 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 2
2024-06-24 22:16:16.820 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 3
2024-06-24 22:16:27.819 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 4
2024-06-24 22:16:36.820 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 5
2024-06-24 22:16:48.819 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 6
2024-06-24 22:16:59.819 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 7
2024-06-24 22:17:08.820 CEST
modeling.supervisors.accelerator_supervisor - INFO - State saved epoch 0, step 0, batch_idx 8

These logs show that each iteration of the training loop is now significantly slower: roughly 8-12 seconds per batch_idx increment instead of about 1 second before the update.

Rolling back to 0.26.1 brings performance back to the expected level. We are running with gradient accumulation = 16. CPU and GPU utilization seems comparable in both cases: around 20% for CPU and 96% on average for GPU.

Expected behavior

The expected behavior is similar or possibly better performance after updating Accelerate, or some documentation of what we need to change to get back to the expected performance.

SunMarc commented 1 week ago

Hi @marhlder, thanks for the detailed report. This is indeed a big issue. If you have time, could you share a minimal reproducer? Does this happen only in a DDP setup, or also when training on only one GPU? Thanks a lot! cc @muellerzr

marhlder commented 1 week ago

We have not tested this on a single-GPU setup yet, as single A100 GPU configurations in GCP are currently "unobtainium". It's going to take me some time to distill this into a small reproducible example, as our current setup is quite modular / split into many files.

SunMarc commented 1 week ago

Got it! Keep us updated! In the meantime, we will also try to replicate and fix the issue! cc @muellerzr

marhlder commented 1 week ago

Hmm, it appears that it may not be related to Accelerate after all. I mistakenly thought it was fixed just by downgrading to 0.26.1, but it seems it's not. I will investigate further.

marhlder commented 6 days ago

Okay, I'm sorry, but it was a false alarm after all. It turns out that I was also switching between two data setups when I was switching between the two versions of Accelerate.

It turns out that the real underlying issue is this one in Hugging Face's Datasets library: https://github.com/huggingface/datasets/issues/6637

What confused me was that the GPU utilization reported by GCP was still very high, so I didn't suspect a data loading problem. But I guess it was possibly copying back and forth between CPU and GPU, or doing some kind of polling to get the data?

Anyway, not using the with_format() API and instead performing my own map() operation to convert my values into tensors seems to work much better. It's still slower overall, but NOT due to Accelerate, it seems.
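
For anyone running into the same thing, here is a minimal sketch of one way to avoid with_format("torch"), building the tensors in a collate_fn instead (not necessarily the exact map()-based change we made; the column names and batch size are just assumptions for illustration):

    # Minimal sketch: skip datasets' with_format("torch") and build tensors at
    # batch time in a collate_fn. Column names "input_ids" and "labels" and the
    # batch size are assumptions; sequences are assumed already padded to a
    # fixed length so torch.tensor() produces rectangular tensors.
    import torch
    from torch.utils.data import DataLoader

    def collate_to_tensors(examples):
        return {
            "input_ids": torch.tensor([ex["input_ids"] for ex in examples], dtype=torch.long),
            "labels": torch.tensor([ex["labels"] for ex in examples], dtype=torch.long),
        }

    train_dataset_loader = DataLoader(
        train_dataset,  # a datasets.Dataset left in its plain (non-torch) format
        batch_size=8,
        collate_fn=collate_to_tensors,
    )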

SunMarc commented 6 days ago

Awesome! Thanks for the update @marhlder!