huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Edit train_dataloader in callbacks #25925

Open arshadshk opened 1 year ago

arshadshk commented 1 year ago

In certain scenarios it becomes necessary to modify the train_dataloader after each epoch. In principle this can be managed through callbacks, since the train dataloader is passed to the callback function. However, the training dataloader is not an attribute of the base Trainer class; instead, a local variable called train_dataloader is created at the beginning of training. As a result, attempting to replace this dataloader from within a callback has no effect.

Here's a code example illustrating this issue:

from transformers import TrainerCallback

class CustomCallback(TrainerCallback):
    def on_epoch_begin(self, args, state, control, model, tokenizer, **kwargs):
        # new_dataloader is built elsewhere; this only rebinds the local kwargs entry
        kwargs['train_dataloader'] = new_dataloader

The problem is that reassigning kwargs['train_dataloader'] inside the callback only rebinds an entry in the callback's local kwargs dict; the training loop keeps iterating over the original dataloader object.
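A minimal illustration of why the reassignment is a no-op, using plain Python and made-up names: ** unpacking builds a fresh dict for the callee, so rebinding an entry there never reaches the caller.

def on_epoch_begin(**kwargs):
    kwargs['train_dataloader'] = 'replacement'  # rebinds the callee's dict only

def training_loop(dataloader):
    on_epoch_begin(train_dataloader=dataloader)
    return dataloader  # the loop still holds the original object

assert training_loop('original') == 'original'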

Please expose a reference to the dataloader actually used for training in the callback, rather than a copy, so that it can be modified from within callbacks. Currently there is no other way to change the dataloader after each epoch or after a certain number of steps.

To make train_dataloader editable by callbacks, it would be beneficial to store it as an attribute of the Trainer instance rather than as a local variable in the training loop. Callbacks could then replace it effectively; as it stands, the reference handed to callbacks is never read back by the loop, so any reassignment is silently ignored.

Who can help?

@muellerzr @pacman100 @Narsil

Reproduction

from transformers import TrainerCallback

class CustomCallback(TrainerCallback):
    def on_epoch_begin(self, args, state, control, model, tokenizer, **kwargs):
        # new_dataloader is built elsewhere; this only rebinds the local kwargs entry
        kwargs['train_dataloader'] = new_dataloader

Expected behavior

The dataloader used for training should be replaced by the new_dataloader assigned in the callback.

muellerzr commented 1 year ago

What kind of modification to the dataloader are you expecting? Trainer has used its own get_train_dataloader, get_eval_dataloader, and get_test_dataloader functions for a while, so it would be good to know what specific modifications to the dataloader you are looking for; otherwise this may require substantial rewrites of a few areas.

Most notably: dataloaders are sent through accelerator.prepare, so the dataloader the loop actually uses has been wrapped for distributed systems and really shouldn't be touched much.
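For reference, the supported customization point is overriding get_train_dataloader in a Trainer subclass. A rough sketch, assuming a recent transformers version where the Trainer owns an Accelerator instance (the subclass name is made up); note this hook runs once when training starts, not every epoch:

from torch.utils.data import DataLoader
from transformers import Trainer

class CustomLoaderTrainer(Trainer):
    def get_train_dataloader(self):
        loader = DataLoader(
            self.train_dataset,
            batch_size=self._train_batch_size,  # set by Trainer.__init__
            collate_fn=self.data_collator,
            shuffle=True,
        )
        # Mirror what Trainer does internally for distributed setups.
        return self.accelerator.prepare(loader)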

dumpmemory commented 1 year ago

@muellerzr sometimes we might want to use https://github.com/imoneoi/multipack_sampler/blob/master/multipack_sampler.py#L94 for more efficient training

muellerzr commented 1 year ago

So what you care about is using a custom sampler here instead?

dumpmemory commented 1 year ago

So what you care about is using a custom sampler here instead?

Yeah. I might need to open a new feature request instead of putting things here.
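A rough sketch of the kind of thing meant here: a hypothetical Trainer subclass that wires a custom batch sampler (e.g. the multipack sampler linked above) in through get_train_dataloader. Names are illustrative, and distributed preparation is deliberately omitted:

from torch.utils.data import DataLoader
from transformers import Trainer

class CustomSamplerTrainer(Trainer):
    def __init__(self, *args, batch_sampler=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._custom_batch_sampler = batch_sampler  # yields lists of indices per batch

    def get_train_dataloader(self):
        # NOTE: accelerator.prepare / DDP wrapping is omitted in this sketch;
        # a real implementation would need to handle that as well.
        return DataLoader(
            self.train_dataset,
            batch_sampler=self._custom_batch_sampler,
            collate_fn=self.data_collator,
        )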

muellerzr commented 11 months ago

A new feature request would be good @dumpmemory, as that's very different. Honestly, I'd probably open that on the Accelerate side, since that's what those dataloaders use in DDP. @arshadshk can you provide more details on what you want to modify in the dataloader? We do a lot of preprocessing etc. on the dataloader, which is why we keep it that way in general. More clarity on what you specifically want to do with it here would be great.

b96705008 commented 10 months ago

We would also like to have this feature. We found that we cannot modify the IterableDataset once it has been passed to Trainer and the train dataloader has been created. We probably need to update it every epoch, otherwise the training loss suddenly drops and then rises again at the start of each new epoch. Please let me know if there is a solution for this.
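One pattern that may help, assuming the data is a datasets.IterableDataset and a Trainer/Accelerate version that forwards set_epoch to the dataset at each epoch boundary (the dataset name here is just an example):

from datasets import load_dataset

# Example streaming dataset; any datasets.IterableDataset works.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
shuffled = stream.shuffle(seed=42, buffer_size=10_000)

# Passed to Trainer(train_dataset=shuffled, ...), set_epoch(epoch) reseeds
# the shuffle buffer so each epoch sees a different sample order:
shuffled.set_epoch(1)  # what the training loop is expected to do per epoch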