huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Multi-task dataset mixing #217

Open ghomasHudson opened 4 years ago

ghomasHudson commented 4 years ago

It seems like many of the best performing models on the GLUE benchmark make some use of multitask learning (simultaneous training on multiple tasks).

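The T5 paper highlights multiple ways of mixing the tasks together during finetuning:

  • Examples-proportional mixing: sample from each task in proportion to its dataset size.

  • Equal mixing: sample uniformly from each task.

  • Temperature-scaled mixing: raise each task's mixing rate to the power 1/T and renormalize. With T=1 this is equivalent to examples-proportional mixing, and it approaches equal mixing as T increases.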

Following this discussion https://github.com/huggingface/transformers/issues/4340 in transformers, @enzoampil suggested that the nlp library might be a better place for this functionality.

Some method for combining datasets could be implemented, e.g.

dataset = nlp.load_multitask(['squad','imdb','cnn_dm'], temperature=2.0, ...)

We would need a few additions:

It would be great to support common use cases such as pretraining on the GLUE benchmark before fine-tuning on each GLUE task in turn.

I'm willing to write bits/most of this; I just need some guidance on the interface and other library details so I can integrate it properly.

patrickvonplaten commented 4 years ago

I like this feature! I think the first question we should decide on is how to convert all datasets into the same format. In T5, the authors decided to format every dataset into a text-to-text format. If the dataset had "multiple" inputs like MNLI, the inputs were concatenated. So in MNLI the input:

  • Hypothesis: The St. Louis Cardinals have always won.

  • Premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but

was flattened to a single input:

mnli hypothesis: The St. Louis Cardinals have always won. premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but.

This flattening is actually a very simple operation in nlp already. You would just need to do the following:

def flatten_inputs(example):
    return {"input": "mnli hypothesis: " + example['hypothesis'] + " premise: " + example['premise']}

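# Drop every original column except the output column (assumed to be "label" for MNLI).
columns_to_remove = [name for name in mnli_ds.column_names if name != "label"]
t5_ready_mnli_ds = mnli_ds.map(flatten_inputs, remove_columns=columns_to_remove)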

So I guess converting the datasets into the same format can be left to the user for now. Then the question is how we can merge the datasets. I would probably be in favor of a simple

dataset.add()

function that checks if the dataset is of the same format and if yes merges the two datasets. Finally, how should the sampling be implemented? Examples-proportional mixing corresponds to just merging the datasets and shuffling. For the other two sampling approaches we would need some higher-level features, maybe even a dataset.sample() function for merged datasets.
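
To make the merge-and-shuffle idea concrete, here is a minimal sketch in plain Python. It is only an illustration: add_datasets is a hypothetical helper (not part of nlp), and plain lists of dicts stand in for real datasets.

```python
import random

def add_datasets(first, second, seed=0):
    """Merge two lists of example dicts that share the same keys, then shuffle.

    Hypothetical stand-in for the proposed dataset.add(); not an nlp API.
    """
    # Format check: both datasets must expose the same columns.
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("datasets must have the same columns to be merged")
    merged = list(first) + list(second)
    # Shuffling the concatenation gives examples-proportional mixing:
    # each task appears in proportion to its number of examples.
    random.Random(seed).shuffle(merged)
    return merged

mnli = [{"input": f"mnli example {i}", "output": "entailment"} for i in range(6)]
rte = [{"input": f"rte example {i}", "output": "not_entailment"} for i in range(2)]
mixed = add_datasets(mnli, rte)
print(len(mixed))
```

Because the merged list is simply shuffled, a task with three times as many examples is seen three times as often, which is exactly the examples-proportional behaviour.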

What are your thoughts on this @thomwolf @lhoestq @ghomasHudson @enzoampil ?

ghomasHudson commented 4 years ago

I agree that we should leave the flattening of the dataset to the user for now, especially because, although the T5 framing seems obvious, it differs slightly from other approaches such as GPT-3 and decaNLP.

In terms of sampling, Examples-proportional mixing does seem the simplest to implement so would probably be a good starting point.

Temperature-scaled mixing would probably be most useful, offering flexibility as it can simulate the other 2 methods by setting the temperature parameter. There is a relevant part of the T5 repo which should help with implementation.

According to the T5 authors, equal-mixing performs worst. Among the other two methods, tuning the K value (the artificial dataset size limit) has a large impact.
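
As a rough sketch of how the temperature and K parameters interact, the function below computes one sampling probability per dataset. The name mixing_rates and its signature are assumptions made for illustration; this is not the T5 repo's code.

```python
def mixing_rates(sizes, temperature=1.0, k=None):
    """Return one sampling probability per dataset (hypothetical helper)."""
    # Examples-proportional step: cap each size at the artificial limit K.
    capped = [min(n, k) if k is not None else n for n in sizes]
    # Temperature step: raise to the power 1/T.
    # T=1 keeps the proportions; a large T pushes them towards equal mixing.
    scaled = [n ** (1.0 / temperature) for n in capped]
    total = sum(scaled)
    # Renormalize so the rates sum to 1.
    return [s / total for s in scaled]

sizes = [100000, 1000]
print(mixing_rates(sizes))                    # examples-proportional
print(mixing_rates(sizes, temperature=2.0))   # flatter than proportional
print(mixing_rates(sizes, k=1000))            # [0.5, 0.5]: both sizes hit the cap
```

With T=2 the 100:1 size ratio becomes a 10:1 sampling ratio, which shows how the temperature softens the dominance of the largest dataset without going all the way to equal mixing.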

enzoampil commented 4 years ago

I agree with going with temperature-scaled mixing for its flexibility!

For the function that combines the datasets, I also find dataset.add() okay while also considering that users may want it to be easy to combine a list of say 10 data sources in one go.

dataset.sample() should also be good. By the looks of it, the main parameters would be temperature and K.

On converting the datasets to the same format, I agree that we can leave these to the users for now. But, I do imagine it'd be an awesome feature for the future to have this automatically handled, based on a chosen approach to formatting :smile:

E.g. T5, GPT-3, decaNLP, original raw formatting, or a contributed way of formatting in text-to-text.

thomwolf commented 4 years ago

This is an interesting discussion indeed and it would be nice to make multi-task easier.

Probably the best would be to have a new type of dataset especially designed for that in order to easily combine and sample from the multiple datasets.

This way we could probably handle the combination of datasets with differing schemas as well (unlike T5).

ghomasHudson commented 4 years ago

@thomwolf Are you suggesting making a wrapper class which can take existing datasets as arguments and do all the required sampling/combining, to present the same interface as a normal dataset?

That doesn't seem too complicated to implement.

ghomasHudson commented 4 years ago

I guess we're looking at the end user writing something like:

ds = nlp.load_dataset('multitask-t5',datasets=["squad","cnn_dm",...], k=1000, t=2.0)

Using the t5 method of combining here (or this could be a function passed in as an arg)

Passing kwargs to each 'sub-dataset' might become tricky.

ghomasHudson commented 4 years ago

After thinking about @thomwolf's suggestion, I've started experimenting:

class MultitaskDataset(DatasetBuilder):
    def __init__(self, *args, **kwargs):
        super(MultitaskDataset, self).__init__(*args, **kwargs)
        self._datasets = kwargs.get("datasets")

    def _info(self):
        return nlp.DatasetInfo(
            description=_DESCRIPTION,
            features=nlp.Features({
                    "source": nlp.Value("string"),
                    "target": nlp.Sequence(nlp.Value("string"))
                })
        )

    def _get_common_splits(self):
        '''Finds the common splits present in all self._datasets'''
        min_set = None
        for dataset in self._datasets:
            if min_set is not None:
                # intersection() returns a new set, so the result must be kept
                min_set = min_set.intersection(dataset.keys())
            else:
                min_set = set(dataset.keys())
        return min_set

....

# Maybe this?:
squad = nlp.load_dataset("squad")
cnn_dm = nlp.load_dataset("cnn_dailymail","3.0.0")
multitask_dataset = nlp.load_dataset(
    'multitask_dataset',
    datasets=[squad, cnn_dm],
    k=1000, 
    t=2.0
)

Does anyone know what methods of MultitaskDataset I would need to implement? Maybe as_dataset and download_and_prepare? Most of these should be just calling the methods of the sub-datasets.

I'm assuming DatasetBuilder is better than the more specific GeneratorBasedBuilder, BeamBasedBuilder, etc....

One of the other problems is that the dataset size is unknown till you construct it (as you can pick the sub-datasets). Am hoping not to need to make changes to nlp.load_dataset just for this class.

I'd appreciate it if anyone more familiar with nlp's internal workings could tell me if I'm on the right track!

thomwolf commented 4 years ago

I think I would probably go for a MultiDataset wrapper around a list of Dataset.

I'm not sure we need to give it k and t parameters at creation, it can maybe be something along the lines of:

squad = nlp.load_dataset("squad")
cnn_dm = nlp.load_dataset("cnn_dailymail","3.0.0")

multitask_dataset = nlp.MultiDataset(squad, cnn_dm)

batch = multitask_dataset.sample(10, temperature=2.0, k=1000)

The first proof-of-concept for multi-task datasets could definitely require that the provided datasets have the same name/type for columns (if needed, you can easily rename/cast a column prior to instantiating the MultiDataset).

It's good to think about it for some time, though, and not overfit too much on the T5 examples (in particular for the ways/kwargs for sampling among datasets).

ghomasHudson commented 4 years ago

The problem with changing k and t per sampling is that you'd have to somehow remember which examples you'd already returned while re-weighting the remaining examples based on the new k and t values. It seems possible but complicated (I can't really see a reason why you'd want to change the weighting of datasets after you've constructed the MultiDataset).

Wouldn't it be convenient if it implemented the dataset interface? Then if someone has code using a single nlp dataset, they can replace it with a multitask combination of more datasets without having to change other code. We would at least need to be able to pass it into a DataLoader.

ghomasHudson commented 4 years ago

A very janky (but working) implementation of multitask_dataset.sample() could be something like this:

import nlp
import torch

class MultiDataset():
    def __init__(self, *args, temperature=2.0, k=1000, maximum=None, scale=1):
        self.datasets = args
        self._dataloaders = {}
        for split in self._get_common_splits():
            split_datasets = [ds[split] for ds in self.datasets]
            mixing_rates = self._calc_mixing_rates(split_datasets,temperature, k, maximum, scale)
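            weights = []
            for rate, ds in zip(mixing_rates, split_datasets):
                # Per-example weight: dividing by the dataset size makes the chance of
                # drawing from a dataset proportional to its mixing rate.
                weights += [rate / len(ds)] * len(ds)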
            self._dataloaders[split] = torch.utils.data.DataLoader(torch.utils.data.ConcatDataset(split_datasets),
                                                        sampler=torch.utils.data.sampler.WeightedRandomSampler(
                                                            num_samples=len(weights),
                                                            weights = weights,
                                                            replacement=True),
                                                        shuffle=False)

    def _get_common_splits(self):
        '''Finds the common splits present in all self.datasets'''
        min_set = None
        for dataset in self.datasets:
            if min_set is not None:
                # intersection() returns a new set, so the result must be kept
                min_set = min_set.intersection(dataset.keys())
            else:
                min_set = set(dataset.keys())
        return min_set

    def _calc_mixing_rates(self, datasets, temperature=2.0, k=1000, maximum=None, scale=1):
        '''Work out the weighting of each dataset based on t and k'''
        mixing_rates = []
        for dataset in datasets:
            rate = len(dataset)
            rate *= scale
            if maximum:
                rate = min(rate, maximum)
            if temperature != 1.0:
                rate = rate ** (1.0/temperature)
            mixing_rates.append(rate)
        return mixing_rates

    def sample(self, n, split):
        batch = []
        for example in self._dataloaders[split]:
            if len(batch) >= n:
                break
            batch.append(example)
        # Always return a list, even if the loader runs out before n examples
        return batch

def flatten(dataset,flatten_fn):
    for k in dataset.keys():
        if isinstance(dataset[k],nlp.Dataset):
            dataset[k] = dataset[k].map(flatten_fn,remove_columns=dataset[k].column_names)

# Squad
def flatten_squad(example):
    return {"source": "squad context: " + example['context'] + " question: " + example['question'],"target":example["answers"]["text"]}
squad = nlp.load_dataset("squad")
flatten(squad,flatten_squad)

# CNN_DM
def flatten_cnn_dm(example):
    return {"source": "cnn_dm: " + example['article'],"target":[example["highlights"]]}
cnn_dm = nlp.load_dataset("cnn_dailymail", "3.0.0")
flatten(cnn_dm,flatten_cnn_dm)

multitask_dataset = MultiDataset(squad, cnn_dm)
batch = multitask_dataset.sample(100,"train")

There's definitely a more sensible way than embedding DataLoaders inside.

thomwolf commented 4 years ago

There is an interesting related investigation by @zphang here https://colab.research.google.com/github/zphang/zphang.github.io/blob/master/files/notebooks/Multi_task_Training_with_Transformers_NLP.ipynb

ghomasHudson commented 4 years ago

Good spot! Here are my thoughts:

ghomasHudson commented 4 years ago

Another thought: Multitasking over benchmarks (represented as Meta-datasets in nlp) is probably a common use case. Would be nice to pass an entire benchmark to our MultiDataset wrapper rather than having to pass individual components.

ghomasHudson commented 4 years ago

Here's a fully working implementation based on the __iter__ function of @zphang.

import nlp
import numpy as np

class MultiDataset:
    def __init__(self,tasks):
        self.tasks = tasks

        # Create random order of tasks
        # Using size-proportional sampling
        task_choice_list = []
        for i, task in enumerate(self.tasks):
            task_choice_list += [i] * len(task)
        task_choice_list = np.array(task_choice_list)
        np.random.shuffle(task_choice_list)

        # Add index into each dataset
        # - We don't want to shuffle within each task
        counters = {}
        self.task_choice_list = []
        for i in range(len(task_choice_list)):
            idx = counters.get(task_choice_list[i],0)
            self.task_choice_list.append((task_choice_list[i],idx))
            counters[task_choice_list[i]] = idx + 1

    def __len__(self):
        return np.sum([len(t) for t in self.tasks])

    def __repr__(self):
        task_str = ", ".join([str(t) for t in self.tasks])
        return f"MultiDataset(tasks: {task_str})"

    def __getitem__(self,key):
        if isinstance(key, int):
            task_idx, example_idx = self.task_choice_list[key]
            task = self.tasks[task_idx]
            example = task[example_idx]
            example["task_name"] = task.info.builder_name
            return example
        elif isinstance(key, slice):
            raise NotImplementedError()

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

def load_multitask(*datasets):
    '''Create multitask datasets per split'''

    def _get_common_splits(datasets):
        '''Finds the common splits present in all datasets'''
        min_set = None
        for dataset in datasets:
            if min_set is not None:
                # intersection() returns a new set, so the result must be kept
                min_set = min_set.intersection(dataset.keys())
            else:
                min_set = set(dataset.keys())
        return min_set

    common_splits = _get_common_splits(datasets)
    out = {}
    for split in common_splits:
        out[split] = MultiDataset([d[split] for d in datasets])
    return out

##########################################
# Dataset Flattening

def flatten(dataset,flatten_fn):
    for k in dataset.keys():
        if isinstance(dataset[k],nlp.Dataset):
            dataset[k] = dataset[k].map(flatten_fn,remove_columns=dataset[k].column_names)

# Squad
def flatten_squad(example):
    return {"source": "squad context: " + example['context'] + " question: " + example['question'],
          "target":example["answers"]["text"]}
squad = nlp.load_dataset("squad")
flatten(squad,flatten_squad)

# CNN_DM
def flatten_cnn_dm(example):
    return {"source": "cnn_dm: " + example['article'],"target":[example["highlights"]]}
cnn_dm = nlp.load_dataset("cnn_dailymail", "3.0.0")
flatten(cnn_dm,flatten_cnn_dm)

#############################################

mtds = load_multitask(squad,cnn_dm)

for example in mtds["train"]:
    print(example["task_name"],example["target"])

Let me know if you have any thoughts. I've started using this in some of my projects and it seems to work. If people are happy with the general approach for a first version, I can make a pull request.

zphang commented 4 years ago

Hey! Happy to jump into the discussion here. I'm still getting familiar with bits of this code, but the reasons I sampled over data loaders rather than datasets are 1) ensuring that each sampled batch corresponds to only 1 task (in case of different input formats/downstream models) and 2) potentially having different batch sizes per task (e.g. some tasks have very long/short inputs). How are you currently dealing with these in your PR?

ghomasHudson commented 4 years ago

The short answer is - I'm not! Everything is currently on a per-example basis. It would be fairly simple to add a batch_size argument which would ensure that every batch_size examples come from the same task. That should suit most use-cases (unless you wanted to ensure batches all came from the same task and apply something like SortishSampler on each task first)
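
A batch_size argument of this kind could look roughly like the sketch below. It is a plain-Python illustration under stated assumptions: task_batches is a hypothetical helper, and each task is represented as a list of example dicts rather than an nlp dataset.

```python
import random

def task_batches(tasks, batch_size, seed=0):
    """Return a shuffled list of (task_name, batch) pairs.

    Every batch holds examples from a single task (hypothetical helper).
    """
    batches = []
    for name, examples in tasks.items():
        # Slice each task into consecutive chunks; the last one may be shorter.
        for start in range(0, len(examples), batch_size):
            batches.append((name, examples[start:start + batch_size]))
    # Shuffle whole batches, so tasks are interleaved in proportion to
    # their size while each batch stays homogeneous.
    random.Random(seed).shuffle(batches)
    return batches

tasks = {
    "squad": [{"task": "squad", "id": i} for i in range(5)],
    "cnn_dm": [{"task": "cnn_dm", "id": i} for i in range(3)],
}
batches = task_batches(tasks, batch_size=2)
for name, batch in batches:
    print(name, [ex["id"] for ex in batch])
```

Shuffling at the batch level keeps the size-proportional mixing of the per-example version while guaranteeing that a batch never mixes tasks.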

Your notebook was really inspiring by the way - thanks!

ghomasHudson commented 4 years ago

@zphang is having different batch sizes per task actually helpful? Would be interesting to know as it's not something I've come across as a technique used by any MTL papers.

ghomasHudson commented 4 years ago

mt-dnn's batcher.py might be worth looking at.

ldong87 commented 3 years ago

Quoting @ghomasHudson:

@zphang is having different batch sizes per task actually helpful? Would be interesting to know as it's not something I've come across as a technique used by any MTL papers.

I think having different batch sizes per task is particularly helpful in scenarios where each task has a different amount of data. For example, the problem I'm currently facing is that one task has tens of thousands of samples while another has a couple hundred. I think in this case different batch sizes could help. But if using the same batch size is a lot simpler to implement, I guess it makes sense to go with that.

timothyjlaurent commented 3 years ago

I think that instead of size-proportional sampling, you should specify weights or probabilities for drawing a batch from each dataset. We should also ensure that the smaller datasets are repeated so that the encoder layer doesn't overtrain on the largest dataset.
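
One way to read this suggestion is sketched below in plain Python. The helper weighted_task_stream is hypothetical, it draws single examples rather than whole batches for brevity, and it assumes every task is a non-empty list.

```python
import itertools
import random

def weighted_task_stream(tasks, weights, num_draws, seed=0):
    """Yield (task_name, example) pairs, picking the task by explicit weights."""
    rng = random.Random(seed)
    names = list(tasks)
    # cycle() restarts a dataset once it is exhausted, so small tasks repeat.
    iterators = {name: itertools.cycle(tasks[name]) for name in names}
    probs = [weights[name] for name in names]
    for _ in range(num_draws):
        # Choose the source task according to the user-supplied weights.
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(iterators[name])

tasks = {"big": list(range(1000)), "small": [0, 1]}
stream = list(weighted_task_stream(tasks, {"big": 0.5, "small": 0.5}, num_draws=200))
small_draws = [example for name, example in stream if name == "small"]
print(len(stream), len(small_draws))
```

Here the two-example task is revisited many times, so it contributes about as many draws as the large one instead of being swamped by it.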

ghomasHudson commented 3 years ago

Are there any references for people doing different batch sizes per task in the literature? I've only seen constant batch sizes with differing numbers of batches for each task which seems sufficient to prevent the impact of large datasets (Read 3.5.3 of the T5 paper for example).

rabeehk commented 3 years ago

Hi, regarding building a T5 dataset, I think we can use datasets (https://github.com/huggingface/datasets) and would then need something similar to tf.data.experimental.sample_from_datasets, which can sample from multiple datasets at given rates. Do you know if similar functionality exists in PyTorch? Thanks.

StefanHeng commented 1 year ago

Is this feature part of a datasets release yet?

StefanHeng commented 1 year ago

Quoting @ghomasHudson:

Here's a fully working implementation based on the __iter__ function of @zphang.

  • I've generated the task choice list in the constructor as it allows us to index into the MultiDataset just like a normal dataset. I'm changing task_choice_list into a list of (dataset_idx, example_idx) so each entry references a unique dataset example. The shuffling has to be done before this as we don't want to shuffle within each task (we assume this is done by the user if this is what they intend).
  • I'm slightly concerned this list could become very large if many large datasets were used. Can't see a way round it at the moment though.
  • I've used task.info.builder_name as the dataset name. Not sure if this is correct.
  • I'd love to add some of the other Dataset methods (map, slicing by column, etc...). Would be great to implement the whole interface so a single dataset can be simply replaced by this.
  • This does everything on the individual example-level. If some application required batches all from a single task in turn we can't really do that.

Not sure if this is what I'm looking for, but I implemented a version of Examples-Proportional mixing that supports only the basic features here; it seems to work in my project.

lhoestq commented 1 year ago

You can use interleave_datasets to mix several datasets together. By default it alternates between all the datasets, but you can also provide sampling probabilities if you want to oversample from one of the datasets.

from datasets import load_dataset, interleave_datasets

squad = load_dataset("squad", split="train")
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")
ds = interleave_datasets([squad, cnn_dm])

print(ds[0])
# {'id': '5733be284776f41900661182',
#  'title': 'University_of_Notre_Dame',
#  'context': 'Architecturally, the school has a Catholic character...',
#  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
#  'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]},
#  'article': None,
#  'highlights': None}
print(ds[1])
# {'id': '42c027e4ff9730fbb3de84c1af0d2c506e41c3e4',
#  'title': None,
#  'context': None,
#  'question': None,
#  'answers': None,
#  'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe...',
#  'highlights': "Harry Potter star Daniel Radcliffe..."}

See the docs at https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.interleave_datasets

rabeehk commented 1 year ago

I also have this implementation of a multi-task sampler here, which I used to tune T5: https://github.com/rabeehk/hyperformer/blob/main/hyperformer/data/multitask_sampler.py