Open ghomasHudson opened 4 years ago
I like this feature! I think the first question we should decide on is how to convert all datasets into the same format. In T5, the authors decided to format every dataset into a text-to-text format. If the dataset had "multiple" inputs like MNLI, the inputs were concatenated. So in MNLI the input:
Hypothesis: The St. Louis Cardinals have always won.
Premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but
was flattened to a single input:
mnli hypothesis: The St. Louis Cardinals have always won. premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but.
This flattening is actually a very simple operation in nlp already. You would just need to do the following:
def flatten_inputs(example):
    return {"input": "mnli hypothesis: " + example['hypothesis'] + " premise: " + example['premise']}

t5_ready_mnli_ds = mnli_ds.map(flatten_inputs, remove_columns=[<all columns except output>])
So I guess converting the datasets into the same format can be left to the user for now. Then the question is how we can merge the datasets. I would probably be in favor of a simple dataset.add() function that checks if the dataset is of the same format and if yes merges the two datasets. Finally, how should the sampling be implemented? Examples-proportional mixing corresponds to just merging the datasets and shuffling. For the other two sampling approaches we would need some higher-level features, maybe even a dataset.sample() function for merged datasets.
What are your thoughts on this @thomwolf @lhoestq @ghomasHudson @enzoampil ?
I agree that we should leave the flattening of the dataset to the user for now. Especially because although the T5 framing seems obvious, there are slight variations on how the T5 authors do it in comparison to other approaches such as gpt-3 and decaNLP.
In terms of sampling, Examples-proportional mixing does seem the simplest to implement so would probably be a good starting point.
Temperature-scaled mixing would probably be most useful, offering flexibility as it can simulate the other 2 methods by setting the temperature parameter. There is a relevant part of the T5 repo which should help with implementation.
According to the T5 authors, equal-mixing performs worst. Among the other two methods, tuning the K value (the artificial dataset size limit) has a large impact.
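To make the three strategies concrete, here is a minimal sketch of how the mixing rates could be computed from the dataset sizes, following the description in the T5 paper (cap each size at K, normalise, then raise to the power 1/T and renormalise). The function name `mixing_rates` and its parameters are assumptions made for illustration; nothing like this exists in nlp.

```python
def mixing_rates(sizes, k=None, temperature=1.0):
    """Return normalised sampling rates for datasets with the given sizes.

    sizes:       number of examples in each dataset.
    k:           optional artificial dataset size limit (the K value).
    temperature: 1.0 gives examples-proportional mixing; larger values
                 flatten the rates towards equal mixing.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Examples-proportional mixing: cap every dataset size at k.
    capped = [min(n, k) if k is not None else n for n in sizes]
    total = sum(capped)
    rates = [c / total for c in capped]
    # Temperature scaling: raise each rate to 1/T, then renormalise.
    scaled = [r ** (1.0 / temperature) for r in rates]
    norm = sum(scaled)
    return [s / norm for s in scaled]


if __name__ == "__main__":
    sizes = [100_000, 1_000]
    print(mixing_rates(sizes))                    # examples-proportional
    print(mixing_rates(sizes, k=10_000))          # with a size limit
    print(mixing_rates(sizes, temperature=2.0))   # temperature-scaled
```

With temperature=1.0 and no k this reduces to plain size-proportional rates, and a very large temperature approaches equal mixing, which is why a single temperature parameter can cover all three cases.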
I agree with going with temperature-scaled mixing for its flexibility!
For the function that combines the datasets, I also find dataset.add() okay while also considering that users may want it to be easy to combine a list of say 10 data sources in one go.
dataset.sample() should also be good. By the looks of it, we're planning to have as main parameters: temperature, and K.
On converting the datasets to the same format, I agree that we can leave these to the users for now. But, I do imagine it'd be an awesome feature for the future to have this automatically handled, based on a chosen approach to formatting :smile:
E.g. T5, GPT-3, decaNLP, original raw formatting, or a contributed way of formatting in text-to-text.
This is an interesting discussion indeed and it would be nice to make multi-task easier.
Probably the best would be to have a new type of dataset especially designed for that in order to easily combine and sample from the multiple datasets.
This way we could probably handle the combination of datasets with differing schemas as well (unlike T5).
@thomwolf Are you suggesting making a wrapper class which can take existing datasets as arguments and do all the required sampling/combining, to present the same interface as a normal dataset?
That doesn't seem too complicated to implement.
I guess we're looking at the end user writing something like:
ds = nlp.load_dataset('multitask-t5',datasets=["squad","cnn_dm",...], k=1000, t=2.0)
Using the t5 method of combining here (or this could be a function passed in as an arg)
Passing kwargs to each 'sub-dataset' might become tricky.
From thinking upon @thomwolf 's suggestion, I've started experimenting:
class MultitaskDataset(DatasetBuilder):
    def __init__(self, *args, **kwargs):
        super(MultitaskDataset, self).__init__(*args, **kwargs)
        self._datasets = kwargs.get("datasets")

    def _info(self):
        return nlp.DatasetInfo(
            description=_DESCRIPTION,
            features=nlp.Features({
                "source": nlp.Value("string"),
                "target": nlp.Sequence(nlp.Value("string"))
            })
        )

    def _get_common_splits(self):
        '''Finds the common splits present in all self._datasets'''
        min_set = None
        for dataset in self._datasets:
            if min_set is not None:
                # intersection() returns a new set, so the result must be kept
                min_set = min_set.intersection(set(dataset.keys()))
            else:
                min_set = set(dataset.keys())
        return min_set
....
# Maybe this?:
squad = nlp.load_dataset("squad")
cnn_dm = nlp.load_dataset("cnn_dailymail","3.0.0")
multitask_dataset = nlp.load_dataset(
    'multitask_dataset',
    datasets=[squad,cnn_dm],
    k=1000,
    t=2.0
)
Does anyone know what methods of MultitaskDataset I would need to implement? Maybe as_dataset and download_and_prepare? Most of these should be just calling the methods of the sub-datasets.
I'm assuming DatasetBuilder is better than the more specific GeneratorBasedBuilder, BeamBasedBuilder, etc....
One of the other problems is that the dataset size is unknown till you construct it (as you can pick the sub-datasets). Am hoping not to need to make changes to nlp.load_dataset just for this class.
I'd appreciate it if anyone more familiar with nlp's internal workings could tell me if I'm on the right track!
I think I would probably go for a MultiDataset wrapper around a list of Dataset.
I'm not sure we need to give it k and t parameters at creation, it can maybe be something along the lines of:
squad = nlp.load_dataset("squad")
cnn_dm = nlp.load_dataset("cnn_dailymail","3.0.0")
multitask_dataset = nlp.MultiDataset(squad, cnn_dm)
batch = multitask_dataset.sample(10, temperature=2.0, k=1000)
The first proof-of-concept for multi-task datasets could definitely require that the provided datasets have the same name/type for columns (if needed you can easily rename/cast a column prior to instantiating the MultiDataset).
It's good to think about it for some time though and not overfit too much on the T5 examples (in particular for the ways/kwargs for sampling among datasets).
The problem with changing k and t per sampling is that you'd have to somehow remember which examples you'd already returned while re-weighting the remaining examples based on the new k and t values. It seems possible but complicated (I can't really see a reason why you'd want to change the weighting of datasets after you constructed the multidataset).
Wouldn't it be convenient if it implemented the dataset interface? Then if someone has code using a single nlp dataset, they can replace it with a multitask combination of more datasets without having to change other code. We would at least need to be able to pass it into a DataLoader.
A very janky (but working) implementation of multitask_dataset.sample() could be something like this:
import nlp
import torch

class MultiDataset():
    def __init__(self, *args, temperature=2.0, k=1000, maximum=None, scale=1):
        self.datasets = args
        self._dataloaders = {}
        for split in self._get_common_splits():
            split_datasets = [ds[split] for ds in self.datasets]
            mixing_rates = self._calc_mixing_rates(split_datasets, temperature, k, maximum, scale)
            weights = []
            for i in range(len(split_datasets)):
                n = len(split_datasets[i])
                # Spread each dataset's rate over its examples so that the dataset
                # as a whole is drawn in proportion to its mixing rate
                weights += [mixing_rates[i] / n] * n
            self._dataloaders[split] = torch.utils.data.DataLoader(
                torch.utils.data.ConcatDataset(split_datasets),
                sampler=torch.utils.data.sampler.WeightedRandomSampler(
                    num_samples=len(weights),
                    weights=weights,
                    replacement=True),
                shuffle=False)

    def _get_common_splits(self):
        '''Finds the common splits present in all self.datasets'''
        min_set = None
        for dataset in self.datasets:
            if min_set is not None:
                # intersection() returns a new set, so the result must be kept
                min_set = min_set.intersection(set(dataset.keys()))
            else:
                min_set = set(dataset.keys())
        return min_set

    def _calc_mixing_rates(self, datasets, temperature=2.0, k=1000, maximum=None, scale=1):
        '''Work out the weighting of each dataset based on t and k'''
        mixing_rates = []
        for dataset in datasets:
            rate = len(dataset)
            rate *= scale
            if maximum:
                rate = min(rate, maximum)
            if temperature != 1.0:
                rate = rate ** (1.0/temperature)
            mixing_rates.append(rate)
        return mixing_rates

    def sample(self, n, split):
        batch = []
        for example in self._dataloaders[split]:
            batch.append(example)
            n -= 1
            if n == 0:
                return batch

def flatten(dataset, flatten_fn):
    for k in dataset.keys():
        if isinstance(dataset[k], nlp.Dataset):
            dataset[k] = dataset[k].map(flatten_fn, remove_columns=dataset[k].column_names)

# Squad
def flatten_squad(example):
    return {"source": "squad context: " + example['context'] + " question: " + example['question'],
            "target": example["answers"]["text"]}

squad = nlp.load_dataset("squad")
flatten(squad, flatten_squad)

# CNN_DM
def flatten_cnn_dm(example):
    return {"source": "cnn_dm: " + example['article'], "target": [example["highlights"]]}

cnn_dm = nlp.load_dataset("cnn_dailymail", "3.0.0")
flatten(cnn_dm, flatten_cnn_dm)

multitask_dataset = MultiDataset(squad, cnn_dm)
batch = multitask_dataset.sample(100, "train")
There's definitely a more sensible way than embedding DataLoaders inside.
There is an interesting related investigation by @zphang here https://colab.research.google.com/github/zphang/zphang.github.io/blob/master/files/notebooks/Multi_task_Training_with_Transformers_NLP.ipynb
Good spot! Here are my thoughts:
- MultitaskModel to transformers might be a thing to raise - even though having task-specific heads has become unfashionable in recent times in favour of text-to-text type models.
- map datasets into a common form.
- The size-proportional sampling (also called "Examples-proportional mixing") used here doesn't perform too badly in the T5 paper (it's comparable to temperature-scaled mixing in many cases but less flexible. This is only reasonable with a K maximum size parameter to prevent very large datasets dominating). This might be good for a first prototype using:
def __iter__(self):
    """
    For each batch, sample a task, and yield a batch from the respective
    task Dataloader.
    We use size-proportional sampling, but you could easily modify this
    to sample from some-other distribution.
    """
    task_choice_list = []
    for i, task_name in enumerate(self.task_name_list):
        task_choice_list += [i] * self.num_batches_dict[task_name]
    task_choice_list = np.array(task_choice_list)
    np.random.shuffle(task_choice_list)
    dataloader_iter_dict = {
        task_name: iter(dataloader)
        for task_name, dataloader in self.dataloader_dict.items()
    }
    for task_choice in task_choice_list:
        task_name = self.task_name_list[task_choice]
        yield next(dataloader_iter_dict[task_name])
We'd just need to pull samples from the raw datasets and not from DataLoaders for each task. We can assume the user has done dataset.shuffle() if they want to.
Other sampling methods can later be implemented by changing how the task_choice_list is generated. This should allow more flexibility and not tie us to specific methods for sampling among datasets.
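As a rough illustration of that last point, the sketch below builds a task_choice_list from arbitrary per-task weights instead of a plain shuffle. The helper name `make_task_choice_list` and its arguments are hypothetical, and the weights are assumed to be strictly positive. Every example is still visited exactly once; only the order in which tasks come up changes.

```python
import random


def make_task_choice_list(sizes, weights=None, seed=None):
    """Build a schedule of task indices, one entry per example.

    sizes:   number of examples in each task.
    weights: relative (strictly positive) sampling weight per task.
             Defaults to the sizes, which favours large tasks early on.
    """
    rng = random.Random(seed)
    if weights is None:
        weights = sizes
    remaining = list(sizes)
    schedule = []
    while any(remaining):
        # Only tasks that still have unseen examples can be drawn.
        live = [i for i, n in enumerate(remaining) if n > 0]
        choice = rng.choices(live, weights=[weights[i] for i in live])[0]
        schedule.append(choice)
        remaining[choice] -= 1
    return schedule


if __name__ == "__main__":
    # Two tasks with 3 and 2 examples, the second weighted more heavily.
    print(make_task_choice_list([3, 2], weights=[1, 5], seed=0))
```

This loop is linear in the total number of examples times the number of tasks, so it is only meant to show where a different sampling rule would plug in, not to be an efficient implementation.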
Another thought: Multitasking over benchmarks (represented as Meta-datasets in nlp) is probably a common use case. Would be nice to pass an entire benchmark to our MultiDataset wrapper rather than having to pass individual components.
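Here's a fully working implementation based on the __iter__ function of @zphang.
- I've generated the task choice list in the constructor as it allows us to index into the MultiDataset just like a normal dataset. I'm changing task_choice_list into a list of (dataset_idx, example_idx) so each entry references a unique dataset example. The shuffling has to be done before this as we don't want to shuffle within each task (we assume this is done by the user if this is what they intend).
- I'm slightly concerned this list could become very large if many large datasets were used. Can't see a way round it at the moment though.
- I've used task.info.builder_name as the dataset name. Not sure if this is correct.
- I'd love to add some of the other Dataset methods (map, slicing by column, etc...). Would be great to implement the whole interface so a single dataset can be simply replaced by this.
- This does everything on the individual example-level. If some application required batches all from a single task in turn we can't really do that.
import nlp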
import numpy as np

class MultiDataset:
    def __init__(self, tasks):
        self.tasks = tasks
        # Create random order of tasks
        # Using size-proportional sampling
        task_choice_list = []
        for i, task in enumerate(self.tasks):
            task_choice_list += [i] * len(task)
        task_choice_list = np.array(task_choice_list)
        np.random.shuffle(task_choice_list)
        # Add index into each dataset
        # - We don't want to shuffle within each task
        counters = {}
        self.task_choice_list = []
        for i in range(len(task_choice_list)):
            idx = counters.get(task_choice_list[i], 0)
            self.task_choice_list.append((task_choice_list[i], idx))
            counters[task_choice_list[i]] = idx + 1

    def __len__(self):
        # Built-in sum keeps the result a plain Python int
        return sum(len(t) for t in self.tasks)

    def __repr__(self):
        task_str = ", ".join([str(t) for t in self.tasks])
        return f"MultiDataset(tasks: {task_str})"

    def __getitem__(self, key):
        if isinstance(key, int):
            task_idx, example_idx = self.task_choice_list[key]
            task = self.tasks[task_idx]
            example = task[example_idx]
            example["task_name"] = task.info.builder_name
            return example
        elif isinstance(key, slice):
            raise NotImplementedError()

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

def load_multitask(*datasets):
    '''Create multitask datasets per split'''
    def _get_common_splits(datasets):
        '''Finds the common splits present in all datasets'''
        min_set = None
        for dataset in datasets:
            if min_set is not None:
                # intersection() returns a new set, so the result must be kept
                min_set = min_set.intersection(set(dataset.keys()))
            else:
                min_set = set(dataset.keys())
        return min_set

    common_splits = _get_common_splits(datasets)
    out = {}
    for split in common_splits:
        out[split] = MultiDataset([d[split] for d in datasets])
    return out

##########################################
# Dataset Flattening
def flatten(dataset, flatten_fn):
    for k in dataset.keys():
        if isinstance(dataset[k], nlp.Dataset):
            dataset[k] = dataset[k].map(flatten_fn, remove_columns=dataset[k].column_names)

# Squad
def flatten_squad(example):
    return {"source": "squad context: " + example['context'] + " question: " + example['question'],
            "target": example["answers"]["text"]}

squad = nlp.load_dataset("squad")
flatten(squad, flatten_squad)

# CNN_DM
def flatten_cnn_dm(example):
    return {"source": "cnn_dm: " + example['article'], "target": [example["highlights"]]}

cnn_dm = nlp.load_dataset("cnn_dailymail", "3.0.0")
flatten(cnn_dm, flatten_cnn_dm)

#############################################
mtds = load_multitask(squad, cnn_dm)
for example in mtds["train"]:
    print(example["task_name"], example["target"])
Let me know if you have any thoughts. I've started using this in some of my projects and it seems to work. If people are happy with the general approach for a first version, I can make a pull request.
Hey! Happy to jump into the discussion here. I'm still getting familiar with bits of this code, but the reasons I sampled over data loaders rather than datasets are 1) ensuring that each sampled batch corresponds to only 1 task (in case of different input formats/downstream models) and 2) potentially having different batch sizes per task (e.g. some tasks have very long/short inputs). How are you currently dealing with these in your PR?
The short answer is - I'm not! Everything is currently on a per-example basis. It would be fairly simple to add a batch_size argument which would ensure that every batch_size examples come from the same task. That should suit most use-cases (unless you wanted to ensure batches all came from the same task and apply something like SortishSampler on each task first)
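A minimal sketch of what such a batch_size argument might do, assuming all we need is a schedule of (task index, example indices) pairs: the task order is shuffled at the batch level, while examples inside a task are consumed in their original order. The generator name `task_batches` is hypothetical and not part of nlp.

```python
import random


def task_batches(sizes, batch_size, seed=None):
    """Yield (task_idx, example_indices) so every batch comes from one task.

    sizes:      number of examples in each task.
    batch_size: maximum number of examples per batch.
    """
    rng = random.Random(seed)
    # One entry per batch, so larger tasks come up proportionally more often.
    order = []
    for task_idx, n in enumerate(sizes):
        num_batches = -(-n // batch_size)  # ceiling division
        order += [task_idx] * num_batches
    rng.shuffle(order)
    # Walk through each task in its original order (no shuffling within a task).
    next_start = [0] * len(sizes)
    for task_idx in order:
        start = next_start[task_idx]
        stop = min(start + batch_size, sizes[task_idx])
        next_start[task_idx] = stop
        yield task_idx, list(range(start, stop))


if __name__ == "__main__":
    for task_idx, indices in task_batches([5, 3], batch_size=2, seed=0):
        print(task_idx, indices)
```

The last batch of each task may be shorter than batch_size; a real implementation would have to decide whether to drop or pad it.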
Your notebook was really inspiring by the way - thanks!
@zphang is having different batch sizes per task actually helpful? Would be interesting to know as it's not something I've come across as a technique used by any MTL papers.
mt-dnn's batcher.py might be worth looking at.
I think having different batch sizes per task is particularly helpful in some scenarios where each task has a different amount of data. For example, the problem I'm currently facing is that one task has tens of thousands of samples while another has a couple of hundred. I think in this case different batch sizes could help. But if using the same batch size is a lot simpler to implement, I guess it makes sense to go with that.
I think that instead of proportional to size sampling you should specify weights or probabilities for drawing a batch from each dataset. We should also ensure that the smaller datasets are repeated so that the encoder layer doesn't overtrain on the largest dataset.
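One way to picture this suggestion is the plain-Python sketch below: each draw picks a dataset according to user-supplied probabilities, and a dataset that runs out simply starts again from its first example. The function name `sample_from_datasets` is borrowed loosely from the TensorFlow helper of the same name but is a hypothetical stand-in here, and the datasets are assumed to be non-empty and indexable.

```python
import random


def sample_from_datasets(datasets, probabilities, num_examples, seed=None):
    """Yield (dataset_idx, example) pairs drawn with fixed probabilities.

    Smaller datasets wrap around and repeat instead of being exhausted.
    """
    if len(datasets) != len(probabilities):
        raise ValueError("need exactly one probability per dataset")
    rng = random.Random(seed)
    positions = [0] * len(datasets)
    for _ in range(num_examples):
        idx = rng.choices(range(len(datasets)), weights=probabilities)[0]
        ds = datasets[idx]
        # The modulo makes the read position wrap back to the first example.
        example = ds[positions[idx] % len(ds)]
        positions[idx] += 1
        yield idx, example


if __name__ == "__main__":
    big = list(range(100))
    small = ["a", "b"]
    for idx, example in sample_from_datasets([big, small], [0.5, 0.5], 10, seed=0):
        print(idx, example)
```

Because the draw is per example, a batch built from consecutive outputs can mix tasks; drawing one dataset index per batch instead would keep batches homogeneous.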
Are there any references for people doing different batch sizes per task in the literature? I've only seen constant batch sizes with differing numbers of batches for each task which seems sufficient to prevent the impact of large datasets (Read 3.5.3 of the T5 paper for example).
Hi, regarding building a T5 dataset, I think we can use datasets (https://github.com/huggingface/datasets) and would then need something similar to tf.data.experimental.sample_from_datasets, which can sample from multiple datasets with the given rates. Do you know if similar functionality exists in PyTorch? Thanks.
Is this feature part of a datasets release yet?
Not sure if this is what I'm looking for, but I implemented a version of Examples-Proportional mixing supporting only the basic feature here, seems to work in my project.
You can use interleave_datasets to mix several datasets together. By default it alternates between all the datasets, but you can also provide sampling probabilities if you want to oversample from one of the datasets:
from datasets import load_dataset, interleave_datasets
squad = load_dataset("squad", split="train")
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")
ds = interleave_datasets([squad, cnn_dm])
print(ds[0])
# {'id': '5733be284776f41900661182',
# 'title': 'University_of_Notre_Dame',
# 'context': 'Architecturally, the school has a Catholic character...',
# 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
# 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]},
# 'article': None,
# 'highlights': None}
print(ds[1])
# {'id': '42c027e4ff9730fbb3de84c1af0d2c506e41c3e4',
# 'title': None,
# 'context': None,
# 'question': None,
# 'answers': None,
# 'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe...',
# 'highlights': "Harry Potter star Daniel Radcliffe..."}
see docs at https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.interleave_datasets
I also have this implementation of a multi-task sampler here, which I used to tune T5: https://github.com/rabeehk/hyperformer/blob/main/hyperformer/data/multitask_sampler.py
It seems like many of the best performing models on the GLUE benchmark make some use of multitask learning (simultaneous training on multiple tasks).
The T5 paper highlights multiple ways of mixing the tasks together during finetuning:
Following this discussion https://github.com/huggingface/transformers/issues/4340 in transformers, @enzoampil suggested that the nlp library might be a better place for this functionality.
Some method for combining datasets could be implemented, e.g.
We would need a few additions:
It would be great to support common use cases such as pretraining on the GLUE benchmark before fine-tuning on each GLUE task in turn.
I'm willing to write bits/most of this; I just need some guidance on the interface and other library details so I can integrate it properly.