[RFC] Dataset builders: to have or not to have?

🚀 Feature

The current design in torchtext presents the user with two APIs for dataset construction:

the "raw" API, which returns the raw text data from the dataset, and
the one-liner builder API that applies tokenization + vocab mapping + returns train + val + test datasets.

While I understand that building the vocabulary might be annoying, I think that it is important to have one recommended way of doing things in torchtext. The one-liner API solves a few problems well, but for maximum flexibility the user might need the raw API. But if the user is only used to the one-liner API, switching to the raw API might be non-trivial.

I propose that we instead favor the raw API, and illustrate with examples and tutorials how the vocabulary etc should be done.

Here are two examples. I'm using Map-style dataset for simplicity, deciding between map-style or iterable-style datasets is a topic for a different discussion

Example 1: Text classification

This is one example of what I would propose for a text classification dataset

class AGNews:
    def __init__(self, ..., src_transform=None, tgt_transform=None):
        self.src_transform = src_transform
        self.tgt_transform = tgt_transform

    def __getitem__(self, idx):
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            label = self.tgt_transform(label)

        return src, label

Then, the user would use the dataset in the following way

tokenizer = get_default_tokenizer(lang="en")
raw_dataset = AGNews(..., src_transform=tokenizer)
# or the equivalent API
vocab = build_vocab(raw_dataset)
# user can cache the vocab if they want
# or combine multiple datasets via ConcatDataset
# before creating the vocab, etc
...

# now create the datasets used for training
dataset_train = AGNews(..., split='train', src_transform=Compose([tokenizer, vocab]))
dataset_test = AGNews(..., split='test', src_transform=Compose([tokenizer, vocab]))

The current proposal adds two extra lines overhead to the user, but it teaches the user how to use torchtext for doing whatever they need.

Example 2: Translation

Here is an example for a translation dataset

class EnFrTranslation:
    def __getitem__(self, idx):
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            tgt = self.tgt_transform(tgt)
        return src, tgt

And the user then do in their code

tok_en = get_default_tokenizer(lang="en")
tok_fr = get_default_tokenizer(lang="fr")

# source data for creating the vocabulary
# can be the same dataset or a completely different one
# but it's explicit to the user on how they can obtain
# different vocabs, for example from unsupervised
# datasets where we don't have pairings
raw_dataset = EnFrTranslation(..., src_transform=tok_en, tgt_transform=tok_fr)

# build the vocabulary for each language
# independently
vocab_en, vocab_fr = Vocab(), Vocab()
for idx in range(len(raw_dataset)):
    src, tgt = raw_dataset[idx]
    vocab_en.add(src)
    vocab_fr.add(tgt)
vocab_en.finalize()
vocab_fr.finalize()

# now create the datasets used for training
# the model
dataset_train = EnFrTranslation(..., split='train',
        src_transform=Compose([tok_en, vocab_en]),
        tgt_transform=Compose([tok_fr, vocab_fr])
)
dataset_test = EnFrTranslation(..., split='test',
        src_transform=Compose([tok_en, vocab_en]),
        tgt_transform=Compose([tok_fr, vocab_fr])
)

Verbose, but explicit

While with the proposed way of laying down datasets makes it a bit more verbose for users to get started, the intent is clear for the beginning. A vocabulary is nothing more than a data transformation (akin to the transforms we use in torchvision), with the subtlety that it needs to be "trained", and how we "train" it is independent on the dataset.

One benefit for this being explicit is that the user has less opportunity to shoot themselves on the foot. As an example, re-using the same vocab while changing the tokenizer is a silent error with the "one-liner" API, as there is nothing that we can do to prevent them from mixing up different tokenizer and vocab. One could have expected it to just magically work.

Making it explicit puts the burden on the user on how to cache vocab and how to let the user perform the transformations themselves, and not on the library maintainer.

Towards Vocab within the model?

The above proposal also makes it clear that the vocabulary (and tokenizer) could be part of the model instead of the dataset. Indeed, a model is tightly coupled with a vocab (and the tokenizer as well), so once we have an efficient pmap in PyTorch we could just move them to the model. But that's a topic for a separate discussion.

Thanks @fmassa for the proposal. I think what you propose here is similar to the abstractions provided in experimental.datasets. For example, see TextClassificationDataset (here). So the question here is really "should we provide factory functions to return datasets with a single command". I had some discussions with @cpuhrsch before, and we thought about the option to phase out the "builders". Obviously, the builder functions may be too abstract. However, I could see some potential users, who just want to prototype some ideas with the processed data, try different tokenizers, ngram number, or even vocabularies. I would like to hear more feedback from our users side about this issue (cc @dongreenberg) and see if we could make a solid decision for this. cc @bentrevett Any feedback for this issue?

The only pain point I can see when trying to split a training set into training/validation, then building the vocabulary across the training set only.

Using a modified version of the code snippet provided above:

tokenizer = get_default_tokenizer(lang="en")
raw_train_dataset = AGNews(..., split='train', src_transform=tokenizer)
raw_train_dataset, raw_valid_dataset = dataset_split(raw_train_dataset) # user provided function? 
vocab = build_vocab(raw_train_dataset) # only want to build vocab on train, not valid!

# here I want to get the training set and split it into the exact same train/valid split as above
train_dataset = AGNews(..., split='train', src_transform=Compose([tokenizer, vocab]))

# how can I make this give the same split as used to construct the vocab?
train_dataset, valid_dataset = dataset_split(train_dataset)

I believe this can be handled by the user by them having their dataset_split function take in a seed to ensure both splits are the same. Other ideas: torchtext provides a dataset split function that works on all torchtext datasets. Or the torchtext datasets have a split method.

Another option is just overwriting the dataset's src_transform method:

tokenizer = get_default_tokenizer(lang="en")
train_dataset = AGNews(..., split='train', src_transform=tokenizer)
test_dataset = AGNews(..., split='test', src_transform=tokenizer)
train_dataset, valid_dataset = dataset_split(raw_train_dataset) 
vocab = build_vocab(raw_train_dataset)

train_dataset.src_transform = Compose([tokenizer, vocab])
valid_dataset.src_transform = Compose([tokenizer, vocab])
test_dataset.src_transform = Compose([tokenizer, vocab])

This looks more like a dirty hack though.

The way I've been handling it with the experimental API is getting the raw data, converting to lists and then shuffle/splitting the lists, building the vocab and then creating an instance of, e.g., TextClassificationDataset, see below:

def raw_data_to_dataset(raw_data: list[list[str]], tokenizer: callable, vocab: Vocab):

    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))

    label_transform = sequential_transforms(lambda x: 1 if x == 'pos' else 0, 
                                            totensor(dtype=torch.long))

    transforms = (label_transform, text_transform)

    dataset = TextClassificationDataset(raw_data,
                                        vocab,
                                        transforms)

    return dataset

trying

@bentrevett Yup. I think the way you did with raw_data_to_dataset func is what we want users to learn and handle (check out raw datasets, build transform pipeline). And it's consistent with what @fmassa proposes here. FYI, pytorch provides a dataset split func: torch.utils.data.random_split.

Additionally, we would like to hear more opinions for "dataset builders". For example, AG_NEWS in text classification or WikiText2 in language modeling. Do you think those "dataset builders" are still useful or we should just provide the APIs for the raw datasets.

As part of the discussion, I put together a short review for the heterogeneity of the text datasets:

text classification datasets have texts and labels. For example, the AG_NEWS dataset (link) is a csv file in which labels and texts are separated by comma. IMDB dataset (link) contains thousands of individual files and each file has a single sentence. Positive and negative reviews are grouped in a separate folder, respectively.
language modeling datasets usually have multiple files with text sentences. For example, WikiText2 dataset (link) is a text file. BookCorpus dataset (see FAIR cluster /datasets01/bookcorpus/021819/) has multiple files, and each file represents a book.
question answer datasets have context/questions/answer/answer-positions in a JSON file (link).
translation datasets have a pair of source and target sentences saved in two separate files (en vs fr).

trying

@bentrevett Yup. I think the way you did with raw_data_to_dataset func is what we want users to learn and handle (check out raw datasets, build transform pipeline). And it's consistent with what @fmassa proposes here. FYI, pytorch provides a dataset split func: torch.utils.data.random_split.

Additionally, we would like to hear more opinions for "dataset builders". For example, AG_NEWS in text classification or WikiText2 in language modeling. Do you think those "dataset builders" are still useful or we should just provide the APIs for the raw datasets.

I think removing the builders is fine, as more datasets getting added there will have to be more and more edge cases to be written into the setup_datasets. Also means less code for torchtext to maintain.

I'd still keep the TextClassificationDataset, LanguageModelingDataset, etc. classes though, and then for each of them a have a short code example that would be similar to the setup_datasets functionality there is now.

pytorch / text