Hi,
you can define your own custom transform with Dataset.set_transform() and then run the tokenizer on the batches of augmented data as follows:
from datasets import load_dataset
import torch

# assuming `tokenizer`, `model` and an `augly_text_augmentation` callable are already defined
dset = load_dataset("imdb", split="train")  # let's say we are working with the IMDB dataset
dset.set_transform(lambda ex: {"text": augly_text_augmentation(ex["text"])}, columns=["text"], output_all_columns=True)
dataloader = torch.utils.data.DataLoader(dset, batch_size=32)

for epoch in range(5):
    for batch in dataloader:
        # the transform re-augments the raw text on every access, so each epoch sees fresh augmentations
        tokenizer_output = tokenizer(batch.pop("text"), padding=True, truncation=True, return_tensors="pt")
        batch.update(tokenizer_output)
        batch["labels"] = batch.pop("label")  # transformers models expect `labels`, not `label`
        output = model(**batch)
        ...
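In the snippet above, augly_text_augmentation is just a placeholder for whatever augmentation callable you want to apply. With AugLy's text module it could look roughly like this (a sketch only; double-check AugLy's exact function names and arguments):

import augly.text as textaugs

def augly_text_augmentation(texts):
    # `texts` is the batch of raw strings handed over by set_transform;
    # simulate_typos returns augmented strings of the same shape.
    return textaugs.simulate_typos(texts)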
Preprocessing functions/augmentations, unless super generic, should be defined in separate libraries, so I'm closing this issue.
Is your feature request related to a problem? Please describe.
Facebook recently launched a library, AugLy, that has a unified API for augmentations for image, video, and text.
It would be pretty exciting to have it hooked up to HF libraries so that we can make NLP models robust to misspellings, punctuation changes, emojis, etc. Plus, with Transformers supporting more CV use cases, having augmentation support becomes crucial.
Describe the solution you'd like
The biggest difference between augmentations and preprocessing is that preprocessing happens only once, while augmentations run once per epoch. AugLy operates on text directly, so this breaks the typical workflow where we would run the tokenizer once, set the format to PyTorch tensors, and be ready for the DataLoader.
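For context, the typical one-shot preprocessing flow looks roughly like this (a minimal sketch; the checkpoint name and column choices are just placeholders):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
dset = load_dataset("imdb", split="train")

# tokenization happens exactly once, up front...
dset = dset.map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"), batched=True)
# ...then the dataset is formatted as PyTorch tensors and handed to the DataLoader
dset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
dataloader = torch.utils.data.DataLoader(dset, batch_size=32)

# with per-epoch augmentation this no longer works: the raw text would have to be
# re-augmented and re-tokenized at the start of every epoch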
Describe alternatives you've considered
One possible way of implementing this is to write a custom Dataset class whose __getitem__(i) runs the augmentation and the tokenizer every time (see the sketch below), though this would slow training down considerably since we wouldn't even run the tokenizer in batches.
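A minimal sketch of that alternative, assuming the class name AugmentedTextDataset and the augment_fn callable are hypothetical placeholders; the per-example tokenization is exactly what makes it slow:

import torch
from torch.utils.data import Dataset

class AugmentedTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, augment_fn):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.augment_fn = augment_fn

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, i):
        # augmentation and tokenization run on every access, one example at a time,
        # so each epoch sees a freshly augmented version of the example
        text = self.augment_fn(self.texts[i])
        enc = self.tokenizer(text, truncation=True, padding="max_length", return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item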