huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Integration with AugLy #2622

Closed. Darktex closed this issue 1 year ago.

Darktex commented 3 years ago

Is your feature request related to a problem? Please describe.

Facebook recently launched AugLy, a library with a unified API for augmentations across image, video, and text.

It would be pretty exciting to have it hooked up to HF libraries so that we can make NLP models robust to misspellings, punctuation changes, emojis, etc. Plus, with Transformers supporting more CV use cases, augmentation support becomes crucial.

Describe the solution you'd like

The biggest difference between augmentation and preprocessing is that preprocessing happens only once, while augmentation runs once per epoch. AugLy operates on the raw text, so this breaks the typical workflow where we run the tokenizer once, set the format to PyTorch tensors, and are ready for the DataLoader.
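Concretely, the one-time preprocessing workflow I have in mind looks roughly like this (a minimal sketch; the IMDB dataset and the bert-base-uncased checkpoint are just placeholders):

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
dset = load_dataset("imdb", split="train")

# Tokenize once, in batches, before training starts
dset = dset.map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"), batched=True)
dset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

dataloader = torch.utils.data.DataLoader(dset, batch_size=32)
# Every epoch now iterates over the same pre-tokenized tensors,
# which is exactly what per-epoch augmentation breaks.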

Describe alternatives you've considered

One possible way of implementing this is to write a custom Dataset class whose __getitem__(i) runs the augmentation and the tokenizer on every access, as sketched below. However, this would slow training down considerably, since we wouldn't even be running the tokenizer in batches.
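A rough sketch of what I mean (augment_fn is a stand-in for whichever AugLy text augmentation one picks; the per-example tokenization is exactly the slow part):

import torch

class AugmentedTextDataset(torch.utils.data.Dataset):
    """Wraps a datasets.Dataset and re-augments + re-tokenizes on every access."""

    def __init__(self, dset, tokenizer, augment_fn):
        self.dset = dset              # a datasets.Dataset with "text" and "label" columns
        self.tokenizer = tokenizer    # a transformers tokenizer
        self.augment_fn = augment_fn  # stand-in for an AugLy text augmentation (str -> str)

    def __len__(self):
        return len(self.dset)

    def __getitem__(self, i):
        example = self.dset[i]
        # Augmentation and tokenization run on every access, so each epoch
        # sees a different augmented view, but nothing is batched here.
        text = self.augment_fn(example["text"])
        encoding = self.tokenizer(text, truncation=True, padding="max_length")
        item = {k: torch.tensor(v) for k, v in encoding.items()}
        item["labels"] = torch.tensor(example["label"])
        return item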

mariosasko commented 3 years ago

Hi,

you can define your own custom formatting with Dataset.set_transform() and then run the tokenizer on the batches of augmented data as follows:

import torch
from datasets import load_dataset

# Assumes `tokenizer`, `model`, and `augly_text_augmentation` (a function that
# applies AugLy text augmentations to a batch of strings) are defined elsewhere.
dset = load_dataset("imdb", split="train")  # Let's say we are working with the IMDB dataset
# The transform is applied on-the-fly to the "text" column each time examples are accessed
dset.set_transform(lambda ex: {"text": augly_text_augmentation(ex["text"])}, columns=["text"], output_all_columns=True)
dataloader = torch.utils.data.DataLoader(dset, batch_size=32)
for epoch in range(5):
    for batch in dataloader:
        tokenizer_output = tokenizer(batch.pop("text"), padding=True, truncation=True, return_tensors="pt")
        batch.update(tokenizer_output)
        output = model(**batch)
        ...
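Since the transform is applied on-the-fly each time examples are accessed, the augmentation is re-sampled every epoch, and the tokenizer still runs on whole batches inside the training loop rather than per example.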
mariosasko commented 1 year ago

Preprocessing functions/augmentations, unless super generic, should be defined in separate libraries, so I'm closing this issue.