owos / afri_augs

Data Augmentation for Generative models

Create a script to perform the switch-out augmentation technique #5

Open Iambusayor opened 8 months ago

Iambusayor commented 8 months ago

Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT explains the switch-out method. The technique frames data augmentation as an optimization problem, and the idea is to randomly replace words in both the source and target sentences with other random words from their corresponding vocabularies. See the original paper, SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation, for more information. Here's a crude implementation: https://github.com/MaximeNe/SwitchOut
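
For anyone picking this up, here's a minimal sketch of the core function, adapted from the pseudocode in the SwitchOut paper's appendix and updated for current PyTorch. Treat it as a starting point under those assumptions, not a reference implementation:

import torch

def hamming_distance_sample(sents, tau, bos_id, eos_id, pad_id, vocab_size):
    # sents: LongTensor of token ids, shape [batch_size, seq_len].
    # Each row must contain at least one non-special token.
    # Special tokens (bos/eos/pad) are masked so they are never corrupted.
    mask = (sents == bos_id) | (sents == eos_id) | (sents == pad_id)
    lengths = (~mask).sum(dim=1).float()
    batch_size, seq_len = sents.size()
    # Sample how many tokens to corrupt per sentence: position i gets
    # logit -i * tau, so larger corruption counts are less likely.
    logits = -torch.arange(seq_len, dtype=torch.float)
    logits = logits.unsqueeze(0).expand(batch_size, -1).masked_fill(mask, float("-inf"))
    probs = torch.softmax(logits * tau, dim=1)
    num_words = torch.distributions.Categorical(probs).sample()
    # Pick which positions to corrupt with a Bernoulli of rate n / length.
    rate = (num_words.float() / lengths).unsqueeze(1).expand(batch_size, seq_len)
    rate = rate.masked_fill(mask, 0.0).clamp(0.0, 1.0)  # guard edge cases
    corrupt_pos = torch.bernoulli(rate).bool()
    total = int(corrupt_pos.sum())
    # Add a random offset in [1, vocab_size) modulo vocab_size, which
    # guarantees each corrupted token differs from the original one.
    corrupt_val = torch.randint(1, vocab_size, (total,))
    corrupts = torch.zeros_like(sents).masked_scatter_(corrupt_pos, corrupt_val)
    return (sents + corrupts).remainder(vocab_size)

The add-then-modulo trick comes from the paper's appendix: it avoids ever re-drawing the token that was already there.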

r-chinonyelum commented 8 months ago

I'd like to be assigned this

r-chinonyelum commented 8 months ago

Can I work on this task with someone, please?

r-chinonyelum commented 8 months ago

There's a sketch of the script in the original paper, and there's also the crude implementation linked above. How can I adapt these into what we need?

r-chinonyelum commented 8 months ago

I'm running into a data type mismatch.

Iambusayor commented 8 months ago

Can you post the portion of your code that produced the error, along with the traceback? Or have you solved it?

Iambusayor commented 8 months ago

> There's a sketch of the script in the original paper, and there's also the crude implementation linked above. How can I adapt these into what we need?

Understand what the paper explains, then see whether the crude implementation suffices or you have to modify it or write your own.

owos commented 8 months ago

Hi @lumnolar, so I took a look at this and here's how you could work with the switchout script.

from datasets import load_dataset
from transformers import AutoTokenizer
from switchout import hamming_distance_sample

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
tau = 0.2  # temperature controlling how aggressively tokens are replaced
bos_id = tokenizer.convert_tokens_to_ids(tokenizer.bos_token)
eos_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
pad_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
vocab_size = tokenizer.vocab_size
padding = "max_length"

# quick single-sentence sanity check
model_inputs1 = tokenizer(
    "I am a boy", padding=padding, truncation=True, return_tensors="pt"
)

model_inputs2 = hamming_distance_sample(
    model_inputs1["input_ids"], tau, bos_id, eos_id, pad_id, vocab_size
)
model_inputs2_to_text = tokenizer.batch_decode(model_inputs2, skip_special_tokens=True)
print(model_inputs2_to_text)

#### working with our dataset:
data = load_dataset("masakhane/mafand", "en-yor")

def apply_switchout(examples):
    source_lang = "en"  # for illustration only; this should not be hard-coded
    target_lang = "yor"  # for illustration only; this should not be hard-coded

    inputs = [ex[source_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, padding=padding, truncation=True, return_tensors="pt"
    )
    # corrupt each source sentence independently
    model_inputs["input_ids"] = [
        hamming_distance_sample(
            inp.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for inp in model_inputs["input_ids"]
    ]
    targets = [ex[target_lang] for ex in examples["translation"]]
    labels = tokenizer(targets, padding=padding, truncation=True, return_tensors="pt")
    # corrupt each target sentence independently
    labels["input_ids"] = [
        hamming_distance_sample(
            trgt.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for trgt in labels["input_ids"]
    ]
    # What you do from here depends on the model you're working with; for
    # this example I used mbart-50. If you pad to max_length and want the
    # loss to ignore padding, replace pad ids in the labels with -100
    # (gate this on something like data_args.ignore_pad_token_for_loss):
    ignore_pad_token_for_loss = False
    if padding == "max_length" and ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_data = data.map(apply_switchout, batched=True)

Please play with the code to make sure you understand it.

Now the next step would be to implement two types of switchout:

  1. In-language switchout: do replacement using only tokens from the language's own data
  2. Random switchout: do replacement using random tokens from the full vocabulary (this is what the code above does)

How to implement 1 (see the sketch after this list):

  1. Change the last parameter of switchout, i.e. vocab_size, to be a list rather than an int.
  2. Wherever the implementation uses vocab_size, have it sample and return a token id from that list instead.
  3. To create the list to pass into the hamming_distance_sample function, take the train split, tokenize it, convert the ids to a list, and remove duplicates.
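
Here's a minimal sketch of steps 1-3, assuming the same mbart-50 tokenizer and MAFAND en-yor split as above; the in-function changes are shown as comments since they go inside hamming_distance_sample:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
data = load_dataset("masakhane/mafand", "en-yor")

# step 3: build the deduplicated in-language id list from the train split
train_ids = set()
for ex in data["train"]["translation"]:
    train_ids.update(tokenizer(ex["en"])["input_ids"])
vocab_list = sorted(train_ids - set(tokenizer.all_special_ids))

# steps 1-2: inside hamming_distance_sample, accept vocab_list (a list)
# instead of vocab_size (an int), and draw replacements from it directly:
#     idx = torch.randint(0, len(vocab_list), (total_words,))
#     corrupt_val = torch.tensor(vocab_list)[idx]
#     sents = sents.masked_scatter(corrupt_pos, corrupt_val)
# Direct replacement can occasionally re-draw the original token; resample
# those positions if a guaranteed change matters.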

Please respond with questions about any section you don't understand.

We have also decided that you will be the one to carry out the ablation studies needed to find the best tau value, so please start early.
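
For the ablation, a simple sweep is enough to start; the candidate tau values below are purely illustrative, and train_and_evaluate is a hypothetical placeholder for whatever training/eval pipeline we end up with:

for tau in [0.05, 0.1, 0.2, 0.5, 1.0]:  # illustrative candidates
    # apply_switchout reads tau from the enclosing scope, so rebinding it
    # here changes how aggressively the mapped dataset is corrupted
    tokenized_data = data.map(
        apply_switchout, batched=True,
        load_from_cache_file=False,  # don't reuse a cached map from a previous tau
    )
    score = train_and_evaluate(tokenized_data)  # hypothetical train + dev-BLEU helper
    print(f"tau={tau}: dev BLEU = {score:.2f}")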

r-chinonyelum commented 8 months ago

Alright. Thank you very much. I'm on it

Onoyiza commented 8 months ago

Hi @lumnolar, can I work with you on this? Do you still need someone on this task?