Iambusayor opened 8 months ago
I'd like to be assigned this
Can I work on this task with someone, please?
There's a sketch of the script in the original paper, and there's also the crude implementation linked above. How can I align it with what we need?
I am running into a data type mismatch.
Can you post the portion of your code that produced the error, along with the traceback, or have you solved it?
> There's a sketch of the script in the original paper, and there's also the crude implementation linked above. How can I align it with what we need?
Understand what the paper explains, then see whether the crude implementation suffices or you have to modify it or write your own.
Hi @lumnolar, I took a look at this, and here's how you could work with the SwitchOut script:
```python
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from switchout import hamming_distance_sample

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")

tau = 0.2  # SwitchOut temperature: higher values corrupt more tokens
bos_id = tokenizer.convert_tokens_to_ids(tokenizer.bos_token)
eos_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
pad_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
vocab_size = tokenizer.vocab_size
padding = "max_length"

model_inputs1 = tokenizer(
    "I am a boy", padding="max_length", truncation=True, return_tensors="pt"
)
# Corrupt the token ids, then decode to inspect the augmented sentence
model_inputs2 = hamming_distance_sample(
    model_inputs1["input_ids"], tau, bos_id, eos_id, pad_id, tokenizer.vocab_size
)
model_inputs2_to_text = tokenizer.batch_decode(model_inputs2, skip_special_tokens=True)
print(model_inputs2_to_text)
```
#### Working with our dataset
```python
data = load_dataset("masakhane/mafand", "en-yor")


def apply_switchout(examples):
    source_lang = "en"  # this should not be hard-coded
    target_lang = "yor"  # this should not be hard-coded

    inputs = [ex[source_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, padding=padding, truncation=True, return_tensors="pt"
    )
    # Apply SwitchOut to each source sentence independently
    model_inputs["input_ids"] = [
        hamming_distance_sample(
            inp.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for inp in model_inputs["input_ids"]
    ]

    # SwitchOut also corrupts the target side
    targets = [ex[target_lang] for ex in examples["translation"]]
    labels = tokenizer(targets, padding=padding, truncation=True, return_tensors="pt")
    labels["input_ids"] = [
        hamming_distance_sample(
            trgt.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for trgt in labels["input_ids"]
    ]

    # What you do from here depends on the model you are working with;
    # for this example I used mBART-50. Replace `False` with
    # data_args.ignore_pad_token_for_loss if the loss should skip padding.
    if padding == "max_length" and False:  # data_args.ignore_pad_token_for_loss
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_data = data.map(apply_switchout, batched=True)
```
Please play with the code to make sure you understand it.
Now the next step would be to implement two types of SwitchOut:
How to implement 1:
Please respond with questions for any section that you do not understand.
We have also decided that you will be the one to carry out the ablation studies needed to find the best tau value, so please start early.
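The ablation itself can be a simple grid search: train once per candidate tau and keep the one with the best dev score. A minimal sketch, where `train_and_eval` is a hypothetical stand-in for whatever pipeline produces a dev BLEU for a given tau:

```python
def sweep_tau(train_and_eval, taus=(0.05, 0.1, 0.2, 0.5, 1.0)):
    """Run the ablation over candidate tau values.

    train_and_eval: callable mapping a tau value to a dev-set score
    (e.g. BLEU); higher is better. Returns the best tau and all scores.
    """
    scores = {tau: train_and_eval(tau) for tau in taus}
    best_tau = max(scores, key=scores.get)
    return best_tau, scores
```

The expensive part is of course `train_and_eval`; the wrapper just makes sure every tau is scored on the same dev set.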
Alright, thank you very much. I'm on it.
Hi @lumnolar, can I work with you on this? Do you still need someone on this task?
*Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT* explains the switch-out method. This technique views data augmentation as an optimization problem, with the idea of randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies. See the original paper, *SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation*, for more information. Here's a crude implementation: https://github.com/MaximeNe/SwitchOut
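To make the method concrete, the core sampling step described in the paper can be sketched in pure Python as below. This `hamming_distance_sample` is my own illustrative version operating on lists of token ids; the linked repo's signature and tensor handling may differ:

```python
import math
import random

def hamming_distance_sample(token_ids, tau, bos_id, eos_id, pad_id, vocab_size):
    """SwitchOut sketch: corrupt each sentence at a sampled Hamming distance.

    For each sentence, the number of replaced tokens n is drawn from
    p(n) proportional to exp(-n / tau) * C(L, n), where L counts the
    non-special tokens; those n positions are then overwritten with
    uniformly random vocabulary ids. BOS/EOS/PAD are never touched.
    """
    specials = {bos_id, eos_id, pad_id}
    out = []
    for sent in token_ids:
        positions = [i for i, t in enumerate(sent) if t not in specials]
        L = len(positions)
        # Unnormalized probability of replacing exactly n tokens
        weights = [math.exp(-n / tau) * math.comb(L, n) for n in range(L + 1)]
        n = random.choices(range(L + 1), weights=weights)[0]
        new_sent = list(sent)
        for i in random.sample(positions, n):
            new_sent[i] = random.randrange(vocab_size)
        out.append(new_sent)
    return out
```

Small tau makes large Hamming distances exponentially unlikely, which is why the ablation over tau matters: it directly controls how noisy the augmented pairs are.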