makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License

BackTranslationAug(): ValueError: too many values to unpack (expected 2) #271

Open mskim94 opened 2 years ago

mskim94 commented 2 years ago

When I input the following code:

import nlpaug.augmenter.word as naw

text = 'The quick brown fox jumps over the lazy dog .'

aug = naw.BackTranslationAug(
    from_model_name='facebook/wmt19.en-de', 
    to_model_name='facebook/wmt19.de-en')

aug.augment(text)

I got the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-939f79f24f2f> in <module>
----> 1 aug.augment(text)

~/anaconda3/envs/augmenter/lib/python3.6/site-packages/nlpaug/base_augmenter.py in augment(self, data, n, num_thread)
     93             elif self.__class__.__name__ in ['AbstSummAug', 'BackTranslationAug', 'ContextualWordEmbsAug', 'ContextualWordEmbsForSentenceAug']:
     94                 for _ in range(aug_num):
---> 95                     result = action_fx(clean_data)
     96                     if isinstance(result, list):
     97                         augmented_results.extend(result)

~/anaconda3/envs/augmenter/lib/python3.6/site-packages/nlpaug/augmenter/word/back_translation.py in substitute(self, data, n)
     69             return data
     70 
---> 71         augmented_text = self.model.predict(data)
     72         return augmented_text
     73 

~/anaconda3/envs/augmenter/lib/python3.6/site-packages/nlpaug/model/lang_models/machine_translation_transformers.py in predict(self, texts, target_words, n)
     38 
     39     def predict(self, texts, target_words=None, n=1):
---> 40         src_translated_texts = self.translate_one_step_batched(texts, self.src_tokenizer, self.src_model)
     41         tgt_translated_texts = self.translate_one_step_batched(src_translated_texts, self.tgt_tokenizer, self.tgt_model)
     42 

~/anaconda3/envs/augmenter/lib/python3.6/site-packages/nlpaug/model/lang_models/machine_translation_transformers.py in translate_one_step_batched(self, data, tokenizer, model)
     58             for batch in tokenized_dataloader:
     59                 batch = tuple(t.to(self.device) for t in batch)
---> 60                 input_ids, attention_mask = batch
     61 
     62                 translated_ids_batch = model.generate(

ValueError: too many values to unpack (expected 2)

How can I solve this problem?

makcedward commented 2 years ago

Which nlpaug version and transformers version are you using? My transformers version is 4.16.2.

KimJaehee0725 commented 1 year ago

This is caused by recent transformers tokenizers often returning more than two outputs (e.g. input_ids, token_type_ids, attention_mask), so the positional unpacking input_ids, attention_mask = batch fails.
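
The failure can be reproduced without transformers at all; it is plain tuple unpacking, sketched here with strings standing in for the three tensors:

```python
# A batch of three items (input_ids, token_type_ids, attention_mask)
# cannot be unpacked into two names, which is exactly the error above.
batch = ("input_ids", "token_type_ids", "attention_mask")
try:
    input_ids, attention_mask = batch
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)
```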

How about updating it? @mskim94 In the meantime, you can patch the method translate_one_step_batched yourself like this:

import types 
from torch.utils import data as t_data
import torch
def translate_one_step_batched(
        self, data, tokenizer, model
):
    tokenized_texts = tokenizer(data, padding=True, truncation=True, return_tensors='pt')
    tokenized_dataset = t_data.TensorDataset(*(tokenized_texts.values()))        
    tokenized_dataloader = t_data.DataLoader(
        tokenized_dataset,
        batch_size=self.batch_size,
        shuffle=False,
        num_workers=1
    )

    all_translated_ids = []
    with torch.no_grad():
        for batch in tokenized_dataloader:
            batch = tuple(t.to(self.device) for t in batch)
            input_ids = batch[0]
            attention_mask = batch[2]

            translated_ids_batch = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=self.max_length
            )

            all_translated_ids.append(
                translated_ids_batch.detach().cpu().numpy()
            )

    all_translated_texts = []
    for translated_ids_batch in all_translated_ids:
        translated_texts = tokenizer.batch_decode(
            translated_ids_batch,
            skip_special_tokens=True
        )
        all_translated_texts.extend(translated_texts)

    return all_translated_texts

# patch the bound method on the augmenter's model
# (here `aug` is the BackTranslationAug instance from the snippet above)
aug.model.translate_one_step_batched = types.MethodType(translate_one_step_batched, aug.model)

You have to make sure that the tokenizer you use returns its outputs in the order (input_ids, *something, attention_mask), since the patch indexes the batch positionally.
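
A more defensive variant avoids positional indexing altogether: keep the tokenizer output as a dict and pick tensors by key, so extra fields such as token_type_ids cannot shift the ordering. A minimal sketch (pick_inputs is a hypothetical helper, shown here with a plain dict standing in for a BatchEncoding):

```python
def pick_inputs(encoded):
    """Select model inputs by key from a dict-like tokenizer output."""
    return encoded["input_ids"], encoded["attention_mask"]

# Works no matter how many extra keys the tokenizer adds:
encoded = {
    "input_ids": [[1, 2, 3]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}
input_ids, attention_mask = pick_inputs(encoded)
```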

Hhx1999 commented 1 year ago

fix something:

translated_ids_batch = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=self.max_length
)