google-research / uda

Unsupervised Data Augmentation (UDA)
https://arxiv.org/abs/1904.12848
Apache License 2.0
2.17k stars 313 forks

Link for downloading the back translation code is not working #108

Open sgmoo opened 3 years ago

sgmoo commented 3 years ago

While trying to run back_translate/download.sh, I get the following error:

> bash download.sh

--2021-06-19 12:36:11--  https://storage.googleapis.com/uda_model/text/back_trans_checkpoints.zip 
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.8.16, 172.217.9.208, 172.217.12.240, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.8.16|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-06-19 12:36:11 ERROR 404: Not Found.
unzip:  cannot find or open back_trans_checkpoints.zip, back_trans_checkpoints.zip.zip or back_trans_checkpoints.zip.ZIP.

It seems that the storage.googleapis.com/uda_model bucket is no longer valid. Is there an alternate link I can use to download the back-translation checkpoints?

JosephElHachem commented 3 years ago

Hello, I am experiencing the same issue and I hope it will be resolved soon!

sebamenabar commented 2 years ago

Hi, I have the same problem. Has anybody managed to get the checkpoints?

YuandZhang commented 2 years ago

Same issue here. Have you solved the problem?

sebamenabar commented 2 years ago

Maybe this could be of help: I wrote a small script that does the back-translation with HuggingFace. I have not tested the quality of the generated data, whether it performs well with UDA, or how long it would take to translate the whole dataset, but the outputs look good on inspection. It works with transformers==4.4.2 and may need some modifications for newer versions.

import torch
from transformers import MarianMTModel, MarianTokenizer

torch.cuda.empty_cache()

# English -> French model and tokenizer
en_fr_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
en_fr_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr").cuda()

# French -> English model and tokenizer
fr_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
fr_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en").cuda()

src_text = [
    "Hi how are you?",
]

# English -> French, with sampling so the paraphrases vary
translated_tokens = en_fr_model.generate(
    **{k: v.cuda() for k, v in en_fr_tokenizer(src_text, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
    do_sample=True,
    top_k=10,
    temperature=2.0,
)
in_fr = [en_fr_tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

# French -> English, giving the back-translated (augmented) sentences
bt_tokens = fr_en_model.generate(
    **{k: v.cuda() for k, v in fr_en_tokenizer(in_fr, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
    do_sample=True,
    top_k=10,
    temperature=2.0,
)
in_en = [fr_en_tokenizer.decode(t, skip_special_tokens=True) for t in bt_tokens]
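
If you want to run this over a whole dataset, here is a rough, untested sketch of how I would batch it. It reuses the models and tokenizers defined above; the helper name backtranslate, the batch_size value, and the texts argument are just placeholders I made up, not part of the snippet above.

def backtranslate(texts, batch_size=32):
    # Hypothetical helper: wraps the two generate calls above so a list of
    # sentences can be processed in batches instead of one call per sentence.
    augmented = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # English -> French
        fr_tokens = en_fr_model.generate(
            **{k: v.cuda() for k, v in en_fr_tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
            do_sample=True, top_k=10, temperature=2.0,
        )
        fr_texts = [en_fr_tokenizer.decode(t, skip_special_tokens=True) for t in fr_tokens]
        # French -> English (the augmented sentences)
        en_tokens = fr_en_model.generate(
            **{k: v.cuda() for k, v in fr_en_tokenizer(fr_texts, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
            do_sample=True, top_k=10, temperature=2.0,
        )
        augmented.extend(fr_en_tokenizer.decode(t, skip_special_tokens=True) for t in en_tokens)
    return augmented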

For the arguments passed to generate, please refer to https://huggingface.co/blog/how-to-generate.
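
As an illustration (settings I have not tuned for UDA), a nucleus-sampling variant of the first generate call could look like the sketch below; top_p and num_return_sequences are standard generate arguments, and with num_return_sequences=3 you get 3 consecutive outputs per input sentence when decoding.

# Nucleus (top-p) sampling variant of the English -> French step,
# producing 3 sampled paraphrases per input sentence.
translated_tokens = en_fr_model.generate(
    **{k: v.cuda() for k, v in en_fr_tokenizer(src_text, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
    do_sample=True,
    top_p=0.9,               # sample from the smallest token set with cumulative prob >= 0.9
    temperature=0.8,         # lower temperature -> paraphrases stay closer to the input
    num_return_sequences=3,  # 3 samples per input sentence
)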

Example of input data and backtranslation:

Input: I lived in Tokyo for 7 months. Knowing the reality of long train commutes, bike rides from the train station, soup stands, and other typical scenes depicted so well, certainly added to my own appreciation for this film which I really, really liked. There are aspects of Japanese life in this film painted with vivid colors but you don't have to speak Japanese to enjoy this movie. Director Suo's tricks were subtle for the most part; I found his highlighting the character called Tamako Tamura with a soft filter, making her sublime, a tiny bit contrived but most of the directors tricks were so gentle that I was fully pulled in and just danced with his characters. Or cried. Or laughed aloud. Wonderful. A+.
---
Output: I lived in Tokyo for seven months. I know the reality of train rides, bike rides from the train station, soup stands, and other typical scenes shown so nicely, probably added to my own appreciation of this film I really, really loved. There are aspects of Japanese life in this film painted with vivid colors but you don't have to speak Japanese to enjoy this movie. The pieces of the director Suo have been subtle to most, I found that he highlights the character called Tamaki Tamura with a sweet filter, which makes her sublime, a bit confused but most of the movie-makers' tricks were so soft that I was completely shot in it and just dancing with his characters. Or wept. or laughed aloud. Wonderful. A+.
Liu-Jingyao commented 2 years ago

> Maybe this could be of help: I wrote a small script that does the back-translation with HuggingFace. [...]

Thanks! I'll try it as a substitute for the source code.