sgmoo opened 3 years ago
Hello, I am experiencing the same issue and I hope it will be resolved soon!
Hi, I have the same problem. Has anybody managed to get the checkpoints?
Same issue. Have you solved that problem?
Maybe this could be of help: I wrote a small script that does the back-translations with HuggingFace. I have not tested the quality of the generated data, whether it performs well with UDA, or how long it would take to translate a whole dataset, but visually the results seem good. It works with transformers==4.4.2
and may require some modifications on newer versions.
import torch
from transformers import MarianMTModel, MarianTokenizer

torch.cuda.empty_cache()

# en -> fr model and tokenizer
en_fr_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
en_fr_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr").cuda()

# fr -> en model and tokenizer for the return trip
fr_en_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
fr_en_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en").cuda()

src_text = [
    "Hi how are you?",
]

# English -> French; sampling (do_sample, top_k, temperature) diversifies the paraphrases
translated_tokens = en_fr_model.generate(
    **{k: v.cuda() for k, v in en_fr_tokenizer(src_text, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
    do_sample=True,
    top_k=10,
    temperature=2.0,
)
in_fr = [en_fr_tokenizer.decode(t, skip_special_tokens=True) for t in translated_tokens]

# French -> English: the back-translated paraphrases of the original input
bt_tokens = fr_en_model.generate(
    **{k: v.cuda() for k, v in fr_en_tokenizer(in_fr, return_tensors="pt", padding=True, truncation=True, max_length=512).items()},
    do_sample=True,
    top_k=10,
    temperature=2.0,
)
in_en = [fr_en_tokenizer.decode(t, skip_special_tokens=True) for t in bt_tokens]
For the arguments passed to generate, please refer to https://huggingface.co/blog/how-to-generate.
Example of input data and backtranslation:
Input: I lived in Tokyo for 7 months. Knowing the reality of long train commutes, bike rides from the train station, soup stands, and other typical scenes depicted so well, certainly added to my own appreciation for this film which I really, really liked. There are aspects of Japanese life in this film painted with vivid colors but you don't have to speak Japanese to enjoy this movie. Director Suo's tricks were subtle for the most part; I found his highlighting the character called Tamako Tamura with a soft filter, making her sublime, a tiny bit contrived but most of the directors tricks were so gentle that I was fully pulled in and just danced with his characters. Or cried. Or laughed aloud. Wonderful. A+.
---
Output: I lived in Tokyo for seven months. I know the reality of train rides, bike rides from the train station, soup stands, and other typical scenes shown so nicely, probably added to my own appreciation of this film I really, really loved. There are aspects of Japanese life in this film painted with vivid colors but you don't have to speak Japanese to enjoy this movie. The pieces of the director Suo have been subtle to most, I found that he highlights the character called Tamaki Tamura with a sweet filter, which makes her sublime, a bit confused but most of the movie-makers' tricks were so soft that I was completely shot in it and just dancing with his characters. Or wept. or laughed aloud. Wonderful. A+.
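Since I haven't measured how long a whole dataset would take, one obvious speedup is to feed sentences through generate in batches rather than one at a time. Here is a minimal, model-agnostic sketch of that idea (the helper names batches and backtranslate_all are my own, not part of any library); translate_fn would wrap the en->fr->en round trip above.

```python
from typing import Callable, Iterator, List


def batches(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches from a list of sentences."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def backtranslate_all(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Apply a batch-wise translate function (e.g. the round trip above) to a whole dataset."""
    out: List[str] = []
    for batch in batches(texts, batch_size):
        out.extend(translate_fn(batch))
    return out
```

Choosing batch_size is a trade-off between GPU memory (padding to the longest sentence in each batch) and throughput; sorting sentences by length before batching can reduce wasted padding.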
Thanks! I'll try it as a substitute for the source code.
While trying to run back_translate/download.sh, I get the following error:
It seems that the storage.googleapis.com/uda_model bucket is no longer valid. Is there an alternate link I can use to download the back_translate code?