google-research / uda

Unsupervised Data Augmentation (UDA)
https://arxiv.org/abs/1904.12848
Apache License 2.0

Noisy data generated by back translation #33

Open zwjyyc opened 5 years ago

zwjyyc commented 5 years ago

Very interesting work and thanks for sharing the code!

I am very interested in translation-based augmentation. I generated some examples by running run.sh, but found several noisy ones, listed below:

(1) In forward generation, the input "could i get the address , phone number , and postcode of yu garden ?" gives the output "The hotel is small location, the location is ideal and the food is fantastic."

(2) In forward generation, the input "hi , i 'm looking for a nice german restaurant ." gives the output "I was at listening to my room and we were even coming in the main area from 9 weeks. I also liked this hotel, this is a great boutique hotel."

(3) In forward generation, the input "i do n't care ." gives the output "Sinon pour la plupart, je ne pense pas qu'il y ait un tel problème qui se pose à vous. Je n'ai pas l'intention de le faire."

Do you have any suggestions to avoid these errors?

Thanks!

michaelpulsewidth commented 5 years ago

You can lower the sampling temperature; please refer to the README for more information. As a sanity check, set the temperature to 0: the model should then generate perfectly valid but identical paraphrases.
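
To illustrate why lowering the temperature reduces noise, here is a minimal, self-contained sketch of temperature sampling over a toy set of logits. It is not the repository's decoding code: `temperature_softmax`, `sample_token`, and `toy_logits` are hypothetical names made up for this example, and treating temperature 0 as a plain argmax is an assumption about how greedy decoding is usually defined.

```python
import math
import random


def temperature_softmax(logits, temperature):
    """Return sampling probabilities for `logits` at the given temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding, so all
        # probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.1]

if __name__ == "__main__":
    rng = random.Random(0)
    for t in (1.0, 0.5, 0.0):
        probs = [round(p, 3) for p in temperature_softmax(toy_logits, t)]
        draws = [sample_token(toy_logits, t, rng) for _ in range(10)]
        print(f"temperature={t}: probs={probs} draws={draws}")
```

As the temperature drops, more probability moves to the top-scoring token, so unlikely (and often off-topic) continuations are sampled less frequently. At temperature 0 every draw returns the same token, which matches the sanity check above: the output is stable but has no diversity.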