Issue with Data Augmentation in LOVE Reproduction

tigerchen52 / LOVE

ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

MIT License

39 stars, 7 forks

Issue with Data Augmentation in LOVE Reproduction #7

Closed by jej127 1 year ago

jej127 commented 1 year ago

Hello! I am currently trying to reproduce the LOVE model, but I have encountered an issue with data augmentation.

Specifically, the paper mentions that one of the data augmentation strategies is to replace the original word with a synonym. However, I noticed that the 'data/synonym.txt' file does not cover the full 2M-word vocabulary, as I had expected.

Could you please provide the complete 'data/synonym.txt' file or, alternatively, share the code that can be used to generate this file? Thank you for your assistance!

tigerchen52 commented 1 year ago

Hi,

The synonym.txt sample file is not the final one used in our experiment. For your reference, you can use the code below to generate synonymous words:

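from itertools import chain

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)  # one-time download of the WordNet corpus

# Collect the lemma names of every synset that contains the word 'car'
synsets = wordnet.synsets('car')
lemmas = set(chain.from_iterable(synset.lemma_names() for synset in synsets))
print(lemmas)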

Best,

Lihu
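
For anyone who wants to extend the single-word lookup above to a whole vocabulary, the following is a minimal sketch and not the authors' actual script. The function names (`wordnet_synonyms`, `build_synonym_table`, `write_synonym_file`), the filtering rules, and the tab-separated output layout are all assumptions made for illustration; the real format of 'data/synonym.txt' may differ.

```python
from typing import Callable, Dict, Iterable, List


def wordnet_synonyms(word: str) -> List[str]:
    """Look up lemma names in WordNet (needs nltk and the 'wordnet' corpus)."""
    from nltk.corpus import wordnet  # imported lazily so the rest runs without nltk

    lemmas = set()
    for synset in wordnet.synsets(word):
        lemmas.update(synset.lemma_names())
    return sorted(lemmas)


def build_synonym_table(
    vocab: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Map each word to its single-token synonyms, excluding the word itself."""
    table: Dict[str, List[str]] = {}
    for word in vocab:
        candidates = set()
        for lemma in lookup(word):
            lemma = lemma.lower()
            # WordNet joins multi-word lemmas with '_' (e.g. 'railway_car').
            # Assumption: skip those and the word itself.
            if "_" in lemma or lemma == word.lower():
                continue
            candidates.add(lemma)
        if candidates:  # words without any synonym are left out
            table[word] = sorted(candidates)
    return table


def write_synonym_file(table: Dict[str, List[str]], path: str) -> None:
    """Write one line per word. Hypothetical layout: word<TAB>syn1 syn2 ..."""
    with open(path, "w", encoding="utf-8") as f:
        for word, synonyms in table.items():
            f.write(word + "\t" + " ".join(synonyms) + "\n")
```

With nltk installed and the WordNet corpus downloaded, `write_synonym_file(build_synonym_table(vocab, wordnet_synonyms), 'synonym.txt')` would produce the file for a list of words `vocab`. Passing the lookup as a parameter keeps the table-building logic independent of WordNet, so another synonym source can be swapped in.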

jej127 commented 1 year ago

Thanks, that's really helpful. I'll close the issue.