makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License

Queries regarding Contextual Word Embeddings Augmenter [BERT, etc.] #219

Closed · katreparitosh closed this issue 2 years ago

katreparitosh commented 3 years ago

Hi Edward,

First of all, it's a great piece of work created and open-sourced by you! Thanks a lot.

While using contextual word embeddings (say BERT or DistilBERT), when I pass just one word and select action = "insert", it adds a word before or after depending on the context.

[screenshot: output of the insert action]
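This is roughly the call I am making — a minimal sketch, with the model name and input text only as examples:

```python
import nlpaug.augmenter.word as naw

# Contextual word embeddings augmenter with the insert action:
# a position is picked and a token predicted by the masked language model is inserted.
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert')
print(aug.augment('beautiful'))
```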

When I choose action = "substitute" with n = 3:

[screenshot: output of the substitute action with n = 3]
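And the substitute call, again only a sketch (here n asks for three augmented variants of the same input):

```python
import nlpaug.augmenter.word as naw

# Same augmenter, but existing tokens are replaced instead of new ones being inserted.
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute')

# n=3 requests three augmented outputs for one input.
print(aug.augment('beautiful day', n=3))
```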

Q1. Could you help me understand why this method returns the same output n times for most uni-/bi-grams?

Q2. The PPDB outputs look like ['pretty', 'wonderful', 'lovely'] for the word "beautiful". How do we achieve similar functionality through contextualized word embeddings? Why do the outputs repeat themselves for uni-/bi-grams?
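For comparison, the PPDB-style synonym lookup I mean is along these lines — only a sketch, and the model_path is a placeholder for a PPDB file downloaded separately:

```python
import nlpaug.augmenter.word as naw

# Dictionary-based synonym augmenter backed by PPDB.
# '/path/to/ppdb-2.0-s-all' is a placeholder; the PPDB file must be obtained separately.
aug = naw.SynonymAug(aug_src='ppdb', model_path='/path/to/ppdb-2.0-s-all')
print(aug.augment('beautiful'))
```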

It would be great if you could advise on the above or point me in the right direction.

Regards, Paritosh

makcedward commented 3 years ago

Q1: n = 3 means it will generate 3 outputs from 1 input.

Q2: For contextualized word embeddings, I am using a masked language model (MLM). In short, random tokens are picked and replaced one by one. For example:

- Time 0 (input): "it's a great piece of work created and open-sourced by you"
- Time 1 (first replacement): mask "created" to get "it's a great piece of work [MASK] and open-sourced by you", then generate "it's a great piece of work initialized and open-sourced by you"
- Time 2 (second replacement): mask "you" to get "it's a great piece of work initialized and open-sourced by [MASK]", then generate "it's a great piece of work initialized and open-sourced by us"

and so on...
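To illustrate only the masking step, here is a rough sketch using the Hugging Face fill-mask pipeline. This is an illustration of the MLM idea under the assumptions above, not nlpaug's internal code:

```python
from transformers import pipeline

# Fill-mask pipeline backed by a BERT masked language model.
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# Mask one token and let the model propose contextual replacements.
masked = "it's a great piece of work [MASK] and open-sourced by you"
for candidate in fill_mask(masked, top_k=3):
    print(candidate['token_str'], round(candidate['score'], 3))
```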