makcedward / nlpaug

Data augmentation for NLP
https://makcedward.github.io/
MIT License

WordEmbeddings Random Insertion selects a random word from the entire vocab rather than from the top-k closest words #250

Closed ARDivekar closed 2 years ago

ARDivekar commented 2 years ago

This is the line: https://github.com/makcedward/nlpaug/blob/master/nlpaug/augmenter/word/word_embs.py#L125

I think it should call self.model.predict() instead, but I am not sure.

makcedward commented 2 years ago

Yes, it should be top-k.

makcedward commented 2 years ago

After reviewing the whole process, words can only be drawn from the entire vocabulary. Although we could leverage CBOW to predict the target word (i.e. the newly inserted word), the pre-trained models lack the trained neural network's output layer, so we cannot use this approach. Here is the flow of the detailed implementation in gensim: