iejMac / encoder-distill

Align embedding spaces of PyTorch encoders with common input types.
MIT License

Alignment might be easier with word sense disambiguated and/or averaged vectors #19

Open Thomas-MMJ opened 2 years ago


It might be easier to do alignment by adding extra anchor vectors that are word-sense disambiguated, as well as averaged vectors for noisy concepts.

For instance, the large variety of female and male name vectors is likely extremely noisy, but their average is likely fairly similar in both vector spaces. During alignment we could therefore replace them with the average (afterwards, the original word vectors can be rotated into the new, aligned space).
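As a rough sketch of this idea (all data synthetic; the cluster setup and `procrustes_rotation` helper are illustrative assumptions, not part of this repo): fit an orthogonal map on per-cluster averages, which are far less noisy than the individual vectors, then rotate the original vectors with it.

```python
import numpy as np

def procrustes_rotation(src, tgt):
    """Orthogonal R minimizing ||src @ R - tgt||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
d, n_clusters, per_cluster = 8, 20, 30

# ground-truth map between the two hypothetical embedding spaces
true_rot = np.linalg.qr(rng.normal(size=(d, d)))[0]

# each cluster stands in for one noisy concept (e.g. "female names")
centers = 3.0 * rng.normal(size=(n_clusters, d))
space_a = np.concatenate(
    [c + rng.normal(size=(per_cluster, d)) for c in centers]
)
space_b = space_a @ true_rot + rng.normal(size=space_a.shape)

# fit the map on per-cluster averages rather than the raw noisy vectors
cent_a = space_a.reshape(n_clusters, per_cluster, d).mean(axis=1)
cent_b = space_b.reshape(n_clusters, per_cluster, d).mean(axis=1)
R = procrustes_rotation(cent_a, cent_b)

# then rotate the original word vectors into the aligned space
aligned_a = space_a @ R
```

With the averages standing in for the noisy members, the recovered `R` comes out close to the true map, and the original per-name vectors get rotated along for free.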

Also, vectors for words with multiple senses will weight each sense differently in different embeddings: "bank" might lean toward "river bank" in one embedding and toward "financial institution" in another. By creating additional word-sense-disambiguated tokens, we can ensure the underlying concepts align more closely.
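A toy illustration of why this helps (the sense vectors here are random stand-ins; in practice they might come from averaging contextual embeddings over sense-labelled occurrences): a single "bank" token that mixes the senses with different weights in each model aligns worse than a sense-specific anchor.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
d = 16
R = np.linalg.qr(rng.normal(size=(d, d)))[0]  # true map between the spaces

# stand-ins for the two senses of "bank"
river, finance = rng.normal(size=d), rng.normal(size=d)

# each model folds both senses into one token, with different weights
bank_a = 0.8 * river + 0.2 * finance        # model A leans "river bank"
bank_b = (0.2 * river + 0.8 * finance) @ R  # model B leans "financial"

# a sense-disambiguated anchor for the "river" sense in each space
river_a, river_b = river, river @ R

# under the true map, the sense anchor agrees; the mixed token does not
sense_sim = cos(river_a @ R, river_b)
mixed_sim = cos(bank_a @ R, bank_b)
```

The sense anchor lands exactly on its counterpart, while the mixed token is pulled in different directions by the two models' sense weightings, so using the disambiguated anchors as alignment targets should give a cleaner fit.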

Lastly, we can try to remove concept pollution from some tokens. Many tokens are polluted with meme-related associations, and this may also be making alignment difficult, since one embedding might carry more pollution than the other. Perhaps compute cosine similarity against a list of common concepts (age, gender, meme, etc.), and for vectors whose cosine similarities differ substantially between the two models' embeddings, avoid using them for the alignment.
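A sketch of that filtering heuristic on synthetic data (the concept probes, the 0.2 threshold, and the helper names are all illustrative assumptions): compare each token's cosine-similarity profile against the concept list in both spaces, and keep only tokens whose profiles agree.

```python
import numpy as np

def concept_profile(vecs, concepts):
    """Cosine similarity of every token vector to every concept vector."""
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    c = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    return v @ c.T

def stable_token_mask(emb_a, emb_b, concepts_a, concepts_b, thresh=0.2):
    """Keep tokens whose concept-similarity profile agrees across models."""
    diff = np.abs(
        concept_profile(emb_a, concepts_a) - concept_profile(emb_b, concepts_b)
    )
    return diff.max(axis=1) < thresh

rng = np.random.default_rng(2)
d, n_tokens, n_concepts = 8, 100, 6
R = np.linalg.qr(rng.normal(size=(d, d)))[0]

emb_a = rng.normal(size=(n_tokens, d))
concepts_a = rng.normal(size=(n_concepts, d))  # stand-ins for age/gender/meme probes
emb_b = emb_a @ R                              # model B: same geometry, rotated
concepts_b = concepts_a @ R

# pollute the first 10 tokens in model B only (e.g. meme associations)
emb_b[:10] += 5.0 * rng.normal(size=(10, d))

mask = stable_token_mask(emb_a, emb_b, concepts_a, concepts_b)
```

Since rotation preserves cosine similarities, unpolluted tokens have identical profiles in both spaces and survive the filter, while most of the polluted tokens show a large profile mismatch and get excluded from alignment.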