koaning / whatlies

Toolkit to help understand "what lies" in word embeddings. Also benchmarking!
https://koaning.github.io/whatlies/
Apache License 2.0
469 stars · 50 forks

Language API design and implementation #203

Closed — mkaze closed this issue 2 years ago

mkaze commented 4 years ago

It's good to have a dedicated place for discussing ideas about the language API design, as well as potential improvement/refactoring options in its implementation. Specifically, we want to find out what the drawbacks of the current API design (and/or implementation) are, and how they could be addressed.

Note that if an idea is big enough to merit its own discussion, or we decide to implement a particular feature, feel free to create a dedicated issue for it so that we can track things easily.

mkaze commented 4 years ago

As an example, we might want to make the internal design more structured by having a parent Language class from which all the language classes inherit. Besides giving structure to the design, this could drive some refactoring of the language classes and prevent code duplication.
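A minimal sketch of what such a parent class could look like. All names here are hypothetical and for illustration only; they are not the actual whatlies API.

```python
from abc import ABC, abstractmethod


class Language(ABC):
    """Hypothetical parent class for all language backends."""

    @abstractmethod
    def __getitem__(self, query):
        """Return an embedding-like object for the given query string."""
        ...


class ToyLanguage(Language):
    """Illustrative subclass: the only thing it must implement is __getitem__."""

    def __getitem__(self, query):
        # Placeholder vector; a real backend would look up or compute one.
        return {"name": query, "vector": [0.0, 1.0]}


lang = ToyLanguage()
print(lang["river bank"]["name"])  # river bank
```

The design choice here is that subclasses only implement `__getitem__`, while any shared behavior lives on the base class.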

koaning commented 4 years ago

It would be nice if the only thing you'd have to implement for a language class is __getitem__. There is one main thing to consider design-wise here: not all backends have a vocabulary. Most BERT-style embeddings do not offer one, which means that .retrieve_similar() will only work on a subset of the Language backends. To accommodate this we can do two things:

  1. Create a Mixin class with these features and only attach this to a subset of the language classes. We already do this to give each language model scikit-learn support.
  2. Raise an appropriate error when using a language backend that does not support a vocabulary. The downside is that this is a property that must be manually set on each language. The main culprit here will be spaCy because I imagine that they will end up supporting both.
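Option 1 could look roughly like the sketch below: a mixin that adds vocabulary-based similarity only to backends that actually have a vocabulary. The `VocabMixin` name, the `vocab` attribute, and the toy vectors are all assumptions for illustration, not the real whatlies implementation.

```python
import math


class VocabMixin:
    """Hypothetical mixin: adds similarity lookup to vocab-backed languages.

    Assumes the class it is mixed into exposes a `vocab` mapping of
    token -> vector. BERT-style backends simply would not get this mixin.
    """

    def retrieve_similar(self, query, n=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))

        target = self.vocab[query]
        # Rank every token in the vocabulary by similarity to the query.
        ranked = sorted(self.vocab, key=lambda t: -cosine(self.vocab[t], target))
        return ranked[:n]


class ToyVocabLanguage(VocabMixin):
    vocab = {
        "king": [1.0, 0.9],
        "queen": [0.9, 1.0],
        "apple": [-1.0, 0.1],
    }


lang = ToyVocabLanguage()
print(lang.retrieve_similar("king", n=2))  # ['king', 'queen']
```

Backends without a vocabulary would simply not inherit the mixin, so calling `retrieve_similar` on them fails with a plain `AttributeError` rather than needing a manually set property.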

mkaze commented 4 years ago

Well, since embeddings in general can be either context-free or contextualized, I think there is a one-to-one correspondence between the type of embedding and the models (i.e. languages).

Now, besides the Language base class, which would provide the common functionality of language classes, we might have mixin classes like ContextFreeLanguage (for the first type) and ContextualizedLanguage (for the second type) to provide the features/methods specific to each kind of language.

That said, there may be obstacles to implementing such an abstraction for some of the models that I cannot foresee right now.

koaning commented 4 years ago

I'm wondering: what is actually common between all of our language backends? To me it seems like they should all implement __getitem__, but this method is very different in each language. There's some internal logic that we can share, such as how to deal with [bank] of the river versus bank of the river, but other than that it seems like there's not a whole lot of common ground.
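That shared bracket-handling logic could be factored into a small helper along these lines. The function name and return convention are assumptions for illustration, not the actual whatlies code.

```python
import re


def parse_context_query(query):
    """Hypothetical helper: split a query like '[bank] of the river'.

    Returns the plain sentence plus the focus token (or None if no
    brackets were used, i.e. the whole sentence is the query).
    """
    match = re.search(r"\[(.+?)\]", query)
    if match is None:
        return query, None
    sentence = query.replace("[", "").replace("]", "")
    return sentence, match.group(1)


print(parse_context_query("[bank] of the river"))  # ('bank of the river', 'bank')
print(parse_context_query("bank of the river"))    # ('bank of the river', None)
```

Each contextualized backend could then call this once and only implement the embedding lookup itself.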

The ContextFreeLanguage makes a lot of sense! I'm wondering whether it makes more sense to call it ContextFreeMixin or VocabLanguageMixin. I might start writing this, mainly because there's just too much repetition in the score_similar and embset_similar methods.

koaning commented 2 years ago

I'm closing issues because, ever since the project moved to my personal account, it has been more in maintenance mode than in "active work" mode.