koaning / whatlies

Toolkit to help understand "what lies" in word embeddings. Also benchmarking!
https://koaning.github.io/whatlies/
Apache License 2.0
469 stars · 50 forks

Language API design and implementation #203

Closed — mkaze closed this issue 2 years ago

mkaze commented 4 years ago

It's good to have a dedicated place for discussing ideas about the language API design, as well as potential improvement/refactoring options in its implementation. Specifically, we want to find out what the drawbacks of the current API design (and/or implementation) are, and how they could be addressed.

Note that if an idea is big enough to merit its own discussion, or we decide to implement a particular feature, feel free to create a dedicated issue for it so that we can track things easily.

mkaze commented 4 years ago

As an example, we might want to make the internal design more structured by having a parent Language class from which all the language classes inherit. Besides giving structure to the design, this could drive some refactoring of the language classes and prevent code duplication.
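A minimal sketch of what such a parent class could look like. All names here are hypothetical and for illustration only; they are not the actual whatlies API.

```python
from abc import ABC, abstractmethod


class Language(ABC):
    """Hypothetical parent class for all language backends."""

    @abstractmethod
    def __getitem__(self, query):
        """Return an embedding-like object for the given query string."""
        ...


class ToyLanguage(Language):
    """Illustrative subclass: the only thing it must implement is __getitem__."""

    def __getitem__(self, query):
        # Placeholder vector; a real backend would look up or compute one.
        return {"name": query, "vector": [0.0, 1.0]}


lang = ToyLanguage()
print(lang["river bank"]["name"])  # river bank
```

The design choice here is that subclasses only implement `__getitem__`, while any shared behavior lives on the base class.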

koaning commented 4 years ago

It would be nice if the only thing you'd have to implement for a language class is __getitem__. There is one main thing to consider design-wise here: not all backends have a vocabulary. Most BERT-style embeddings do not offer one, which means that .retrieve_similar() will only work on a subset of the Language backends. To accommodate this we can do two things:

  1. Create a Mixin class with these features and only attach this to a subset of the language classes. We already do this to give each language model scikit-learn support.
  2. Raise an appropriate error when using a language backend that does not support a vocabulary. The downside is that this is a property that must be manually set on each language. The main culprit here will be spaCy because I imagine that they will end up supporting both.
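Option 1 could look roughly like the sketch below: a mixin that adds vocabulary-based similarity only to backends that actually have a vocabulary. The `VocabMixin` name, the `vocab` attribute, and the toy vectors are all assumptions for illustration, not the real whatlies implementation.

```python
import math


class VocabMixin:
    """Hypothetical mixin: adds similarity lookup to vocab-backed languages.

    Assumes the class it is mixed into exposes a `vocab` mapping of
    token -> vector. BERT-style backends simply would not get this mixin.
    """

    def retrieve_similar(self, query, n=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))

        target = self.vocab[query]
        # Rank every token in the vocabulary by similarity to the query.
        ranked = sorted(self.vocab, key=lambda t: -cosine(self.vocab[t], target))
        return ranked[:n]


class ToyVocabLanguage(VocabMixin):
    vocab = {
        "king": [1.0, 0.9],
        "queen": [0.9, 1.0],
        "apple": [-1.0, 0.1],
    }


lang = ToyVocabLanguage()
print(lang.retrieve_similar("king", n=2))  # ['king', 'queen']
```

Backends without a vocabulary would simply not inherit the mixin, so calling `retrieve_similar` on them fails with a plain `AttributeError` rather than needing a manually set property.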

mkaze commented 4 years ago

Well, since embeddings in general can be either context-free or contextualized, I think there is a one-to-one correspondence between the type of embedding and the models (i.e. languages).

Now, besides the Language base class, which would provide the common functionality of language classes, we might have mixin classes like ContextFreeLanguage (for the first type) and ContextualizedLanguage (for the second type) to provide the features/methods specific to each kind of language.

That said, there may be obstacles to implementing such an abstraction for some of the models that I cannot foresee right now.

koaning commented 4 years ago

I'm wondering: what is actually common between all of our language backends? To me it seems like they should all implement __getitem__, but this method is very different in each language. There's some internal logic that we can share, such as how to deal with [bank] of the river versus bank of the river, but other than that it seems like there's not a whole lot of common ground.
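That shared bracket-handling logic could be factored into a small helper along these lines. The function name and return convention are assumptions for illustration, not the actual whatlies code.

```python
import re


def parse_context_query(query):
    """Hypothetical helper: split a query like '[bank] of the river'.

    Returns the plain sentence plus the focus token (or None if no
    brackets were used, i.e. the whole sentence is the query).
    """
    match = re.search(r"\[(.+?)\]", query)
    if match is None:
        return query, None
    sentence = query.replace("[", "").replace("]", "")
    return sentence, match.group(1)


print(parse_context_query("[bank] of the river"))  # ('bank of the river', 'bank')
print(parse_context_query("bank of the river"))    # ('bank of the river', None)
```

Each contextualized backend could then call this once and only implement the embedding lookup itself.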

The ContextFreeLanguage makes a lot of sense! I'm wondering whether it makes more sense to call it ContextFreeMixin or VocabLanguageMixin. I might start writing this, mainly because there's just too much repetition in the score_similar and embset_similar methods.

koaning commented 2 years ago

I'm closing issues because, ever since the project moved to my personal account, it has been more in maintenance mode than in "active work" mode.