mkaze closed this issue 2 years ago.
As an example, we might be interested in making the internal design more structured by having a parent Language class from which all the language classes inherit. Besides giving structure to the design, this could enable some refactoring in the language classes and prevent code duplication.
It would be nice if the only thing you'd have to implement for a language class is __getitem__. There is one main thing to consider design-wise here, and that is the fact that not all backends have a vocabulary. Most BERT-style embeddings do not offer one, which means that .retreive_similar() will only work on a subset of the Language backends. To accommodate this we can do two things; one would be a Mixin class with these features, attached only to a subset of the language classes. We already do this to give each language model scikit-learn support.

Well, since embeddings in general can be either context-free or contextualized, I think there is a one-to-one correspondence between the type of embedding and the models (i.e. languages):
Context-free embeddings: the models providing this kind of embedding are essentially a look-up table alongside a big matrix of values, e.g. Word2Vec, GloVe, Fasttext. They might include some additional computation mechanisms, like handling OOV tokens in Fasttext, but that is usually handled internally and is not relevant here. That's why the [ ] syntax (i.e. __getitem__) is more intuitive and suitable for these models: it's as if we are fetching data from a large data structure.
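As a toy sketch of that intuition (the class name and the tiny vector table here are invented, not the library's API), a context-free backend really can be little more than a dictionary behind __getitem__:

```python
class ContextFreeLanguage:
    """Hypothetical sketch: a context-free backend as a plain lookup table."""

    def __init__(self, vectors):
        # vectors: a mapping from token to its embedding vector
        self.vectors = vectors

    def __getitem__(self, token):
        # Fetching an embedding reads like indexing a data structure,
        # which is why lang["dog"] feels natural for these models.
        return self.vectors[token]

# Usage with a tiny hand-made table:
lang = ContextFreeLanguage({"dog": (1.0, 0.0), "cat": (0.9, 0.1)})
```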
Contextualized embeddings: the models providing this kind of embedding have a computation graph that takes a string containing one or more tokens as input and computes embeddings for the entire string (e.g. TF Hub models), for the tokens in it, or both (e.g. Transformer-based models). That's why the ( ) syntax (i.e. __call__) is more intuitive and suitable for these models: it's as if they apply a function with a series of computations to a given input (however, I am not suggesting we should do this as well).
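A minimal sketch of the same intuition for the second kind, with a stand-in encode function instead of a real model (everything here is illustrative, not the library's API):

```python
class ContextualizedLanguage:
    """Hypothetical sketch: a contextualized backend computes embeddings on the fly."""

    def __init__(self, encode_fn):
        # encode_fn stands in for a real model's forward pass (invented here);
        # it sees both the token and its full sentence context.
        self.encode_fn = encode_fn

    def __call__(self, text):
        # Computing embeddings reads like applying a function,
        # which is why lang("bank of the river") fits these models.
        tokens = text.split()
        return [self.encode_fn(token, tokens) for token in tokens]

# A fake "model" that embeds each token as (token length, sentence length):
lang = ContextualizedLanguage(lambda token, context: (len(token), len(context)))
```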
Now, besides the Language base class, which would provide the base and common functionality of language classes, we might have mixin classes like ContextFreeLanguage (for the first type) and ContextualizedLanguage (for the second type) to provide features/methods specific to each kind of language.
That said, implementing such an abstraction may run into obstacles for some of the models that I cannot see or foresee right now.
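One way the base class could look, assuming __getitem__ is the only abstract hook (the class names here are illustrative, not the library's actual hierarchy):

```python
from abc import ABC, abstractmethod

class Language(ABC):
    """Sketch of a possible base class: the only required hook is __getitem__."""

    @abstractmethod
    def __getitem__(self, query):
        """Return the embedding(s) for the given query."""

class DictLanguage(Language):
    # A trivial concrete backend used only to exercise the base class.
    def __init__(self, vectors):
        self.vectors = vectors

    def __getitem__(self, query):
        return self.vectors[query]

lang = DictLanguage({"dog": (1.0, 0.0)})
```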
I'm wondering: what is actually common between all of our language backends? To me it seems like they should all implement __getitem__, but this method is very different in each language. There's some internal logic that we can share, such as how to deal with [bank] of the river versus bank of the river, but other than that it seems like there's not a whole lot of common ground.
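That bracket-handling logic is the kind of thing that could live in a shared base. A rough guess at what such a parser might do (this is my own sketch of the rule, not the library's implementation):

```python
def parse_query(text):
    """Split a query like '[bank] of the river' into (token of interest, full sentence).

    With no brackets, treat the whole string as the item of interest.
    This is a hypothetical sketch of the shared parsing logic.
    """
    start, end = text.find("["), text.find("]")
    if start == -1 or end == -1:
        return text, text
    token = text[start + 1:end]
    sentence = text.replace("[", "").replace("]", "")
    return token, sentence
```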
The ContextFreeLanguage idea makes a lot of sense! I'm wondering if it makes sense to call it ContextFreeMixin or VocabLanguageMixin. I might like to start writing this, mainly because there's just too much repetition in the score_similar and embset_similar methods.
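A sketch of how such a mixin could factor out the repetition, assuming the host class exposes a vocab iterable and __getitem__ (the class names and method body are a guess at the shared logic, not the library's code):

```python
class VocabLanguageMixin:
    """Hypothetical mixin: similarity scoring for any backend with a vocabulary.

    Requires the host class to provide `self.vocab` (iterable of tokens)
    and `self[token]` (the embedding lookup).
    """

    def score_similar(self, token, n=3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = sum(x * x for x in a) ** 0.5
            norm_b = sum(x * x for x in b) ** 0.5
            return dot / (norm_a * norm_b)

        target = self[token]
        # Score every vocabulary item against the target and keep the top n.
        scores = [(other, cosine(target, self[other])) for other in self.vocab]
        return sorted(scores, key=lambda pair: -pair[1])[:n]

class ToyLanguage(VocabLanguageMixin):
    # A minimal vocabulary-backed language used only for illustration.
    def __init__(self, vectors):
        self.vectors = vectors

    @property
    def vocab(self):
        return self.vectors.keys()

    def __getitem__(self, token):
        return self.vectors[token]

lang = ToyLanguage({"dog": (1.0, 0.0), "cat": (0.9, 0.1), "car": (0.0, 1.0)})
```

Backends without a vocabulary (the BERT-style ones) simply would not get this mixin, so the method never appears on classes where it cannot work.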
I'm closing issues because ever since the project moved to my personal account it's been more in maintenance mode than in "active work" mode.
It's good to have a dedicated place for discussing ideas about the language API design, as well as potential improvement/refactoring options in its implementation. Specifically, we are interested in finding out what the drawbacks of the current API design (and/or implementation) are, and how it could be further improved.
Note that if an issue/idea is big enough to merit a dedicated issue, or we decide to implement a particular idea/feature, feel free to create a new specific issue for it so that it can be tracked easily.