koaning / whatlies

Toolkit to help understand "what lies" in word embeddings. Also benchmarking!
https://koaning.github.io/whatlies/
Apache License 2.0

Support for custom Encoders #175

Closed tttthomasssss closed 2 years ago

tttthomasssss commented 4 years ago

If you currently use word-level embeddings (e.g. fastText), whatlies supports embeddings for sentences by summing the individual word embeddings. While this is a reasonable default behaviour, it's also an arbitrary and inflexible choice. Ideally, whatlies would support standard encoding schemes such as sum, average or max, and otherwise accept callables for any custom operation that users want.
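For illustration, here is a rough sketch of the kind of flexibility meant here (nothing below is existing whatlies API; the names and the `l2_normalised_mean` helper are hypothetical):

```python
import numpy as np

# Three word vectors for a three-token sentence (random, just for illustration).
word_vectors = np.random.rand(3, 300)

# Current default: the sentence embedding is the sum of the word embeddings.
sentence_sum = word_vectors.sum(axis=0)

# Other standard schemes one might want to choose instead:
sentence_mean = word_vectors.mean(axis=0)
sentence_max = word_vectors.max(axis=0)

# ...and an arbitrary custom operation supplied as a callable:
def l2_normalised_mean(vectors):
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)

sentence_custom = l2_normalised_mean(word_vectors)
```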

mkaze commented 4 years ago

Thank you for your feedback. Actually, this is something that is already on the radar and being discussed (in #92), so as to find the best API for changing the default behavior of Language classes, e.g. giving the user the ability to get token embeddings for all tokens in a given sentence, changing the reduction type for sentence embeddings (e.g. various types of pooling including sum, average, max, etc.), or in general controlling and customizing any config of the Language classes. So far the potential options include:

koaning commented 4 years ago

I do realise that we might need 4 settings: sum, average, max and factory_settings.

Mainly because if you have BERT-style embeddings then you never want to resort to sum/average/max as a representation for your __CLS__ token. Since spaCy is likely to support both styles of embeddings (BERT and word-level), we probably need to allow for a factory setting as well to keep the API consistent.

koaning commented 4 years ago

@mkaze It feels like it's probably best to pass configs to __init__. That way we can keep using __getitem__. Or are there some concerns that I am missing?

mkaze commented 4 years ago

It feels like it's probably best to pass configs to __init__. That way we can keep using __getitem__.

Sure, that's one way of implementing it which might make more sense at the moment. Now, do we want it to have a signature like def __init__(..., config) (where config could be a dictionary, for example; this is good when the class might have a lot of config variables, e.g. Rasa components), or an argument per config like __init__(..., combiner='mean') if the set of configs is small? Of course, there is always the opportunity to refactor things and change the implementation in the future; but it's also not bad to have a small discussion about it at this stage.
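Concretely, the two signatures might look roughly like this (class and argument names are illustrative only, not actual whatlies classes):

```python
# Option 1: a single config dict, convenient when there are many settings
# (similar to how Rasa components are configured).
class LanguageWithConfig:
    DEFAULTS = {"combiner": "sum"}

    def __init__(self, model_name, config=None):
        self.model_name = model_name
        self.config = {**self.DEFAULTS, **(config or {})}

# Option 2: one keyword argument per setting, simpler while the set is small.
class LanguageWithKwargs:
    def __init__(self, model_name, combiner="mean"):
        self.model_name = model_name
        self.combiner = combiner
```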

I do realise that we might need 4 settings: sum, average, max and factory_settings. Mainly because if you have BERT-style embeddings then you never want to resort to sum/average/max as a representation for your __CLS__ token.

I didn't get this part and couldn't follow your reasoning; please elaborate more.

tttthomasssss commented 4 years ago

It might be worth following what scikit-learn does when both strings and callables are supported. For example in sklearn.metrics.pairwise_distances, metric can be a string such as 'cosine' (as this is supported by the library internally), or it can be a callable. In the latter case, the API also has a **kwds parameter, which passes its contents on to the custom metric callable.

I don't have strong opinions either way, or any opinion on whether the scikit-learn behaviour is the best possible way, but it's a behaviour many in the community will be familiar with.
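Applied to the combiner idea, that pattern might look roughly like this (a hypothetical sketch in the style of pairwise_distances; `reduce_embeddings` and `power_mean` are made up for illustration):

```python
import numpy as np

def reduce_embeddings(token_vectors, combiner="mean", **kwds):
    """Dispatch in the pairwise_distances style: known strings are handled
    internally, while a callable receives the forwarded **kwds."""
    if callable(combiner):
        return combiner(token_vectors, **kwds)
    reducers = {"sum": np.sum, "mean": np.mean, "max": np.max}
    return reducers[combiner](token_vectors, axis=0)

# A custom combiner with an extra keyword argument, passed through **kwds.
def power_mean(vectors, p=3):
    return (np.abs(vectors) ** p).mean(axis=0) ** (1.0 / p)

vec = reduce_embeddings(np.random.rand(4, 50), combiner=power_mean, p=2)
```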

koaning commented 4 years ago

@mkaze I've tried to draw what occurs internally in a vocabulary-based model and a contextual BERT-style model.

[image: diagram of how a vocabulary-based model represents an utterance]

The vocabulary-based models might have inconsistent methods of representing the entire utterance. I think spaCy takes the sum, while there might be other vocabulary-based models that take the mean. For vocab-based models it makes sense to override this via a setting that can be mean, sum or max.

This library also needs to support the BERT-style stuff, and it gets tricky there. Suppose that we have an utterance play ping pong. Is it ever sensible to override the factory setting that the language backend gives?

[image: diagram of how a contextual BERT-style model represents an utterance]

For BERT-style models you wouldn't want the representation for the entire utterance, say [play ping pong], to be defined as the mean/sum/max of its parts. Instead you'd want to keep the factory setting. Because we support spaCy, we cannot know for certain whether the internal model is vocabulary-based or contextual. That's why we should also have a factory setting.

Notation Detail

As a detail: internally here at Rasa we refer to the __CLS__ token as a token that represents the entire utterance. I don't know if this is a Rasa-internal thing or something that is common in the community, so I figured I'd mention it here explicitly.

mkaze commented 4 years ago

That's why we should also have a factory setting.

That's not different from having a default value of None for the relevant config value and therefore using whatever the language backend does by default in that case; so no worries there!
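In other words, something like this minimal sketch (the sentence_vector function and its arguments are hypothetical, not whatlies API):

```python
import numpy as np

def sentence_vector(backend_vector, token_vectors, combiner=None):
    # combiner=None is the "factory setting": keep whatever pooling the
    # language backend already applies to the whole utterance.
    if combiner is None:
        return backend_vector
    # Otherwise apply the user-chosen reduction over the token vectors.
    return combiner(token_vectors)

tokens = np.random.rand(3, 300)
backend_default = tokens.sum(axis=0)  # pretend this is what the backend returns
vec_default = sentence_vector(backend_default, tokens)
vec_mean = sentence_vector(backend_default, tokens, combiner=lambda v: v.mean(axis=0))
```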

For BERT style models you wouldn't want the representation for the entire utterance, say [play ping pong], to be defined as the mean/sum/max of it's parts.

Actually, that's not entirely true. The __CLS__ token which is present in some of the transformer models has been added and fine-tuned on downstream tasks so that it can be a good representation of the entire input sequence; however, nothing prevents you from using other, alternative representations of the entire sequence based on the contextualized token embeddings of the sequence, and there is also no guarantee that the __CLS__ token would out-perform all of the other representations. Actually, as I have already mentioned in #92, the spacy package uses the average of token embeddings and spacy-transformers uses the sum of token embeddings (not the __CLS__ token). You can even go further, as I also mentioned in #92, and use the representation given by the intermediate transformer layers or even the embedding layer itself. So this usually depends on the downstream task, the data or the analysis you want to perform (just as an example, see the result of different contextualized token embedding combinations in the BERT paper for NER; here is a visual summary).
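For example, with a Hugging Face transformers model all of these sequence representations come out of the same forward pass (model choice and pooling here are purely illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

batch = tokenizer("play ping pong", return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

tokens = out.last_hidden_state[0]                 # contextualized token embeddings
cls_vec = tokens[0]                               # the [CLS] token
mean_vec = tokens.mean(dim=0)                     # average of token embeddings (what spacy does)
sum_vec = tokens.sum(dim=0)                       # sum of token embeddings (what spacy-transformers does)
layer_vec = out.hidden_states[-2][0].mean(dim=0)  # pooling an intermediate layer instead
```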

koaning commented 4 years ago

@mkaze ah right. Yeah, None is better.

Your comment on __CLS__ is true with regard to the downstream task. I imagine, though, that because of this downstream task it's exactly this representation that you typically want to visualise. Then again, allowing this flexibility for end users sounds like a solid feature.

mkaze commented 4 years ago

It might be worth following what scikit-learn does when both strings and callables are supported. For example in sklearn.metrics.pairwise_distances, metric can be a string such as 'cosine' (as this is supported by the library internally), or it can be a callable. In the latter case, the API also has a **kwds parameter, which passes its contents on to the custom metric callable.

Supporting string metric values as well as callables is not an issue at all and could be done easily. However, I am not in favor of adding that **kwds argument, mainly because we already have that in some of the language classes (for example here in HFTransformers) for passing additional arguments to the underlying language backend constructor. So if we want to use __init__ for this purpose and also support callables as well as custom keyword arguments for them, we should either:

koaning commented 2 years ago

Since the repository moved to my personal account I'm moving this project into maintenance mode. This issue, while interesting, likely won't get the attention it needs in the future.