Closed: tttthomasssss closed 2 years ago
Thank you for your feedback. Actually, this is something that is on the radar and being discussed (in #92) in order to find the best API for changing the default behavior of Language classes, e.g. giving the user the ability to get token embeddings for all tokens in a given sentence, or changing the reduction type for sentence embeddings (e.g. various types of pooling, including sum, average, max, etc.), or in general controlling and customizing any config of Language classes. So far the potential options include:

- passing the configs to `__init__` once when initializing the Language class, or
- passing them to the `__call__` method, which besides the given sentence/token could take configs as an additional argument, or
- adding a separate method (e.g. `set_config`) to control and modify configs.

I do realise that we might need 4 settings: `sum`, `average`, `max` and `factory_settings`. Mainly because if you have BERT-style embeddings then you never want to resort to `sum`/`average`/`max` as a representation for your `__CLS__` token. Since spaCy is likely to support both styles of embeddings (BERT and word-level) we probably need to allow for a factory setting as well to keep the API consistent.
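The distinction between the three pooled settings can be sketched with a toy token matrix (the numbers are made up purely for illustration); a BERT-style model would instead emit a dedicated `__CLS__`-style vector that is not derived from any of these reductions:

```python
import numpy as np

# Toy token embeddings for a 3-token sentence, 4 dimensions each
# (made-up numbers, just to illustrate the reduction choices).
tokens = np.array([
    [1.0, 0.0, 2.0, 1.0],   # "play"
    [0.0, 2.0, 1.0, 1.0],   # "ping"
    [2.0, 1.0, 0.0, 1.0],   # "pong"
])

sentence_sum = tokens.sum(axis=0)    # sum of word vectors
sentence_avg = tokens.mean(axis=0)   # mean pooling
sentence_max = tokens.max(axis=0)    # max pooling

print(sentence_sum)   # [3. 3. 3. 3.]
print(sentence_avg)   # [1. 1. 1. 1.]
print(sentence_max)   # [2. 2. 2. 1.]
```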
@mkaze It feels like it's probably best to pass configs to `__init__`. That way we can keep using `__getitem__`. Or are there some concerns that I am missing?
> It feels like it's probably best to pass configs to `__init__`. That way we can keep using `__getitem__`.
Sure, that's one way of implementing it, which might make more sense at the moment. Now, do we want it to have a signature like `def __init__(..., config)` (where `config` could be a dictionary, for example; this is good when the class might have a lot of config variables, e.g. Rasa components), or an argument per config like `__init__(..., combiner='mean')` if the set of configs is small? Of course, there is always the opportunity to refactor things and change the implementation in the future; but it's also not bad to have a small discussion about it at this stage.
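For illustration, the two proposed styles might look like the sketch below (the class and argument names are hypothetical, not the actual whatlies API):

```python
# Option 1: a single config dictionary. Scales well when a class
# has many config variables (cf. Rasa components).
class LanguageA:
    def __init__(self, model_name, config=None):
        self.model_name = model_name
        # Merge user overrides over defaults.
        self.config = {"combiner": "mean", **(config or {})}

# Option 2: one explicit keyword argument per config.
# Clearer and more discoverable when the set of configs is small.
class LanguageB:
    def __init__(self, model_name, combiner="mean"):
        self.model_name = model_name
        self.combiner = combiner

lang_a = LanguageA("en", config={"combiner": "max"})
lang_b = LanguageB("en", combiner="max")
print(lang_a.config["combiner"], lang_b.combiner)  # max max
```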
> I do realise that we might need 4 settings: `sum`, `average`, `max` and `factory_settings`. Mainly because if you have BERT-style embeddings then you never want to resort to `sum`/`average`/`max` as a representation for your `__CLS__` token.

I didn't get this part and couldn't follow your reasoning; could you elaborate a bit more?
It might be worth following what `scikit-learn` does when both strings and callables are supported. For example, in `sklearn.metrics.pairwise_distances`, `metric` can be a string such as `'cosine'` (as this is supported by the library internally), or it can be a callable. In the latter case, the API also has a `**kwds` parameter, which passes its contents on to the custom `metric` callable.
I don't have strong opinions either way, or any opinion on whether the `scikit-learn` behaviour is the best possible way, but it's a behaviour many in the community will be familiar with.
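A minimal sketch of that scikit-learn-style string-or-callable dispatch, applied to sentence reduction (the `combine` function and its names are made up for illustration, not part of whatlies):

```python
import numpy as np

# Reductions the library supports internally, selectable by string.
_BUILTIN_COMBINERS = {
    "sum": lambda v: v.sum(axis=0),
    "mean": lambda v: v.mean(axis=0),
    "max": lambda v: v.max(axis=0),
}

def combine(vectors, combiner="sum", **kwds):
    """Reduce a (tokens, dim) matrix to a single sentence vector.

    `combiner` is either a known string or a callable; extra keyword
    arguments are forwarded to a callable, mirroring sklearn's `**kwds`.
    """
    if callable(combiner):
        return combiner(vectors, **kwds)
    return _BUILTIN_COMBINERS[combiner](vectors)

vecs = np.ones((3, 4))
print(combine(vecs, "sum"))                                    # [3. 3. 3. 3.]
print(combine(vecs, lambda v, p: (v ** p).sum(axis=0), p=2))   # [3. 3. 3. 3.]
```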
@mkaze I've tried to draw what occurs internally in a vocabulary-based model and a contextual BERT-style model.
Vocabulary-based models might have inconsistent methods of representing the entire utterance. I think spaCy takes the sum, while there might be other vocabulary-based models that take the mean. For vocab-based models it makes sense to override this via a setting that can be `mean`, `sum` or `max`.
This library also needs to support the BERT-style stuff, and it gets tricky there. Suppose that we have an utterance `play ping pong`. Is it ever sensible to override the factory setting that the language backend gives?
For BERT-style models you wouldn't want the representation for the entire utterance, say `[play ping pong]`, to be defined as the mean/sum/max of its parts. Instead you'd want to keep the factory setting. Because we support spaCy, we cannot know for certain if the internal model is vocabulary-based or contextual. That's why we should also have a `factory` setting.
As a detail: internally here at Rasa we refer to the `__CLS__` token as a token that represents the entire utterance. I don't know if this is a Rasa-internal thing or something that is common in the community, so I figured I'd mention it here explicitly.
> That's why we should also have a factory setting.

That's not different from having a default value of `None` for the relevant config value, and therefore using whatever the language backend does by default in that case; so no worries there!
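That `None`-as-factory-default idea can be sketched like this (function and argument names are hypothetical; the backend vector stands in for whatever the language backend would return, e.g. a `__CLS__` vector):

```python
import numpy as np

def sentence_vector(token_vectors, backend_vector, combiner=None):
    """Hypothetical reduction with a factory default.

    When `combiner` is None we fall back to whatever the language
    backend produced (e.g. a BERT __CLS__ vector); otherwise we pool
    the token vectors ourselves.
    """
    if combiner is None:
        return backend_vector
    pool = {"sum": np.sum, "mean": np.mean, "max": np.max}[combiner]
    return pool(token_vectors, axis=0)

tokens = np.ones((3, 4))
backend = np.full(4, 9.0)           # pretend this is the backend's own vector
print(sentence_vector(tokens, backend))          # [9. 9. 9. 9.] (factory)
print(sentence_vector(tokens, backend, "sum"))   # [3. 3. 3. 3.]
```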
> For BERT style models you wouldn't want the representation for the entire utterance, say `[play ping pong]`, to be defined as the mean/sum/max of its parts.

Actually, that's not entirely true. The `__CLS__` token which is present in some of the transformer models has been added and fine-tuned on downstream tasks so that it could be a good representation of the entire input sequence; however, nothing prevents you from using alternative representations for the entire sequence based on the contextualized token embeddings of the sequence, and there is also no guarantee that the `__CLS__` token would out-perform all of the other representations. Actually, as I have already mentioned in #92, the `spacy` package uses the average of token embeddings and `spacy-transformers` uses the sum of token embeddings (not the `__CLS__` token). You can even go further, as I also mentioned in #92, and use the representation given by the intermediate transformer layers or even the embedding layer itself. So this usually depends on the downstream task, the data, or the analysis you want to perform (just as an example, see the results of different contextualized token embedding combinations in the BERT paper for NER; here is a visual summary).
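The point about alternative representations can be sketched with fake contextual outputs; the shapes mimic a transformer's hidden states (layers x tokens x dim), but the numbers are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake hidden states: 4 layers, 5 tokens (position 0 = __CLS__), dim 8.
hidden = rng.normal(size=(4, 5, 8))

cls_vec = hidden[-1, 0]                  # last layer's __CLS__ vector
mean_vec = hidden[-1, 1:].mean(axis=0)   # mean over last-layer token vectors
layer2_sum = hidden[2, 1:].sum(axis=0)   # sum over an intermediate layer

# All three are valid sentence representations of the same dimensionality;
# which one works best depends on the downstream task
# (cf. the BERT paper's NER ablation over layer combinations).
print(cls_vec.shape, mean_vec.shape, layer2_sum.shape)  # (8,) (8,) (8,)
```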
@mkaze Ah right. Yeah, `None` is better.

Your comment on `__CLS__` is true with regard to the downstream task. I imagine, though, that because of this downstream task it's exactly this that you typically want to visualise. Then again, allowing this flexibility to end users sounds like a solid feature.
> It might be worth following what `scikit-learn` does when both strings and callables are supported. For example, in `sklearn.metrics.pairwise_distances`, `metric` can be a string such as `'cosine'` (as this is supported by the library internally), or it can be a callable. In the latter case, the API also has a `**kwds` parameter, which passes its contents on to the custom `metric` callable.
Supporting string metric values as well as callables is not an issue at all and could be done easily. However, I am not in favor of adding that `**kwds` argument, mainly because we already have that in some of the language classes (for example here in `HFTransformers`) for passing additional arguments to the underlying language backend constructor. So if we want to use `__init__` for this purpose and also support callables as well as custom keyword arguments for them, we should either:

- add another keyword argument to `__init__` like `combiner_kwd`, or
- expect the user to provide a callable with its keyword arguments already bound (e.g. with a `lambda` or using `functools.partial`).

Since the repository moved to my personal account I'm moving this project into maintenance mode. This issue, while interesting, likely won't get the attention it needs in the future.
If you currently use word-level embeddings (e.g. fastText), `whatlies` supports embeddings for sentences by summing the individual word embeddings. While this is reasonable default behaviour, it's also an arbitrary and inflexible choice. Ideally `whatlies` can support standard encoding schemes such as `sum`, `average` or `max`, and otherwise offer the use of callables for any custom operation that users want.