abheesht17 closed this issue 1 year ago
Great catch @abheesht17! It's pretty clear that the current layer is overfit to BERT. While we could build one layer that can do many things, I wonder if a refactor that had separate BERT-style and Roberta-style layers but possibly sharing common infra (like truncating and padding) would give us a cleaner API?
Another way of saying this is that if `MultiSegmentPacker` is really a `BertMultiSegmentPacker`, then some refactoring is needed. If there is a lot of variation between models, the packing logic probably should go with the model code. If not, we should probably give each distinct packing style its own layer if they are quite different.
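To make that concrete, here is a rough sketch of what sharing common infra between a BERT-style and a RoBERTa-style packer could look like (all class names and token ids below are illustrative, not existing KerasNLP API):

```python
# Illustrative sketch only -- these classes do not exist in KerasNLP.
class BasePacker:
    """Shared infra: truncate two segments to fit, then pad to sequence_length."""

    def __init__(self, sequence_length, pad_value=0):
        self.sequence_length = sequence_length
        self.pad_value = pad_value

    def _truncate(self, seq1, seq2, num_special_tokens):
        budget = self.sequence_length - num_special_tokens
        # Round-robin style truncation: trim the longer segment first.
        while len(seq1) + len(seq2) > budget:
            if len(seq1) >= len(seq2):
                seq1 = seq1[:-1]
            else:
                seq2 = seq2[:-1]
        return seq1, seq2

    def _pad(self, ids):
        return ids + [self.pad_value] * (self.sequence_length - len(ids))


class BertStylePacker(BasePacker):
    """[CLS] seq1 [SEP] seq2 [SEP], plus segment ids."""

    def __call__(self, seq1, seq2, cls_id=101, sep_id=102):
        seq1, seq2 = self._truncate(seq1, seq2, num_special_tokens=3)
        token_ids = [cls_id] + seq1 + [sep_id] + seq2 + [sep_id]
        segment_ids = [0] * (len(seq1) + 2) + [1] * (len(seq2) + 1)
        return self._pad(token_ids), self._pad(segment_ids)


class RobertaStylePacker(BasePacker):
    """<s> seq1 </s></s> seq2 </s>, no segment ids."""

    def __call__(self, seq1, seq2, bos_id=0, eos_id=2):
        seq1, seq2 = self._truncate(seq1, seq2, num_special_tokens=4)
        token_ids = [bos_id] + seq1 + [eos_id, eos_id] + seq2 + [eos_id]
        return self._pad(token_ids)
```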
Just talked with @jbischof.
I think the laziest approach we can take here is to just keep a forked, unexported layer for the RoBERTa-style packing inside the RoBERTa model code for now. This isn't something we document or stick in an `__init__.py`; it is only exposed for now through the high-level `RobertaPreprocessor` that goes along with `BertPreprocessor`.
Then, as a follow up, we can have the discussion about which is the best "generic version" of the multi-segment packer to expose. That will come with compatibility concerns too, e.g. how much weight we give to any existing usage in keras.io guides and the like.
Basically, I would propose the following approach when we hit stuff like this where we diverge from our "stock layer" offering:
1) If we have an obvious and backward-compatible fix to our generic KerasNLP layer, then great! Let's do it. This was the case for the encoder/decoder `normalize_first` option (see the snippet after this list).
2) If we hit a place where the solution is unclear, then we just adopt the "fork first" approach and discuss the harder question of the generic KerasNLP layer separately. I think this packing might fall in this bucket.
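For reference, `normalize_first` ended up as a backward-compatible constructor flag on the stock layers, roughly like this:

```python
import keras_nlp

# Pre-layer-norm variant of the stock encoder block, opted into via a flag
# that defaults to False, so existing code keeps its old behavior.
encoder = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=512,
    num_heads=8,
    normalize_first=True,
)
```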
Does that make sense to people?
Hopefully, without the need for segment ids, the RoBERTa layer can be a bit simpler anyway.
This was fixed in https://github.com/keras-team/keras-nlp/pull/1046
@mattdangerw, @chenmoneygithub -
For the `MultiSegmentPacker` layer, we need to make one change. Currently, this is the output of the layer:
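Schematically (using placeholder names rather than the layer's exact argument names), the packed output and segment ids look like this:

```
token_ids:   [<start_token>, seq1..., <end_token>, seq2..., <end_token>]
segment_ids: [0,             0...,    0,           1...,    1          ]
```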
But `<end_token>` is not always used on its own to separate two sequences. For example, this is how RoBERTa does it:
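Roughly, RoBERTa separates the two segments with a pair of end tokens:

```
<s>, seq1..., </s>, </s>, seq2..., </s>
```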
You can check this here: https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaTokenizerFast.build_inputs_with_special_tokens
Secondly, RoBERTa does not have `segment_ids`. So, we can add another arg to control whether we want to return `segment_ids`. I've tried out both of the above using HF:
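A minimal version of that check with the Hugging Face `transformers` library looks roughly like this (the token strings in the comments are approximate):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Encoding a pair of segments inserts two `</s>` tokens between them.
encoded = tokenizer("first segment", "second segment")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Roughly: ['<s>', 'first', 'Ġsegment', '</s>', '</s>', 'second', 'Ġsegment', '</s>']

# RoBERTa does not feed segment (token type) ids to the model at all.
print(tokenizer.model_input_names)
# ['input_ids', 'attention_mask']
```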