keras-team / keras-hub

Pretrained model hub for Keras 3

Make Changes to `MultiSegmentPacker` Layer for RoBERTa #368

Closed. abheesht17 closed this issue 1 year ago.

abheesht17 commented 2 years ago

@mattdangerw, @chenmoneygithub -

For the `MultiSegmentPacker` layer, we need to make a couple of changes.

Currently, this is the output of the layer:

```
<start_token>seq1<end_token>seq2<end_token><pad><pad>...
```

But `<end_token>` is not always used to separate two sequences. For example, this is how RoBERTa does it:

```
<s>seq1</s></s>seq2</s><pad><pad>...
```

i.e.,

```
<start_token>         = <s>
<seq_splitting_token> = </s></s>
<end_token>           = </s>
```

You can check this here: https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaTokenizerFast.build_inputs_with_special_tokens

Secondly, RoBERTa does not use `segment_ids`. So, we can add another argument that controls whether the layer returns `segment_ids`.
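
To make the proposal concrete, here is a minimal pure-Python sketch of the packing behaviour I have in mind. This is not the actual layer API; `roberta_pack` and its defaults are placeholder names, using the XLM-R ids `<s>` = 0, `</s>` = 2, `<pad>` = 1, and a naive truncation just for illustration:

```python
def roberta_pack(seq1_ids, seq2_ids=None, start_id=0, end_id=2, pad_id=1, sequence_length=16):
    """Pack one or two tokenized segments as <s> seq1 </s> </s> seq2 </s> <pad> ..."""
    ids = [start_id] + list(seq1_ids) + [end_id]
    if seq2_ids is not None:
        # RoBERTa puts two end tokens between segments, not a single separator.
        ids += [end_id] + list(seq2_ids) + [end_id]
    ids = ids[:sequence_length]  # naive truncation, only for illustration
    padding_mask = [1] * len(ids) + [0] * (sequence_length - len(ids))
    ids += [pad_id] * (sequence_length - len(ids))
    # No segment_ids are returned, since RoBERTa does not use them.
    return ids, padding_mask
```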

I've tried out both of the above using HF:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
>>> output = tokenizer(["hello, checking how this works."], text_pair=["it puts two end tokens between sequences."])
>>> output
{'input_ids': [[0, 33600, 31, 4, 175199, 3642, 903, 43240, 5, 2, 2, 442, 3884, 7, 6626, 3564, 47, 84694, 17721, 40, 26513, 5170, 5, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
>>> tokenizer.convert_ids_to_tokens(output["input_ids"][0])
['<s>',
 '▁hell',
 'o',
 ',',
 '▁checking',
 '▁how',
 '▁this',
 '▁works',
 '.',
 '</s>',
 '</s>',
 '▁it',
 '▁put',
 's',
 '▁two',
 '▁end',
 '▁to',
 'kens',
 '▁between',
 '▁se',
 'quen',
 'ces',
 '.',
 '</s>']
```
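
As a quick sanity check, the `build_inputs_with_special_tokens` helper linked above should produce the same layout when building the pair by hand (I haven't checked edge cases, but something like this):

```python
>>> ids_a = tokenizer("hello, checking how this works.", add_special_tokens=False)["input_ids"]
>>> ids_b = tokenizer("it puts two end tokens between sequences.", add_special_tokens=False)["input_ids"]
>>> tokenizer.build_inputs_with_special_tokens(ids_a, ids_b) == output["input_ids"][0]
True
```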
jbischof commented 2 years ago

Great catch @abheesht17! It's pretty clear that the current layer is overfit to BERT. While we could build one layer that does many things, I wonder if a refactor with separate BERT-style and RoBERTa-style layers, possibly sharing common infra (like truncation and padding), would give us a cleaner API?

Another way of saying this is that if `MultiSegmentPacker` is really a `BertMultiSegmentPacker`, then some refactoring is needed. If there is a lot of variation between models, the packing logic should probably live with the model code. If not, we should probably still give each distinct packing style its own layer or method where the models differ significantly.
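
Roughly, I am picturing something like the sketch below. The class names are made up and it glosses over tensor/dtype details; it is only meant to show the split between shared infra and model-specific special-token logic:

```python
class SegmentPackerBase:
    """Shared infra: truncation and padding, common to all models."""

    def __init__(self, sequence_length, pad_id):
        self.sequence_length = sequence_length
        self.pad_id = pad_id

    def add_special_tokens(self, segments):
        raise NotImplementedError  # model-specific

    def __call__(self, segments):
        ids = self.add_special_tokens(segments)[: self.sequence_length]
        return ids + [self.pad_id] * (self.sequence_length - len(ids))


class BertStylePacker(SegmentPackerBase):
    """[CLS] seq1 [SEP] seq2 [SEP] ..."""

    def __init__(self, sequence_length, pad_id, cls_id, sep_id):
        super().__init__(sequence_length, pad_id)
        self.cls_id, self.sep_id = cls_id, sep_id

    def add_special_tokens(self, segments):
        ids = [self.cls_id]
        for segment in segments:
            ids += list(segment) + [self.sep_id]
        return ids


class RobertaStylePacker(SegmentPackerBase):
    """<s> seq1 </s></s> seq2 </s> ..."""

    def __init__(self, sequence_length, pad_id, start_id, end_id):
        super().__init__(sequence_length, pad_id)
        self.start_id, self.end_id = start_id, end_id

    def add_special_tokens(self, segments):
        ids = [self.start_id] + list(segments[0]) + [self.end_id]
        for segment in segments[1:]:
            ids += [self.end_id] + list(segment) + [self.end_id]
        return ids
```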

mattdangerw commented 2 years ago

Just talked with @jbischof.

I think the laziest approach here is to keep a forked, unexported layer for the RoBERTa-style packing inside the RoBERTa model code for now. This isn't something we document or stick in an `__init__.py`; for now it is only exposed through the high-level `RobertaPreprocessor` that parallels `BertPreprocessor`.

Then, as a follow-up, we can have the discussion about which "generic version" of the multi-segment packer is best to expose. That will come with compatibility concerns too, e.g. how much weight we give to any existing usage in keras.io guides and such.
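
For concreteness, the "fork first" layout could look roughly like the sketch below. All names here are placeholders rather than real keras-nlp code; the point is that the forked packer stays undocumented and only surfaces through the preprocessor:

```python
class RobertaPreprocessor:
    """Analogue of BertPreprocessor; the forked packer is an internal detail."""

    def __init__(self, tokenizer, packer):
        # `packer` is the forked, undocumented RoBERTa-style packer. It lives
        # next to the roberta model code and is not added to any __init__.py.
        self.tokenizer = tokenizer
        self.packer = packer

    def __call__(self, first_segment, second_segment=None):
        segments = [self.tokenizer(first_segment)]
        if second_segment is not None:
            segments.append(self.tokenizer(second_segment))
        token_ids, padding_mask = self.packer(segments)
        # Unlike BertPreprocessor, the output has no "segment_ids" entry.
        return {"token_ids": token_ids, "padding_mask": padding_mask}
```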

Basically, I would propose the following approach when we hit stuff like this where we diverge from our "stock layer" offering:

1) If we have an obvious and backward-compatible fix to our generic KerasNLP layer, then great! Let's do it. This was the case for encoder/decoder `normalize_first`.
2) If we hit a place where the solution is unclear, then we adopt the "fork first" approach and discuss the harder resolution for the generic KerasNLP layer afterwards. I think this packing might fall in this bucket.

Does that make sense to people?

mattdangerw commented 2 years ago

Hopefully, without the need for segment ids, the RoBERTa layer can be a bit simpler anyway.

mattdangerw commented 1 year ago

This was fixed in https://github.com/keras-team/keras-nlp/pull/1046.