keras-team / keras-hub

Pretrained model hub for Keras 3
Apache License 2.0

Support wrapping sequences across samples for LM tasks #701

Open mattdangerw opened 1 year ago

mattdangerw commented 1 year ago

Both RoBERTa and GPT2 pretraining leverage wrapped, densely packed sequences for unsupervised language model learning. Essentially, the training samples will look something like this (I'm omitting masking and labeling for clarity)...

The  qu   #ick  br   #own  fox    jump   #ed
over the  lazy  dog  .     </s>   The    lazy
dog  sle  #pt   un   #der  the    pale   moon

Essentially, every sample always has the full sequence length, and end-of-text markers need not line up with sample boundaries at all. This has the advantage of being both simple and efficient: all weights are being trained perpetually during the unsupervised task.
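For concreteness, here is a minimal sketch (not library code) of that packing as a dataset-level transform with tf.data: tokenized documents get an end-of-text token appended, are flattened into one continuous token stream, and are then rebatched into fixed-length windows that ignore document boundaries. The token ids, sequence length, and end-of-text id below are all illustrative.

```python
import tensorflow as tf

SEQ_LEN = 8  # illustrative; real pretraining uses 512+
EOS_ID = 2   # hypothetical end-of-text token id

# Pretend these are three already-tokenized documents of different lengths.
docs = tf.data.Dataset.from_tensor_slices(
    tf.ragged.constant([[5, 6, 7], [8, 9, 10, 11, 12], [13, 14, 15, 16]])
)

packed = (
    docs
    # Append the end-of-text marker to each document.
    .map(lambda ids: tf.concat([ids, [EOS_ID]], axis=0))
    # Flatten all documents into one continuous token stream.
    .unbatch()
    # Re-chunk the stream into full-length samples; document boundaries can
    # land anywhere inside a sample.
    .batch(SEQ_LEN, drop_remainder=True)
)

for sample in packed:
    print(sample.numpy())
```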

We should consider whether we want to support this at the task level, and if so, how, as this type of preprocessing is inexpressible with our preprocessing layer design.

mattdangerw commented 1 year ago

A few notes and musings on this design problem, which is quite an interesting one.

A few open questions we should investigate.

jbischof commented 1 year ago

My default strategy (not having looked into this myself) is that we should replicate prior art unless we can improve upon it. If the BERT/RoBERTa repos offer a separate script for featurizing raw text data, we can

  1. Have our preprocessors expect the output of these scripts
  2. Offer a version of these scripts outside the repo in the long run

This is part of an overarching "simple preprocessing" proposal I'm thinking about: make our task models fairly dumb and assume that any complex preprocessing (which will inevitably depend on the raw data format) is already handled.
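As a purely hypothetical illustration of option 1 above, the in-repo preprocessing could shrink to parsing records that an external featurization script already tokenized, windowed, and masked. The feature names and file name below are made up, not the actual schema used by the existing BERT example scripts:

```python
import tensorflow as tf

SEQ_LEN = 512  # illustrative

# Hypothetical schema written by an offline featurization script.
feature_spec = {
    "token_ids": tf.io.FixedLenFeature([SEQ_LEN], tf.int64),
    "padding_mask": tf.io.FixedLenFeature([SEQ_LEN], tf.int64),
    "mask_positions": tf.io.VarLenFeature(tf.int64),
    "mask_ids": tf.io.VarLenFeature(tf.int64),
}

def parse_example(serialized):
    features = tf.io.parse_single_example(serialized, feature_spec)
    # Variable-length features parse as sparse tensors; densify them.
    features["mask_positions"] = tf.sparse.to_dense(features["mask_positions"])
    features["mask_ids"] = tf.sparse.to_dense(features["mask_ids"])
    return features

# The task-level "preprocessing" is then just a TFRecord reader.
ds = tf.data.TFRecordDataset(["pretraining-00000-of-00100.tfrecord"])
ds = ds.map(parse_example).batch(32)
```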

mattdangerw commented 1 year ago

The issue is going to be the uniformity of our task API. Right now, all of our task models operate on raw strings. If we let BERT do what upstream BERT does, the input format for a BERT task will be tokenized, windowed, and masked TFRecords (this is how our example is structured). If we let RoBERTa do what upstream RoBERTa does, the input format will be tokenized and sharded files, not yet windowed or masked. (And it's still unclear to me whether we can do everything RoBERTa does dynamically and efficiently with tf.data.)

We have to worry about the consistency of our task API. The obvious escape hatch (to me) is to show "pretraining recipes" with preprocessor=None. Then we could ship complete RoBERTa and BERT examples that have a slightly different breakdown of which preprocessing goes into which script.
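A rough sketch of that escape hatch, assuming the existing pattern of passing preprocessor=None at construction time (keras_nlp here, keras_hub in newer releases); the preset name, feature keys, and fake data are illustrative, not a prescribed recipe:

```python
import numpy as np
import tensorflow as tf
import keras_nlp

SEQ_LEN, BATCH = 128, 2

# Stand-in for the output of an offline pretraining recipe/script: batches of
# already tokenized and packed features plus next-token labels.
features = {
    "token_ids": np.random.randint(0, 50257, size=(BATCH, SEQ_LEN)),
    "padding_mask": np.ones((BATCH, SEQ_LEN), dtype="int32"),
}
labels = np.roll(features["token_ids"], -1, axis=1)
packed_ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(BATCH)

# Build the task with no attached preprocessor; it now consumes the
# pre-featurized inputs directly instead of raw strings.
causal_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en",
    preprocessor=None,
)
causal_lm.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
)
causal_lm.fit(packed_ds, epochs=1)
```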

To me, a bad outcome would be an API in which:

This would be really confusing and a significant point of friction. We are going to have to be somewhat editorial with these models if we want our tasks to have consistent UX.