keras-team / keras-hub

Pretrained model hub for Keras 3
Apache License 2.0

Support wrapping sequences across samples for LM tasks #701

Open mattdangerw opened 1 year ago

mattdangerw commented 1 year ago

Both RoBERTa and GPT2 pretraining leverage wrapped, densely packed sequences for unsupervised language model learning. Essentially, the training samples will look something like this (I'm omitting masking and labeling for clarity)...

The  qu   #ick  br   #own  fox    jump   #ed
over the  lazy  dog  .     </s>   The    lazy
dog  sle  #pt   un   #der  the    pale   moon

Essentially, every sample always has the full sequence length, and end-of-text markers need not line up with sample boundaries at all. This has the advantage of being both simple and efficient: all weights are being trained perpetually during the unsupervised task.
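For concreteness, here is a minimal sketch (not library code) of that packing as a dataset-level transform with tf.data: tokenized documents get an end-of-text token appended, are flattened into one continuous token stream, and are then rebatched into fixed-length windows that ignore document boundaries. The token ids, sequence length, and end-of-text id below are all illustrative.

```python
import tensorflow as tf

SEQ_LEN = 8  # illustrative; real pretraining uses 512+
EOS_ID = 2   # hypothetical end-of-text token id

# Pretend these are three already-tokenized documents of different lengths.
docs = tf.data.Dataset.from_tensor_slices(
    tf.ragged.constant([[5, 6, 7], [8, 9, 10, 11, 12], [13, 14, 15, 16]])
)

packed = (
    docs
    # Append the end-of-text marker to each document.
    .map(lambda ids: tf.concat([ids, [EOS_ID]], axis=0))
    # Flatten all documents into one continuous token stream.
    .unbatch()
    # Re-chunk the stream into full-length samples; document boundaries can
    # land anywhere inside a sample.
    .batch(SEQ_LEN, drop_remainder=True)
)

for sample in packed:
    print(sample.numpy())
```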

We should consider whether we want to support this at the task level, and if so, how, as this type of preprocessing is inexpressible with our preprocessing layer design.

mattdangerw commented 1 year ago

A few notes and musings on this design problem, which is quite an interesting one.

A few open questions we should investigate.

jbischof commented 1 year ago

My default strategy (not having looked into this myself) is that we should replicate prior art unless we can improve upon it. If the BERT/RoBERTa repos offer a separate script for featurizing raw text data, we can

  1. Have our preprocessors expect the output of these scripts
  2. Offer a version of these scripts outside the repo in the long run

This is part of an overarching "simple preprocessing" proposal I'm thinking about: make our task models fairly dumb and assume that any complex preprocessing (which will inevitably depend on the raw data format) is already handled.
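As a purely hypothetical illustration of option 1 above, the in-repo preprocessing could shrink to parsing records that an external featurization script already tokenized, windowed, and masked. The feature names and file name below are made up, not the actual schema used by the existing BERT example scripts:

```python
import tensorflow as tf

SEQ_LEN = 512  # illustrative

# Hypothetical schema written by an offline featurization script.
feature_spec = {
    "token_ids": tf.io.FixedLenFeature([SEQ_LEN], tf.int64),
    "padding_mask": tf.io.FixedLenFeature([SEQ_LEN], tf.int64),
    "mask_positions": tf.io.VarLenFeature(tf.int64),
    "mask_ids": tf.io.VarLenFeature(tf.int64),
}

def parse_example(serialized):
    features = tf.io.parse_single_example(serialized, feature_spec)
    # Variable-length features parse as sparse tensors; densify them.
    features["mask_positions"] = tf.sparse.to_dense(features["mask_positions"])
    features["mask_ids"] = tf.sparse.to_dense(features["mask_ids"])
    return features

# The task-level "preprocessing" is then just a TFRecord reader.
ds = tf.data.TFRecordDataset(["pretraining-00000-of-00100.tfrecord"])
ds = ds.map(parse_example).batch(32)
```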

mattdangerw commented 1 year ago

The issue is going to be the uniformity of our task API. Right now, all of our task models operate on raw strings. If we let BERT do what upstream BERT does, the input format for a BERT task will be tokenized, windowed, and masked TFRecords (this is how our example is structured). If we let RoBERTa do what upstream RoBERTa does, the input format will be tokenized and sharded files, not yet windowed or masked. (And it's still unclear to me whether we can do everything RoBERTa does dynamically and efficiently with tf.data.)

We have to worry about the consistency of our task API. The obvious escape hatch (to me) is to show "pretraining recipes" with preprocessor=None. Then we could ship complete RoBERTa and BERT examples that have a slightly different breakdown of which preprocessing goes into which script.
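A rough sketch of that escape hatch, assuming the existing pattern of passing preprocessor=None at construction time (keras_nlp here, keras_hub in newer releases); the preset name, feature keys, and fake data are illustrative, not a prescribed recipe:

```python
import numpy as np
import tensorflow as tf
import keras_nlp

SEQ_LEN, BATCH = 128, 2

# Stand-in for the output of an offline pretraining recipe/script: batches of
# already tokenized and packed features plus next-token labels.
features = {
    "token_ids": np.random.randint(0, 50257, size=(BATCH, SEQ_LEN)),
    "padding_mask": np.ones((BATCH, SEQ_LEN), dtype="int32"),
}
labels = np.roll(features["token_ids"], -1, axis=1)
packed_ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(BATCH)

# Build the task with no attached preprocessor; it now consumes the
# pre-featurized inputs directly instead of raw strings.
causal_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en",
    preprocessor=None,
)
causal_lm.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
)
causal_lm.fit(packed_ds, epochs=1)
```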

To me, a bad outcome would be an API in which:

This would be really confusing and a significant point of friction. We are going to have to be somewhat editorial with these models if we want our tasks to have consistent UX.