mattdangerw opened 1 year ago
One potential way this could work:
Rework the split sentence script to go from a raw Wikipedia dump and books text files to a set of sharded files containing triples of entries with the form (sentence1, sentence2, next_sentence_label). The output file format should be either CSV or TFRecords (and sharded).
This "data prep" script does not need to leverage tf.data at all, but we do want it to be simple and efficient when working with the roughly ~20GB of input text that comes with BERT pre-training. It will definitely need to use multiple CPU cores efficiently, most likely via the multiprocessing module.
We will also need to take care that the input sentences are of the correct length. There is a good bit of logic here about deciding when to split a pretraining example between the first and second sentences. Recreating that logic exactly would require tokenizing the input just to figure out its token length, but I suspect we can be heuristic and just use the text's word count.
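The word-count heuristic could look something like the sketch below (hypothetical name and budget; the original BERT script budgets by token count after tokenization, so word count is a stand-in approximation):

```python
def split_by_word_count(sentences, max_words=256):
    """Greedily pack sentences into a (segment_a, segment_b) pair.

    Word count stands in for token length: roughly half the budget goes
    to the first segment, the rest to the second. Sentences that would
    overflow the total budget are dropped.
    """
    first, second, used = [], [], 0
    target_a = max_words // 2
    for sentence in sentences:
        n = len(sentence.split())
        if used + n > max_words:
            break
        (first if used < target_a else second).append(sentence)
        used += n
    return " ".join(first), " ".join(second)
```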
After reworking the data prep script, we would then need to do all the tokenization, segment packing, and word masking using tf.data and the WordPieceTokenizer, MultiSegmentPacker, and MLMMaskGenerator layers. This preprocessing will live inside the bert_train.py script. We will need to validate that this is performant both in terms of throughput and model quality.
This will be a little investigatory, as it is unclear what the precise solution could look like. Right now, we have a large preprocessing script that we inherited from the original bert repo for preprocessing bert inputs for pretraining. Overall it is quite long, and does not leverage KerasNLP components.
We would like a simpler preprocessing approach that can leverage tf.data and KerasNLP layers for tokenization, packing, and masking text.