google-research / bigbird

Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062
Apache License 2.0

RoBERTa Training #3

Open agemagician opened 3 years ago

agemagician commented 3 years ago

Hello,

First, congratulations on your work.

Second, from what I have discovered so far, you only allow BERT-like training, not RoBERTa training. Even if NSP is set to false, your script still requires the "next_sentence_labels" field, which is generated by the BERT script.

My question is: how can we generate data for and train a RoBERTa-like model, where there is only a single sequence per example and no NSP?

@manzilz @ppham27 your feedback is highly appreciated. Thanks in advance for your reply.

manzilz commented 3 years ago

I think it should be possible to train like RoBERTa. If NSP is set to false, next_sentence_labels is just a dummy; you can fill it with all 0s or 1s, and it isn't used for any gradient update. For example, in the simple TFDS-based example in bigbird/pretrain/run_pretraining.py, we train with a single sequence per example without NSP, and next_sentence_labels is set to all 0 on line 193.
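
A minimal sketch of that idea (not the exact code from run_pretraining.py; the helper name is illustrative):

```python
import tensorflow as tf

def add_dummy_nsp_label(example):
  # With NSP disabled the label never contributes to the loss; an all-zero
  # placeholder just keeps the expected feature present in each example.
  example["next_sentence_labels"] = tf.zeros([1], dtype=tf.int64)
  return example

# Hypothetical usage on an existing tf.data pipeline:
# dataset = dataset.map(add_dummy_nsp_label)
```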

agemagician commented 3 years ago

Thanks a lot @manzilz for your reply.

So, we need to disable "preprocessed_data" and perform masking on the fly.

Could you please give us a concrete example of using on-the-fly masking with a local dataset for pre-training?

From the code, I can see we have to either use the "tfds" online hub or TFRecords that are already preprocessed.

Assuming I have a local folder with several text files like "data/data_*.txt", what are the correct commands to train a RoBERTa-like model? Or must the data be on the TFDS online hub?

manzilz commented 3 years ago

We used TFRecords or TFDS for efficiency, as reading a large corpus of text files is not as efficient when training at large scale. I think it should be easy to modify bigbird/pretrain/run_pretraining.py to read from local text files instead of TFDS. Instead of getting text from TFDS on line 297, we can insert code that builds a tf.data.Dataset from local text files. Then do_masking should handle the on-the-fly masking, and the rest should work.
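
A minimal sketch of that replacement (the helper name and file pattern are illustrative, not part of the repo):

```python
import tensorflow as tf

def local_text_dataset(file_pattern="data/data_*.txt"):
  """Build a dataset of {"text": ...} examples from local text files, one example per line."""
  files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
  lines = files.interleave(tf.data.TextLineDataset, cycle_length=4)
  # Wrap each line in the same dict structure the TFDS path yields, so the
  # downstream tokenization and do_masking steps can be reused unchanged.
  return lines.map(lambda line: {"text": line})
```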

agemagician commented 3 years ago

Thanks again for your explanation.

Could you please provide the code for generating the TFRecords?

Unfortunately, if we use the BERT TFRecord generator, it will combine two sequences per example.

Is it possible to share either:

  1. a TFRecord generator script that produces preprocessed masked ids with a single sentence per example, or
  2. a TFRecord generator script that produces TFRecords with "example["text"]" so that masking can be performed on the fly (a rough sketch of this option follows below)?

This would be super useful for training on a new custom dataset.
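
As a rough illustration of option 2 only (not the official generator script; the function name and output path are illustrative), writing TFRecords that carry just a raw "text" feature could look like:

```python
import tensorflow as tf

def write_text_tfrecord(lines, output_path="data/pretrain.tfrecord"):
  """Write one tf.Example per input line, with a single bytes feature "text"."""
  with tf.io.TFRecordWriter(output_path) as writer:
    for line in lines:
      feature = {
          "text": tf.train.Feature(
              bytes_list=tf.train.BytesList(value=[line.encode("utf-8")]))
      }
      example = tf.train.Example(features=tf.train.Features(feature=feature))
      writer.write(example.SerializeToString())
```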