google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Re-using gs://.../train.tf_record #585

Open · bittlingmayer opened this issue 5 years ago

bittlingmayer commented 5 years ago

When training at scale using preemptible TPUs, it's convenient to restart training from a checkpoint.

However, file_based_convert_examples_to_features is still rerun from scratch on every restart. It's not fast: depending on the sequence length, machine type, etc., it takes more than an hour per 10M records.

Is there any reason we can't check for an existing gs://.../train.tf_record and re-use the one that is already written (see the sketch below)? This assumes the dataset hasn't changed; there could be a flag for that, or a checksum, or it could be implied by not passing a train.tsv.

In either case, we could at least let it write in chunks instead of one record at a time.
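A minimal sketch of the existence check, assuming the surrounding names from run_classifier.py (FLAGS, train_examples, label_list, tokenizer); tf.gfile.Exists understands gs:// paths in TF 1.x:

```python
import os
import tensorflow as tf

train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
# Skip the slow conversion if the record file is already in the bucket.
# NOTE: this assumes the dataset and preprocessing flags are unchanged;
# otherwise a stale cache would be silently reused.
if not tf.gfile.Exists(train_file):
  file_based_convert_examples_to_features(
      train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
else:
  tf.logging.info("Re-using existing %s", train_file)
```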

Weizhuo-Zhang commented 5 years ago

Do we get charged during the "writing examples" step when using a Cloud TPU?

Weizhuo-Zhang commented 5 years ago

I think the tf_record files can be re-used only if your task processor, input file, and tokenizer config remain unchanged.
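One way to enforce that, sketched below: fingerprint everything that affects feature generation and store it next to the record file. The helper names and the .fingerprint sidecar file are hypothetical, not part of the repo:

```python
import hashlib
import json
import tensorflow as tf

def preprocessing_fingerprint(input_file, vocab_file, max_seq_length,
                              do_lower_case):
  # Hash the raw inputs plus the flags that change tokenization, so any
  # edit to the data or config invalidates the cached tf_record.
  h = hashlib.sha256()
  for path in (input_file, vocab_file):
    with tf.gfile.GFile(path, "rb") as f:
      h.update(f.read())
  h.update(json.dumps(
      {"max_seq_length": max_seq_length, "do_lower_case": do_lower_case},
      sort_keys=True).encode("utf-8"))
  return h.hexdigest()

def cache_is_valid(train_file, fingerprint):
  # The sidecar file records what the cached tf_record was built from.
  sidecar = train_file + ".fingerprint"
  if not (tf.gfile.Exists(train_file) and tf.gfile.Exists(sidecar)):
    return False
  with tf.gfile.GFile(sidecar, "r") as f:
    return f.read().strip() == fingerprint
```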

vgopinath commented 5 years ago

Can anyone point me to how to generate the train.tf_record and eval.tf_record files?
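(For context: in this repo those files are written by file_based_convert_examples_to_features during run_classifier.py's main(). A rough paraphrase of the relevant lines, not a verbatim quote:)

```python
# Paraphrase of run_classifier.py: convert the parsed examples into a
# TFRecord file under --output_dir before training starts.
train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
file_based_convert_examples_to_features(
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
# eval.tf_record is produced the same way from the eval examples.
```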