google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

run_pretraining.py doesn't read all input files while training on desktop, global step doesn't work like epoch logic #189

Open abdullaholuk opened 4 years ago

abdullaholuk commented 4 years ago

Hello, I am training a model with 55 GB of raw data. The system I am using has a Titan RTX, a Ryzen 3950X, and 128 GB of memory.

I split my corpus into 10 MB files and created the pretraining data with create_pretraining_data.py, using a max sequence length of 128, a dupe factor of 10, and 20 max predictions per sequence. Each pretraining data (tfrecord) file is almost 350 MB.
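For reference, here is a rough back-of-the-envelope check of the data expansion, assuming the 10 MB raw → 350 MB tfrecord ratio I observe holds across the whole corpus (the file count and sizes below are my own estimates, not exact measurements):

```python
# Rough estimate of the total pretraining-data volume (my own numbers, not exact).
raw_corpus_gb = 55
raw_file_mb = 10          # size of each raw text shard
tfrecord_file_mb = 350    # observed size of the corresponding tfrecord file

num_files = raw_corpus_gb * 1024 // raw_file_mb          # ~5,600 shards
expansion = tfrecord_file_mb / raw_file_mb               # ~35x (dupe factor 10 + tokenization overhead)
total_tfrecord_tb = num_files * tfrecord_file_mb / 1024 / 1024

print(f"{num_files} files, ~{expansion:.0f}x expansion, ~{total_tfrecord_tb:.1f} TB of tfrecords")
```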

When I run pretraining with a batch size of 64, I noticed from the Ubuntu 18.04 System Monitor (the per-process "disk total read" column) that run_pretraining.py reads only 425 MB of pretraining data over 5k global steps, which took 1 hour. Those 425 MB of tfrecords correspond to roughly 13 MB of raw Turkish corpus, which means covering the full 55 GB of training data would take about 180 days at this rate.
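The 180-day figure comes from extrapolating the read rate I measured above, roughly:

```python
# Extrapolating the observed read rate to the full corpus (my own measurements).
raw_mb_per_hour = 13          # raw corpus covered per hour (≈ 425 MB of tfrecords)
raw_corpus_mb = 55 * 1024

hours = raw_corpus_mb / raw_mb_per_hour
print(f"~{hours:.0f} hours ≈ {hours / 24:.0f} days for one pass over the raw corpus")
# ~4,332 hours ≈ 181 days
```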

Have you noticed that training does not read all of the pretraining data files when running on a TPU cluster, or is this specific to running on a desktop? Do training steps work differently from epoch logic? If so, what should I do to make sure the model sees my entire 55 GB corpus? I could decrease the dupe factor to 1, but would that be efficient?
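As I understand it (this is my reading of the code, please correct me if I am wrong), run_pretraining.py never counts epochs: it trains for a fixed number of global steps while the tf.data pipeline cycles through the input files, so whether the model ever sees the whole corpus depends entirely on the step count. A rough calculation, assuming each serialized 128-token example is about 1.5 KB:

```python
# Rough steps-per-epoch estimate (all sizes are my own assumptions).
total_tfrecord_bytes = 5632 * 350 * 1024 ** 2   # ~1.9 TB of tfrecords
bytes_per_example = 1500                        # assumed size of one 128-token example
batch_size = 64
dupe_factor = 10

total_examples = total_tfrecord_bytes / bytes_per_example
steps_one_pass = total_examples / batch_size
print(f"~{steps_one_pass / 1e6:.0f}M steps for one pass with dupe factor {dupe_factor}")
print(f"~{steps_one_pass / dupe_factor / 1e6:.1f}M steps with dupe factor 1")
# roughly 21-22M steps, or ~2M steps if the data is regenerated with dupe factor 1
```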

And lastly, do you plan to migrate the official code to TF 2.1 and support multi-GPU training with a distribution strategy (MirroredStrategy)?
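For context, what I have in mind is something along these lines; this is only a minimal TF 2.x sketch of the MirroredStrategy pattern with a stand-in model, not the official ALBERT code (which uses TPUEstimator on TF 1.x):

```python
# Minimal sketch of TF 2.x multi-GPU training with MirroredStrategy.
# Not the official ALBERT code; it only illustrates the distribution-strategy pattern.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicates the model on all visible GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

GLOBAL_BATCH = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    # Stand-in model; a real setup would build ALBERT here.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(30000, 128),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data standing in for the pretraining tfrecords.
x = tf.random.uniform((1024, 128), maxval=30000, dtype=tf.int32)
y = tf.random.uniform((1024,), maxval=2, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH)

model.fit(dataset, epochs=1)
```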