nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

Problem when speeding up fine-tuning bert-base-uncased on ReCoRD #1099

Open jeswan opened 3 years ago

jeswan commented 3 years ago

Issue by ThangPM Saturday Jun 27, 2020 at 16:53 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/1099


Hello,

I am trying to reproduce the results for the ReCoRD task by fine-tuning the bert-base-uncased model, but it takes days on 1 GPU (Tesla V100) because the training set is quite large (~1.13M examples).

python main.py --config jiant/config/superglue_bert.conf --overrides random_seed = 42, cuda = 0, run_name = record, pretrain_tasks = "record", target_tasks = "record", do_pretrain = 1, do_target_task_training = 0, do_full_eval = 1, batch_size = 8, val_interval = 10000, val_data_limit = -1

06/26 12:38:54 PM: Update 340556: task record, steps since last val 556 (total steps = 340556): f1: 0.5516, em: 0.5403, avg: 0.5459, record_loss: 0.1962
06/26 12:39:04 PM: Update 340603: task record, steps since last val 603 (total steps = 340603): f1: 0.5577, em: 0.5446, avg: 0.5512, record_loss: 0.1899

It takes 10 seconds for (340603 - 340556) = 47 steps, i.e. roughly 4.7 steps/sec.

To speed this up, I switched to 8 GPUs (still Tesla V100) and increased batch_size from 8 to 128, but training now seems slower than on 1 GPU.

python main.py --config jiant/config/superglue_bert.conf --overrides random_seed = 42, cuda = auto, run_name = record, pretrain_tasks = "record", target_tasks = "record", do_pretrain = 1, do_target_task_training = 0, do_full_eval = 1, batch_size = 128, val_interval = 10000, val_data_limit = -1

06/27 12:50:54 PM: Update 452155: task record, steps since last val 2155 (total steps = 452155): f1: 0.2494, em: 0.2414, avg: 0.2454, record_loss: 0.3956
06/27 12:51:08 PM: Update 452170: task record, steps since last val 2170 (total steps = 452170): f1: 0.2492, em: 0.2413, avg: 0.2453, record_loss: 0.3953

Now it takes around 14 seconds for only 15 steps, i.e. roughly 1.1 steps/sec. Am I doing anything wrong, or is this an issue?
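One caveat when comparing the two logs: raw steps/sec is misleading once the batch size changes, because each 8-GPU step processes 16x as many examples. A quick back-of-the-envelope check (a sketch, using the step counts and timestamps from the log excerpts above) suggests the 8-GPU run actually has higher example throughput:

```python
def examples_per_sec(steps, seconds, batch_size):
    """Effective throughput: examples processed per second of wall clock."""
    return steps / seconds * batch_size

# Numbers taken from the log excerpts above.
one_gpu = examples_per_sec(steps=47, seconds=10, batch_size=8)
eight_gpu = examples_per_sec(steps=15, seconds=14, batch_size=128)

print(f"1 GPU : {one_gpu:.1f} examples/sec")
print(f"8 GPUs: {eight_gpu:.1f} examples/sec ({eight_gpu / one_gpu:.1f}x faster)")
```

So in examples/sec the multi-GPU run is ahead (~137 vs ~38), even though steps tick by more slowly; whether that speedup is acceptable for 8x the hardware is a separate question.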

Any comments would be appreciated.

jeswan commented 3 years ago

Comment by sleepinyourhat Tuesday Jun 30, 2020 at 19:09 GMT


@phu-pmh, @pruksmhc - Any guess what's up here?