luyug / GC-DPR

Train Dense Passage Retriever (DPR) with a single GPU

Multi GPU support #2

Closed MXueguang closed 3 years ago

MXueguang commented 3 years ago

Hi, does the current version of encoder training with GC support multiple GPUs? I tried running the training on the NQ dataset by following the instructions in README.md, but on a machine with 2 GPUs. It seems to run slower than on a single GPU: on a single GPU one step takes about 4 sec, but with two GPUs one step takes about 24 sec.

luyug commented 3 years ago

We do have some local patches for multi-card training, but even the current TOT should not have an overhead this big.

You can probably run a profiler to see what is bottlenecking it.
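For example, a minimal sketch using Python's standard `cProfile` and `pstats` modules. Here `train_steps` is a hypothetical stand-in for running a couple of training steps, not a function from this repository:

```python
import cProfile
import io
import pstats


def train_steps(n_steps: int = 2) -> int:
    """Hypothetical placeholder for running a few training steps."""
    total = 0
    for _ in range(n_steps):
        total += sum(i * i for i in range(10_000))
    return total


# Profile only the region of interest.
profiler = cProfile.Profile()
profiler.enable()
train_steps()
profiler.disable()

# Sort by time spent inside each function itself (tottime)
# and keep only the top 15 rows of the report.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(15)
report = buffer.getvalue()
print(report)
```

A whole script can also be profiled without code changes via `python -m cProfile -s tottime` followed by the script name and its arguments.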

We can also help investigate the problem if you provide more information.

MXueguang commented 3 years ago

Hi @luyug, thank you for your help. I loaded the data and then ran two steps to measure the time.

This is the head of the profile when using two GPUs (two 2080 Ti, 11 GB each).

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   25.258    8.419   25.258    8.419 decoder.py:343(raw_decode)
       82    9.238    0.113    9.238    0.113 :0(run_backward)
  1156049    9.164    0.000   15.516    0.000 module.py:774(__setattr__)
2898444/367168    6.542    0.000    7.415    0.000 module.py:1215(named_modules)
     2114    4.500    0.002    4.500    0.002 :0(acquire)
3265133/3265132    3.398    0.000    3.397    0.000 :0(get)
      883    3.255    0.004    5.865    0.007 :0(read)
      310    3.243    0.010    3.243    0.010 :0(normal_)
       65    2.610    0.040    2.610    0.040 :0(utf_8_decode)
  2471319    2.546    0.000    2.548    0.000 :0(isinstance)
     2092    2.455    0.001    2.457    0.001 :0(to)
      504    2.263    0.004    2.263    0.004 :0(_scatter)
      168    1.831    0.011   29.583    0.176 replicate.py:78(replicate)
   145338    1.607    0.000   12.096    0.000 module.py:1376(_replicate_for_data_parallel)
      187    1.333    0.007    1.333    0.007 :0(_cuda_isDriverSufficient)
   955800    1.077    0.000    1.077    0.000 :0(items)
   136794    1.036    0.000    8.640    0.000 module.py:1048(_named_members)

vs. the profile from running on a single GPU:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   25.166    8.389   25.166    8.389 decoder.py:343(raw_decode)
       82    3.695    0.045    3.695    0.045 :0(run_backward)
      857    3.231    0.004    5.836    0.007 :0(read)
      310    3.229    0.010    3.229    0.010 :0(normal_)
       74    2.605    0.035    2.605    0.035 :0(utf_8_decode)
     1838    2.414    0.001    2.415    0.001 :0(to)
      172    1.585    0.009    1.585    0.009 :0(_cuda_isDriverSufficient)
      292    0.869    0.003    0.869    0.003 :0(uniform_)
    15362    0.724    0.000    0.724    0.000 :0(matmul)
36480/160    0.507    0.000    3.585    0.022 module.py:710(_call_impl)
      398    0.387    0.001    0.387    0.001 :0(copy_)
      412    0.387    0.001    0.387    0.001 :0(_set_from_file)
    11680    0.272    0.000    1.047    0.000 functional.py:1655(linear)
        1    0.235    0.235   31.222   31.222 __init__.py:274(load)
    27907    0.199    0.000    0.350    0.000 module.py:774(__setattr__)

**************** CONFIGURATION ****************
adam_betas                     -->   (0.9, 0.999)
adam_eps                       -->   1e-08
batch_size                     -->   128
checkpoint_file_name           -->   dpr_biencoder
ctx_chunk_size                 -->   8
dev_batch_size                 -->   16
dev_file                       -->   data/retriever/nq-dev.json
device                         -->   cuda
distributed_world_size         -->   1
do_lower_case                  -->   True
dropout                        -->   0.1
encoder_model_type             -->   hf_bert
eval_per_epoch                 -->   1
fix_ctx_encoder                -->   False
fp16                           -->   True
fp16_opt_level                 -->   O1
global_loss_buf_sz             -->   2097152
grad_cache                     -->   True
gradient_accumulation_steps    -->   1
hard_negatives                 -->   1
learning_rate                  -->   2e-05
local_rank                     -->   -1
log_batch_step                 -->   100
max_grad_norm                  -->   2.0
model_file                     -->   None
n_gpu                          -->   1
no_cuda                        -->   False
num_train_epochs               -->   40.0
other_negatives                -->   0
output_dir                     -->   model
pretrained_file                -->   None
pretrained_model_cfg           -->   bert-base-uncased
projection_dim                 -->   0
q_chunk_size                   -->   16
seed                           -->   12345
sequence_length                -->   256
shuffle_positive_ctx           -->   False
train_file                     -->   data/retriever/nq-train.json
train_files_upsample_rates     -->   None
train_rolling_loss_step        -->   100
val_av_rank_bsz                -->   128
val_av_rank_hard_neg           -->   30
val_av_rank_max_qs             -->   1000
val_av_rank_other_neg          -->   30
val_av_rank_start_epoch        -->   30
warmup_steps                   -->   1237
weight_decay                   -->   0.0

luyug commented 3 years ago

A few things

MXueguang commented 3 years ago

Ah, I launched with DP. Running with DDP works! Thanks for your help!
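This fits the profiles above: `replicate.py:78(replicate)` and `module.py:1376(_replicate_for_data_parallel)` show up only in the head of the two-GPU profile, and `torch.nn.DataParallel` copies the model to every GPU on each forward call. A rough back-of-the-envelope sketch of why that adds up under gradient caching follows. It assumes each question contributes one positive plus `hard_negatives` contexts, and that each chunk is encoded twice per step (once without gradients, once with); neither assumption is confirmed in this thread, and profiler overhead inflates the timings, so treat the result as an order-of-magnitude estimate only.

```python
# Values from the CONFIGURATION dump in this thread.
batch_size = 128
hard_negatives = 1
q_chunk_size = 16
ctx_chunk_size = 8

# Values from the two-GPU profile in this thread.
replicate_calls = 168
replicate_cumtime = 29.583  # seconds, cumtime of replicate.py:78(replicate)

# Assumption: one positive plus `hard_negatives` contexts per question.
num_contexts = batch_size * (1 + hard_negatives)

# Number of sub-batches (chunks) the encoders process per training step.
q_chunks = batch_size // q_chunk_size
ctx_chunks = num_contexts // ctx_chunk_size
chunks_per_step = q_chunks + ctx_chunks

# Assumption: every chunk is encoded twice per step
# (a no-grad pass to cache representations, then a pass with gradients).
forwards_per_step = 2 * chunks_per_step

# Under DP, every forward call triggers one model replication.
seconds_per_replicate = replicate_cumtime / replicate_calls
estimated_overhead = forwards_per_step * seconds_per_replicate

print(f"chunks per step: {chunks_per_step}")
print(f"forward calls per step: {forwards_per_step}")
print(f"replicate cost per call: {seconds_per_replicate:.3f} s")
print(f"estimated replication overhead per step: {estimated_overhead:.1f} s")
```

DDP avoids this cost because each process keeps its own copy of the model on its GPU instead of re-replicating it on every forward call.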