microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Multiple GPU training on LSTMs #2446

Open abhimohta opened 7 years ago

abhimohta commented 7 years ago

Because of this bug - https://github.com/Microsoft/CNTK/issues/2214 - I am not able to use next_minibatch. I want to try distributed training on my seq2seq model, but because of the above bug I can't use either of the approaches mentioned on your page (https://docs.microsoft.com/en-us/cognitive-toolkit/Multiple-GPUs-and-machines): a training session or next_minibatch.

I am creating my own minibatches, as done in https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106A_LSTM_Timeseries_with_Simulated_Data.ipynb. Is there a way I can implement multi-GPU training with this?

ke1337 commented 7 years ago

Yes, please refer to this code on how to feed data from Python for distributed training. Basically, each worker runs its own process, and it's important to feed data according to its rank. Don't feed the same data to all workers during distributed training, as that just effectively bumps up the learning rate.
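
A minimal sketch of that pattern, assuming a tiny placeholder model and a dummy next_batch generator in place of the actual seq2seq setup (the model, data, and variable names here are illustrative, and the C.Communicator calls follow the usage later in this thread); the distributed pieces are the data_parallel_distributed_learner wrapper, the rank-based minibatch filtering, and the final Communicator.finalize():

import numpy as np
import cntk as C

# Placeholder network and criterion (stand-ins for the real seq2seq model).
x = C.input_variable(10)
golden = C.input_variable(1)
z = C.layers.Dense(1)(x)
loss = C.squared_error(z, golden)
metric = C.squared_error(z, golden)

# Wrap the local learner so gradients are aggregated across all workers.
learner = C.sgd(z.parameters,
                lr=C.learning_rate_schedule(0.1, C.UnitType.minibatch))
dist_learner = C.train.distributed.data_parallel_distributed_learner(learner)
trainer = C.Trainer(z, (loss, metric), [dist_learner])

def next_batch():
    # Dummy generator yielding (inputs, targets) minibatches.
    for _ in range(100):
        yield (np.random.rand(8, 10).astype(np.float32),
               np.random.rand(8, 1).astype(np.float32))

rank = C.Communicator.rank()
num_workers = C.Communicator.num_workers()

for i, (inp, tgt) in enumerate(next_batch()):
    # Each worker trains only on its own slice of the minibatch stream.
    if i % num_workers == rank:
        trainer.train_minibatch({x: inp, golden: tgt})

C.Communicator.finalize()  # called before the worker processes exit

Each worker is a separate OS process; the script is launched once per worker via MPI (e.g. mpiexec -n 4 python train.py), which is also why the data ends up being loaded once per worker.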

abhimohta commented 7 years ago

I am saving models after every n (=100) epochs. I see 4 models being generated after 100 epochs (num_workers=4), and the performance is pathetic in comparison to the same 100-epoch run without distributed training. Is there something I need to take care of while saving models?

I'm doing this -

for cur_epoch in range(1, args.max_epochs):
    minibatch_count = 0
    for inp_seq, gt_scores in next_batch(train_data[0], train_data[1]):
        if minibatch_count % C.Communicator.num_workers() == C.Communicator.rank():
            trainer.train_minibatch({x:inp_seq, golden:gt_scores})
            training_loss = trainer.previous_minibatch_loss_average
        minibatch_count += 1
    if cur_epoch % 100 == 0:
        prediction.save('model - %s - %d.cmf' % (guid, cur_epoch))

Also, the data is loaded 4 times (which is logical: once per worker) and 4 processes start. I'm curious whether there is a way to run this as 1 main process with 4 distributed sub-processes rather than 4 independent processes, and whether there is a way to confirm which of the two is happening. The results of the 100-epoch model make me believe it is the latter!

ke1337 commented 7 years ago

You have two ways to save a model: Trainer.save_checkpoint or Function.save. The Trainer knows about distributed training, so the actual saving happens only in rank 0. A Function does not, so you need to save only in rank 0 yourself:

if 0 == C.Communicator.rank():
    prediction.save('model - %s - %d.cmf' % (guid, cur_epoch))
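
For completeness, a minimal sketch of the Trainer.save_checkpoint alternative mentioned above, reusing the trainer, guid, and cur_epoch names from the earlier snippets (the checkpoint filename is illustrative); all ranks call it, and the Trainer's distributed logic makes the actual save happen only in rank 0:

if cur_epoch % 100 == 0:
    # Every worker participates in the call; only rank 0 writes the file.
    trainer.save_checkpoint('checkpoint - %s - %d' % (guid, cur_epoch))

    # To resume training later:
    # trainer.restore_from_checkpoint('checkpoint - %s - %d' % (guid, cur_epoch))

Unlike Function.save, the checkpoint also stores the trainer state, so training can be resumed from it rather than just running inference.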