quark0 / darts

Differentiable architecture search for convolutional and recurrent networks
https://arxiv.org/abs/1806.09055
Apache License 2.0

train_search on multi-gpus #37

Open JaminFong opened 6 years ago

JaminFong commented 6 years ago

Hello, quark! Thanks for your great work. When I tried to run your train_search job with multiple GPUs, the Variables alphas_normal and alphas_reduce cause errors. The errors are shown below:

File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 111, in forward s0, s1 = s1, cell(s0, s1, weights) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in forward s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in <genexpr> s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states)) File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in forward return sum(w * op(x) for w, op in zip(weights, self._ops)) File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in <genexpr> return sum(w * op(x) for w, op in zip(weights, self._ops)) RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:314 For debugging the code, I tried to remove 'w' which is from alphas_normal or alphas_reduce in return sum(w * op(x) for w, op in zip(weights, self._ops)) Both 0.3 and 0.4 version of PyTorch have been tried, but the problem got no improvement. Could you please tell me how I can deal with the multi-gpu training work? And have you ever met any similar problem like this? Best and waiting for your reply!

arunmallya commented 6 years ago

That's because the arch_parameters are not being copied onto every GPU. DataParallel only copies parameters and buffers of a module to all GPUs. In the above code, the arch_parameters are Variables and as a result, they do not get copied, hence the error. You can try making them parameters, but then you will have to override the parameters() function so that only weight parameters are returned, and not the arch parameters.
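A minimal sketch of that suggestion (not the darts code itself; the class, sizes, and the weight_parameters() accessor are illustrative, and instead of overriding parameters() it exposes a separate accessor for the weight optimizer): register the alphas as nn.Parameter so DataParallel replicates them, and hand only the weight parameters to the weight optimizer.

import torch
import torch.nn as nn

class SearchNet(nn.Module):
    def __init__(self, num_edges=14, num_ops=8):
        super(SearchNet, self).__init__()
        self.stem = nn.Conv2d(3, 16, 3, padding=1)  # stands in for the cells
        # Registered as Parameters, so DataParallel copies them to every GPU.
        self.alphas_normal = nn.Parameter(1e-3 * torch.randn(num_edges, num_ops))
        self.alphas_reduce = nn.Parameter(1e-3 * torch.randn(num_edges, num_ops))

    def arch_parameters(self):
        return [self.alphas_normal, self.alphas_reduce]

    def weight_parameters(self):
        # Everything in parameters() except the architecture parameters.
        arch_ids = {id(p) for p in self.arch_parameters()}
        return [p for p in self.parameters() if id(p) not in arch_ids]

model = SearchNet()
w_optimizer = torch.optim.SGD(model.weight_parameters(), lr=0.025, momentum=0.9)
a_optimizer = torch.optim.Adam(model.arch_parameters(), lr=3e-4)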

However, DataParallel will not give you any speedup in this case. In fact it will be very slow, because copying over the modules before every forward takes a lot of time. There are around 5000 nested modules in the search network, whereas a large network like ResNet-101 has fewer than 400. This overhead will wipe out any possible benefit of data parallelization.
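As a rough way to check such module counts (a sketch; torchvision and ResNet-101 are just stand-ins, and the same one-liner works on a search network instance):

import torchvision.models as models

# Count nested modules in ResNet-101 for comparison; the darts search network
# can be counted the same way via sum(1 for _ in search_model.modules()).
resnet = models.resnet101()
print(sum(1 for _ in resnet.modules()))  # on the order of a few hundred modules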

JaminFong commented 6 years ago

@arunmallya Thanks for your reply. I'll give your suggestion a try. As for the speed of DataParallel, I think it may not help when the network is tiny, but I want to apply DARTS to larger networks on larger datasets. In that case a single GPU may not be able to handle the work.

quark0 commented 6 years ago

@arunmallya I agree with your points. Several people have asked about this, but I haven't had the chance to try it myself.

@JaminFong An alternative approach is to further reduce the batch size/number of channels during search, though this might lead to some additional discrepancies between search & evaluation.

JaminFong commented 6 years ago

@quark0 Yes, I tried reducing the number of layers or the image size to fit my experiment on one GPU. But I think if we want to extend DARTS to larger-scale tasks, data parallelism may be necessary. Best!

VectorYoung commented 5 years ago

@JaminFong Hi, have you tried to implement it with multiple GPUs? I am also going to search on a large task, but one GPU is not enough. Thanks.

JaminFong commented 5 years ago

@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.

QiuPaul commented 5 years ago

@JaminFong Hi, thanks for your work. But when I run with multiple GPUs, I get the error below. Have you met it before?

logits = self.model(input_valid)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 69, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 80, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 38, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 31, in scatter
    return scatter_map(inputs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 16, in scatter_map
    assert not torch.is_tensor(obj), "Tensors not supported in scatter."
AssertionError: Tensors not supported in scatter.

JaminFong commented 5 years ago

@QiuPaul Hi, are you using PyTorch 0.3? If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

QiuPaul commented 5 years ago

> @QiuPaul Hi, are you using PyTorch 0.3? If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

@JaminFong Yeah, thanks very much for your advice; it runs with PyTorch 1.0. More importantly, I found that in your code you made a modification:

optimizer = torch.optim.SGD(
    weight_params,  # model.parameters(),
    args.learning_rate,
    momentum=args.momentum,
    weight_decay=args.weight_decay)
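For reference, one hypothetical way such a weight_params list could be built is to filter the architecture parameters out by name; the actual helper in darts-multi_gpu may well differ:

# Hypothetical: keep only non-architecture parameters for the weight optimizer
# (assumes the alphas are registered parameters whose names contain 'alphas').
weight_params = [p for name, p in model.named_parameters() if 'alphas' not in name]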

Have you also encountered the problem below? Thanks. https://github.com/quark0/darts/issues/75

In the paper, while not converged do:

  1. Update weights w by descending ∇_w L_train(w, α)
  2. Update architecture α by descending ∇_α L_val(w', α), where w' is the updated w

Which means that when the weights are updated, α is fixed. However, in the original code below, when momentum is used to update the weights, all parameters in model.parameters(), including the arch_parameters, would be updated. Waiting to be confirmed, thanks.

optimizer = torch.optim.SGD(
    model.parameters(),
    args.learning_rate,
    momentum=args.momentum,
    weight_decay=args.weight_decay)

JaminFong commented 5 years ago

@QiuPaul In the original code, the architecture parameters (alphas_normal and alphas_reduce) are not included in model.parameters(). https://github.com/quark0/darts/blob/f276dd346a09ae3160f8e3aca5c7b193fda1da37/cnn/model_search.py#L123 Therefore, there is no need to filter the parameters in the original code. Please refer to how PyTorch registers module parameters.
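A minimal sketch of why that is (a toy module, not the darts code): a tensor assigned as a plain attribute is not registered by nn.Module, so it never appears in parameters():

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super(Toy, self).__init__()
        self.conv = nn.Conv2d(3, 8, 3)                        # registered as a parameter
        self.alphas = torch.randn(4, 5, requires_grad=True)   # plain attribute, not registered

m = Toy()
print([name for name, _ in m.named_parameters()])  # ['conv.weight', 'conv.bias'] only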

marsggbo commented 5 years ago

> @VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-GPU version for the first-order variant.

Thanks so much for your work. I have run your code, but it seems there is a problem, judging from the results I get (figure below).

I ran the code on Titan GPUs with a batch size of 64.

[screenshot: training results]

The problem is that multiple GPUs run even slower than a single GPU.

The running info of the GPUs is shown in the following screenshots: [screenshots omitted]

JaminFong commented 5 years ago

@marsggbo When running on multiple GPUs, DataParallel in PyTorch spends considerable time copying the model to all the devices before each forward pass, especially since the number of modules in the darts search network is much larger than in normal networks. So when your batch size is small, putting the model on multiple GPUs may not speed up training.

killawhale2 commented 5 years ago

@JaminFong I've looked at your implementation and only found instructions for running the 2nd-order version of the algorithm. Could you give the instructions for running just the first-order version?

Margrate commented 5 years ago

> @JaminFong I've looked at your implementation and only found instructions for running the 2nd-order version of the algorithm. Could you give the instructions for running just the first-order version?

--unrolled False ?
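For context, in the original quark0/darts train_search.py the flag is defined as a boolean switch, so the first-order update is what you get by simply omitting it; whether the multi-GPU fork keeps the same interface is an assumption:

parser.add_argument('--unrolled', action='store_true', default=False,
                    help='use one-step unrolled validation loss')

# first order:   python train_search.py ...
# second order:  python train_search.py --unrolled ...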

xjtuzll commented 5 years ago

@JaminFong I ran the multi-GPU code on Titan GPUs with a batch size of 64. It fails with the error below; have you met it before?

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict self.class.name, "\n\t".join(error_msgs))) RuntimeError: Error(s) in loading state_dict for Network: Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

JaminFong commented 5 years ago

> @JaminFong I ran the multi-GPU code on Titan GPUs with a batch size of 64. It fails with the error below; have you met it before?
>
> RuntimeError: Error(s) in loading state_dict for Network: Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", ...

When you load a model saved from multi-GPU training (i.e., wrapped in DataParallel), the parameter keys may come prefixed with module.. You need to check the key names of the state dict.
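A minimal sketch of that fix (the checkpoint path is a placeholder): strip the module. prefix that DataParallel adds to the keys before loading into a plain Network instance.

import torch

state_dict = torch.load('search_checkpoint.pt')  # placeholder path
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)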

Margrate commented 5 years ago

Can train.py run on multiple GPUs? Is drop_path not supported?

giangtranml commented 4 years ago

The first-order approximation leads to worse performance compared with the second-order approximation.

bitluozhuang commented 4 years ago

I have implemented a distributed PC-DARTS for the first-order version: https://github.com/bitluozhuang/Distributed-PC-Darts. Welcome to try it.