khanrc / pt.darts

PyTorch Implementation of DARTS: Differentiable Architecture Search
MIT License
439 stars · 108 forks

How to use multi-gpu #2

Closed: VectorYoung closed this issue 5 years ago

VectorYoung commented 5 years ago

Hi, thanks for the nice implementation. I am trying to modify the code to support multi-GPU, but it didn't work out. I don't know how to parallelize the Architect. Do you have any suggestions, or are you planning to add multi-GPU support? Thanks for your help.

khanrc commented 5 years ago

I was interested in a multi-GPU implementation at the beginning, but at the time I did not have enough time to do it.

IMHO, there is nothing inherently difficult about a multi-GPU implementation. The only problem is that the multi-GPU API in PyTorch is high-level. At a rough glance, it seems we only need to figure out how to parallelize `autograd.grad`. I will look into this issue as well.
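For reference, the unrolled second-order step in DARTS boils down to a `torch.autograd.grad` call, which returns gradients directly rather than accumulating them into `.grad`. A minimal sketch with toy shapes (not the repo's actual Architect code):

```python
import torch
import torch.nn as nn

# Toy model standing in for the search network (shapes are arbitrary).
model = nn.Linear(4, 2)
x = torch.randn(8, 4)
loss = model(x).sum()

# torch.autograd.grad returns the gradients as a tuple instead of
# writing them into each parameter's .grad attribute. This is the call
# the unrolled step relies on, and it is not something that
# nn.DataParallel wraps for you -- only forward() is parallelized.
grads = torch.autograd.grad(loss, model.parameters())
print(len(grads))  # one gradient tensor per parameter (weight and bias)
```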

VectorYoung commented 5 years ago

@khanrc Thanks for your reply. I am new to PyTorch and know little about parallel computing. I just followed the PyTorch tutorial and wrapped the model with `model = DataParallel(model)`, but it didn't work out. I think the `virtual_step` was not parallelized, but I don't know how to fix it. Thanks a lot for your help.
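One likely reason wrapping alone is not enough: `nn.DataParallel` only scatters inputs and replicates the module for calls that go through `forward()`. Custom methods such as `virtual_step` are not dispatched across devices and must be reached through the wrapper's `.module` attribute, where they run on a single device. A minimal illustration (hypothetical `Net` class, not the repo's code):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        # Only this method is scatter/replicate/gather-ed by DataParallel.
        return self.fc(x)

    def virtual_step(self):
        # Custom method: DataParallel does NOT parallelize this.
        return "runs on a single device"

net = nn.DataParallel(Net())
out = net(torch.randn(3, 4))          # parallelized path (forward)
msg = net.module.virtual_step()       # custom methods go through .module
print(out.shape, msg)
```

This is why simply wrapping the search model leaves the Architect's update running on one GPU: its logic lives in custom methods, not in `forward()`.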

khanrc commented 5 years ago

Multi-GPU support has been added :)

0xsamgreen commented 5 years ago

Hi @khanrc,

Thanks for the PyTorch DARTS implementation! I can run the single-GPU version of the code with no problem. However, the multi-GPU version hangs forever with the way I'm calling it.

`echo $CUDA_VISIBLE_DEVICES` returns `0,1,2`

`nvidia-smi` returns

Mon May  6 16:44:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:05:00.0 Off |                  N/A |
| 34%   48C    P0    60W / 250W |      0MiB / 12194MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 29%   44C    P0    57W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:09:00.0 Off |                  N/A |
| 27%   42C    P0    57W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   38C    P8     9W / 250W |    159MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I then run with `python search.py --name cifar10-mg --dataset cifar10 --gpus 0,1,2 --batch_size 256 --workers 16 --print_freq 10 --w_lr 0.1 --w_lr_min 0.004 --alpha_lr 0.0012`

I then get the following output, but then it hangs forever:
05/06 04:45:40 PM |
05/06 04:45:40 PM | Parameters:
05/06 04:45:40 PM | ALPHA_LR=0.0012
05/06 04:45:40 PM | ALPHA_WEIGHT_DECAY=0.001
05/06 04:45:40 PM | BATCH_SIZE=256
05/06 04:45:40 PM | DATA_PATH=./data/
05/06 04:45:40 PM | DATASET=cifar10
05/06 04:45:40 PM | EPOCHS=50
05/06 04:45:40 PM | GPUS=[0, 1, 2]
05/06 04:45:40 PM | INIT_CHANNELS=16
05/06 04:45:40 PM | LAYERS=8
05/06 04:45:40 PM | NAME=cifar10-mg
05/06 04:45:40 PM | PATH=searchs/cifar10-mg
05/06 04:45:40 PM | PLOT_PATH=searchs/cifar10-mg/plots
05/06 04:45:40 PM | PRINT_FREQ=10
05/06 04:45:40 PM | SEED=2
05/06 04:45:40 PM | W_GRAD_CLIP=5.0
05/06 04:45:40 PM | W_LR=0.1
05/06 04:45:40 PM | W_LR_MIN=0.004
05/06 04:45:40 PM | W_MOMENTUM=0.9
05/06 04:45:40 PM | W_WEIGHT_DECAY=0.0003
05/06 04:45:40 PM | WORKERS=16
05/06 04:45:40 PM |
05/06 04:45:40 PM | Logger is set - training start
Files already downloaded and verified
####### ALPHA #######
# Alpha - normal
tensor([[0.1249, 0.1252, 0.1249, 0.1249, 0.1249, 0.1252, 0.1249, 0.1250],
        [0.1249, 0.1251, 0.1250, 0.1250, 0.1252, 0.1249, 0.1250, 0.1250]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([[0.1249, 0.1253, 0.1250, 0.1250, 0.1249, 0.1250, 0.1250, 0.1249],
        [0.1251, 0.1247, 0.1249, 0.1253, 0.1248, 0.1249, 0.1251, 0.1253],
        [0.1250, 0.1250, 0.1249, 0.1251, 0.1252, 0.1250, 0.1251, 0.1249]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([[0.1250, 0.1249, 0.1250, 0.1250, 0.1250, 0.1251, 0.1249, 0.1251],
        [0.1250, 0.1249, 0.1251, 0.1249, 0.1249, 0.1253, 0.1252, 0.1248],
        [0.1250, 0.1250, 0.1251, 0.1248, 0.1251, 0.1250, 0.1251, 0.1250],
        [0.1250, 0.1251, 0.1250, 0.1251, 0.1251, 0.1251, 0.1249, 0.1248]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([[0.1250, 0.1250, 0.1249, 0.1249, 0.1251, 0.1250, 0.1249, 0.1252],
        [0.1250, 0.1252, 0.1252, 0.1248, 0.1250, 0.1249, 0.1248, 0.1251],
        [0.1251, 0.1251, 0.1250, 0.1251, 0.1248, 0.1250, 0.1249, 0.1249],
        [0.1249, 0.1250, 0.1250, 0.1253, 0.1251, 0.1251, 0.1247, 0.1249],
        [0.1251, 0.1251, 0.1252, 0.1249, 0.1249, 0.1251, 0.1250, 0.1249]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)

# Alpha - reduce
tensor([[0.1248, 0.1249, 0.1249, 0.1251, 0.1250, 0.1250, 0.1251, 0.1251],
        [0.1250, 0.1248, 0.1249, 0.1252, 0.1249, 0.1250, 0.1249, 0.1251]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([[0.1251, 0.1249, 0.1252, 0.1249, 0.1250, 0.1250, 0.1249, 0.1250],
        [0.1250, 0.1249, 0.1250, 0.1251, 0.1251, 0.1249, 0.1250, 0.1251],
        [0.1250, 0.1250, 0.1250, 0.1249, 0.1250, 0.1250, 0.1250, 0.1252]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([[0.1251, 0.1247, 0.1249, 0.1252, 0.1252, 0.1249, 0.1251, 0.1249],
        [0.1250, 0.1250, 0.1250, 0.1251, 0.1253, 0.1249, 0.1249, 0.1248],
        [0.1250, 0.1252, 0.1250, 0.1250, 0.1251, 0.1248, 0.1251, 0.1249],
        [0.1250, 0.1251, 0.1251, 0.1249, 0.1248, 0.1251, 0.1250, 0.1250]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
tensor([[0.1250, 0.1249, 0.1251, 0.1251, 0.1251, 0.1249, 0.1250, 0.1250],
        [0.1252, 0.1249, 0.1251, 0.1251, 0.1250, 0.1249, 0.1249, 0.1249],
        [0.1249, 0.1250, 0.1251, 0.1250, 0.1249, 0.1250, 0.1250, 0.1252],
        [0.1250, 0.1249, 0.1250, 0.1251, 0.1250, 0.1251, 0.1250, 0.1249],
        [0.1249, 0.1251, 0.1250, 0.1251, 0.1248, 0.1251, 0.1249, 0.1251]],
       device='cuda:0', grad_fn=<SoftmaxBackward>)
#####################

Do you know if I'm doing something wrong?

0xsamgreen commented 5 years ago

For anyone else hitting this: I've found that sometimes restarting the process is enough to get past the hang.