uoguelph-mlrg / Theano-MPI

MPI Parallel framework for training deep learning models built in Theano

Cifar10 and Resnet Code Compiles But Does Not Run to Completion #26

Open Nqabz opened 7 years ago

Nqabz commented 7 years ago

@hma02: I now have the packages compiled correctly. However, when I run the BSP- (or EASGD-) based Cifar10_model (including ResNet), the behavior still seems odd on my end:

#launch_session.py
from theanompi import BSP
#from theanompi import EASGD

#rule=EASGD()
rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1', 'cuda2'],
          modelfile='theanompi.models.cifar10',
          modelclass='Cifar10_model')
rule.wait()
Using cuDNN version 6021 on context None
Preallocating 10943/11519 Mb (0.950000) on cuda2
Mapped name None to device cuda2: Tesla K80 (0000:08:00.0)
Using cuDNN version 6021 on context None
Preallocating 10943/11519 Mb (0.950000) on cuda0
Mapped name None to device cuda0: Tesla K80 (0000:04:00.0)
Using Theano backend.
Using Theano backend.
Using Theano backend.
rank0: bad list is [], extended to 195
rank0: bad list is [], extended to 39
Cifar10_model
Layer Subtract       in (3, 32, 32, 256) --> out (3, 32, 32, 256)
Layer Crop       in [  3  32  32 256] --> out (3, 28, 28, 256)
Layer Dimshuffle         in [  3  28  28 256] --> out (256, 3, 28, 28)
Layer Conv (cudnn)   in [256   3  28  28] --> out (256, 64, 24, 24)
Layer Pool       in [256  64  24  24] --> out (256, 64, 12, 12)
Layer Conv (cudnn)   in [256  64  12  12] --> out (256, 128, 8, 8)
Layer Pool       in [256 128   8   8] --> out (256, 128, 4, 4)
Layer Conv (cudnn)   in [256 128   4   4] --> out (256, 64, 2, 2)
Layer Flatten        in [256  64   2   2] --> out (256, 256)
Layer FC         in [256 256] --> out (256, 256)
Layer Dropout0.5     in [256 256] --> out (256, 256)
Layer Softmax        in [256 256] --> out (256, 10)
[64  3  5  5]
[64]
[128  64   5   5]
[128]
[ 64 128   3   3]
[64]
[256 256]
[256]
[256  10]
[10]
model size 0.336 M floats
compiling training function...
compiling validation function...
Compile time: 3.236 s
calculating lr warming up power base: 1.246
learning rate 0.010000 will be used for epoch 0

The terminal output stays as above until my terminal session times out, which takes at least 3 hours. I tried 1, 2, and 3 GPUs and got the same behavior each time.

I checked my devices: GPU utilization stays at 0% even though 95% of the memory is allocated.

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:04:00.0     Off |                    0 |
| N/A   47C    P0    58W / 149W |  11081MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   38C    P0    72W / 149W |  11122MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   45C    P0    62W / 149W |  11122MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   29C    P8    30W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   33C    P8    26W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:86:00.0     Off |                    0 |
| N/A   28C    P8    29W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   34C    P8    25W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:8A:00.0     Off |                    0 |
| N/A   26C    P8    29W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10841    C   python                                       11056MiB |
|    1     10842    C   python                                       11097MiB |
|    2     10844    C   python                                       11097MiB |
+-----------------------------------------------------------------------------+
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 22819 root      20   0  665952  24612   5540 R   5.6  0.0   0:00.17 node
    10 root      20   0       0      0      0 S   0.7  0.0 283:05.93 rcu_sched
  3202 root      20   0       0      0      0 S   0.3  0.0   0:00.87 kworker/21:2
  5571 root      20   0       0      0      0 S   0.3  0.0   0:01.34 kworker/6:9

Where in your code do I change the device memory allocation? Could the hang be due to memory allocation?

hma02 commented 7 years ago

I just made some changes regarding Cifar10_model and Wide_ResNet. You may want to pull it from master.

As for your hanging problem, I would recommend debugging it from here. Put some prints before and after

model.train_iter

and

exchanger.exchange

to see where it gets stuck. This is normally how I debug it. I would guess it gets stuck in the exchanging part. If so, check whether the NCCL collectives here work; you can write a toy script like this to test them (a sketch is included below). This depends on NCCL and libgpuarray/pygpu being installed correctly.
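
For reference, here is a minimal standalone check of the NCCL collectives through pygpu, in the spirit of the toy test mentioned above. This is not code from the thread: the device naming, the use of mpi4py to share the clique id, and the exact pygpu.collectives signatures (GpuCommCliqueId, GpuComm, all_reduce and its op string) are assumptions that may differ between libgpuarray versions, so treat it as a sketch rather than the referenced script.

# toy_allreduce_test.py -- hedged sketch, not from this thread.
# Run with e.g.: mpirun -np 2 python toy_allreduce_test.py
# Assumes mpi4py, pygpu/libgpuarray built with NCCL support, and one GPU per rank.
import numpy as np
from mpi4py import MPI
import pygpu
from pygpu import collectives

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# One GPU per MPI rank (assumes cuda0, cuda1, ... are visible).
ctx = pygpu.init('cuda%d' % rank)

# Build the NCCL clique: rank 0's unique id is adopted by every rank.
clique_id = collectives.GpuCommCliqueId(context=ctx)
clique_id.comm_id = comm.bcast(clique_id.comm_id, root=0)
gpucomm = collectives.GpuComm(clique_id, size, rank)

# Each rank contributes an array filled with its rank; after all_reduce
# every element should equal 0 + 1 + ... + (size - 1) on every rank.
src = pygpu.gpuarray.array(np.full(8, rank, dtype='float32'), context=ctx)
dest = pygpu.gpuarray.empty((8,), dtype='float32', context=ctx)
gpucomm.all_reduce(src, 'sum', dest)

print('rank %d: got %s, expected %d' % (rank, np.asarray(dest), sum(range(size))))

If this hangs or crashes while the timing prints around model.train_iter complete normally, the problem is most likely in the NCCL/pygpu setup rather than in the Theano graph.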

I didn't have the hanging problem even before my last commits. Here is how it runs now with Wide_ResNet:

mahe6562@cop8 8-2 $ nvidia-smi
Tue Aug  8 11:55:23 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.93     Driver Version: 352.93         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   49C    P0   146W / 149W |   2505MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   67C    P0   147W / 149W |   2485MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   46C    P0   149W / 149W |   2530MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:0A:00.0     Off |                    0 |
| N/A   37C    P8    30W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   22C    P8    26W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   26C    P8    29W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   21C    P8    25W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:89:00.0     Off |                    0 |
| N/A   25C    P8    28W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    169548    C   /opt/sharcnet/python/2.7.10/intel/bin/python  2448MiB |
|    1    169549    C   /opt/sharcnet/python/2.7.10/intel/bin/python  2428MiB |
|    2    169550    C   /opt/sharcnet/python/2.7.10/intel/bin/python  2473MiB |
+-----------------------------------------------------------------------------+
mahe6562@cop8 8-2 $ top
top - 11:56:23 up 216 days, 46 min,  3 users,  load average: 4.06, 3.88, 3.58
Tasks: 599 total,   4 running, 595 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  4.7%sy,  0.0%ni, 94.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 78.1%us, 21.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 76.5%us, 23.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 72.8%us, 27.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98957020k total, 37396796k used, 61560224k free,  3371692k buffers
Swap:        0k total,        0k used,        0k free, 23229008k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                      
169550 mahe6562  20   0  199g 2.3g 143m R 100.2  2.4  69:23.97 python                                                                                                       
169548 mahe6562  20   0  199g 2.3g 143m R 99.8  2.4  68:26.97 python                                                                                                        
169549 mahe6562  20   0  199g 2.3g 143m R 99.8  2.4  67:17.42 python                                                                                                        
  4218 nobody    20   0  260m  49m 2088 S  3.3  0.1   3568:01 gmond                                                                                                         
    67 root      20   0     0    0    0 S  1.7  0.0   2346:01 events/0                                                                                                      
    68 root      20   0     0    0    0 S  0.3  0.0   1646:14 events/1                                                                                                      
172324 mahe6562  20   0 16332 1692  964 R  0.3  0.0   0:00.09 top                                                                                                           
     1 root      20   0 21452 1232  928 S  0.0  0.0   0:19.65 init                                                                                                          
     2 root      20   0     0    0    0 S  0.0  0.0   0:00.50 kthreadd                                                                                                      
     3 root      RT   0     0    0    0 S  0.0  0.0   1:12.79 migration/0
hma02 commented 7 years ago

@Nqabz

The memory preallocation part looks weird to me. I don't have this configured anywhere (e.g., cnmem in .theanorc), and I don't see it in my standard output or error.
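
For what it's worth (not from the thread): with the Theano gpuarray backend, the preallocation fraction shown in the worker logs above ("Preallocating ... (0.950000)") corresponds to the standard gpuarray.preallocate Theano flag, which can be set machine-wide in ~/.theanorc or per run via THEANO_FLAGS. Whether Theano-MPI hard-codes 0.95 somewhere in its launcher is an assumption that would need checking in its source. A minimal .theanorc sketch:

# ~/.theanorc -- standard Theano configuration file (sketch, not from this thread)
[gpuarray]
# Fraction of GPU memory to grab up front; 0.95 matches the logs above.
# Lower it, or set it to 0 to allocate on demand, to rule out memory pressure.
preallocate = 0.5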