I just made some changes regarding Cifar10_model and Wide_ResNet. You may want to pull them from master.
As for your hanging problem, I would recommend debugging it from here. Put some prints before and after `model.train_iter` and `exchanger.exchange` to see where it gets stuck; this is normally how I debug it. My guess is that it gets stuck in the exchanging part. If so, check whether the NCCL collectives here work; you can write a small toy script like this to test them. This depends on NCCL and libgpuarray/pygpu being installed correctly.
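A minimal sketch of where such prints could go, assuming the training loop has roughly this shape (the actual loop in Theano-MPI and the exact call signatures may differ); the explicit flushes matter because buffered output can hide the last print before a hang:

```python
import sys

# Sketch only: bracket the two calls that can hang with flushed prints,
# so the last line printed tells you which call never returned.
for i in range(n_iterations):
    print('iter %d: entering train_iter' % i); sys.stdout.flush()
    model.train_iter(i, recorder)        # assumed call site
    print('iter %d: entering exchange' % i); sys.stdout.flush()
    exchanger.exchange(recorder)         # assumed call site
    print('iter %d: exchange done' % i); sys.stdout.flush()
```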
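And a rough stand-alone test of the NCCL collectives through pygpu, assuming pygpu was built with NCCL support and mpi4py is available; the file name is made up, and it assumes one GPU per MPI rank:

```python
# toy_allreduce.py -- hypothetical toy test; run e.g.: mpirun -n 2 python toy_allreduce.py
import numpy as np
from mpi4py import MPI
import pygpu
from pygpu import collectives

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

ctx = pygpu.init('cuda' + str(rank))   # one GPU per rank

# All ranks must share the same clique id: rank 0 creates it, the others adopt it.
cid = collectives.GpuCommCliqueId(context=ctx)
cid.comm_id = comm.bcast(cid.comm_id, root=0)
gpucomm = collectives.GpuComm(cid, size, rank)

src = pygpu.gpuarray.array(np.ones(8, dtype='float32') * (rank + 1), context=ctx)
dest = pygpu.gpuarray.empty((8,), dtype='float32', context=ctx)
gpucomm.all_reduce(src, op='sum', dest=dest)

# With 2 ranks this should print 3.0 everywhere; if it hangs here, NCCL is the problem.
print('rank %d: %s' % (rank, np.asarray(dest)))
```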
I didn't have the hanging problem even before my last commits. Here is how it runs now with Wide_ResNet:
mahe6562@cop8 8-2 $ nvidia-smi
Tue Aug 8 11:55:23 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.93 Driver Version: 352.93 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 49C P0 146W / 149W | 2505MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 67C P0 147W / 149W | 2485MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 46C P0 149W / 149W | 2530MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:0A:00.0 Off | 0 |
| N/A 37C P8 30W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 22C P8 26W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 26C P8 29W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 21C P8 25W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 169548 C /opt/sharcnet/python/2.7.10/intel/bin/python 2448MiB |
| 1 169549 C /opt/sharcnet/python/2.7.10/intel/bin/python 2428MiB |
| 2 169550 C /opt/sharcnet/python/2.7.10/intel/bin/python 2473MiB |
+-----------------------------------------------------------------------------+
mahe6562@cop8 8-2 $ top
top - 11:56:23 up 216 days, 46 min, 3 users, load average: 4.06, 3.88, 3.58
Tasks: 599 total, 4 running, 595 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3%us, 4.7%sy, 0.0%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu2 : 78.1%us, 21.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 76.5%us, 23.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 72.8%us, 27.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 98957020k total, 37396796k used, 61560224k free, 3371692k buffers
Swap: 0k total, 0k used, 0k free, 23229008k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
169550 mahe6562 20 0 199g 2.3g 143m R 100.2 2.4 69:23.97 python
169548 mahe6562 20 0 199g 2.3g 143m R 99.8 2.4 68:26.97 python
169549 mahe6562 20 0 199g 2.3g 143m R 99.8 2.4 67:17.42 python
4218 nobody 20 0 260m 49m 2088 S 3.3 0.1 3568:01 gmond
67 root 20 0 0 0 0 S 1.7 0.0 2346:01 events/0
68 root 20 0 0 0 0 S 0.3 0.0 1646:14 events/1
172324 mahe6562 20 0 16332 1692 964 R 0.3 0.0 0:00.09 top
1 root 20 0 21452 1232 928 S 0.0 0.0 0:19.65 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.50 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 1:12.79 migration/0
@Nqabz
The memory allocation part looks weird to me. I don't have this configured anywhere (like cnmem in .theanorc), and I don't see it in my standard output or error.
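For reference, if CNMeM or the libgpuarray preallocation were configured, it would look roughly like this in `.theanorc` (an example only; I don't have either set, and the 0.95 fraction is arbitrary):

```
# ~/.theanorc (example)

# old CUDA backend (device=gpu): CNMeM preallocation fraction
[lib]
cnmem = 0.95

# new libgpuarray backend (device=cuda): equivalent setting
[gpuarray]
preallocate = 0.95
```

A preallocation like this would explain most of the GPU memory being allocated while utilization stays near 0%, since the memory is grabbed up front before any kernels run.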
@hma02: I now have the packages compiled correctly. However, when running the BSP- (or EASGD-) based Cifar10_model (including ResNet), the behavior on my end still seems odd:
The terminal output stays as above until my terminal session times out, after more than 3 hours at least. I tried 1 GPU, 2 GPUs, and 3 GPUs and get the same behavior each time.
I checked my devices, and GPU utilization stays at 0% even though 95% of the memory is allocated.
Where do I change the device memory allocation in your code? Could the hang be due to memory allocation?