RuABraun opened this issue 4 years ago:

I have two 1080 Tis. When I try to use kmcuda with both, by setting CUDA_VISIBLE_DEVICES to the GPUs I use for compute (I've tried this with device=0 and device=3), it gets stuck on `transposing the samples...`; utilization shows 100%, but power usage isn't high at all. The data has shape (69673, 256). I've waited a couple of minutes, and it usually takes half a minute on one GPU. If I Ctrl-C/Z it or kill the PID, the process becomes a zombie and I'm forced to restart to kill it (and use the GPUs again).
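A minimal sketch of this kind of two-GPU call, assuming the libKMCUDA Python API from the project README (the reporter's actual `km.py` is not shown in this thread; the cluster count mirrors the 60,000 → 6,000 run below):

```python
# Hypothetical reproduction, assuming the libKMCUDA Python API from the
# README; the reporter's actual km.py is not shown in this thread.
import numpy
from libKMCUDA import kmeans_cuda

samples = numpy.random.rand(69673, 256).astype(numpy.float32)
# device is a bitmask: 1 = first GPU, 2 = second, 3 = both, 0 = all GPUs
centroids, assignments = kmeans_cuda(samples, 6000, seed=3, device=3,
                                     verbosity=2)
```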
I wonder if this is a bad "magic" shape. Can you please try sizes from 69600 to 69700?
Also, can you please post the exact log at the maximum verbosity level? Are you using the Python, R, or native interface? Linux, Windows, or macOS? Which CUDA version? Did you compile it yourself or use the wheel?
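One hedged way to run that sweep, again assuming the libKMCUDA API sketched above; since the failure mode is a hang rather than an error, the loop should simply stop progressing at the first bad size:

```python
# Hypothetical sweep over the suspected "magic" sizes 69600..69700,
# assuming the libKMCUDA API; the loop is expected to hang (not raise)
# at the first bad size, so the last printed n is the culprit.
import numpy
from libKMCUDA import kmeans_cuda

data = numpy.random.rand(69700, 256).astype(numpy.float32)
for n in range(69600, 69701):
    print("trying n =", n, flush=True)
    # clusters = n // 10 mirrors the reporter's 60,000 -> 6,000 ratio
    kmeans_cuda(data[:n], n // 10, seed=3, device=3, verbosity=0)
```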
I'm using the Python interface. Ubuntu 16.04, CUDA 10.2.
I compiled it following the instructions in the README, then moved the .so to where I'm using it.
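Since the .so was moved by hand, a quick sanity check (assuming the extension is imported as `libKMCUDA`) is to confirm which copy Python actually loads:

```python
# Confirm which copy of the extension Python actually imports, in case
# a stale libKMCUDA.so shadows the freshly rebuilt one.
import libKMCUDA
print(libKMCUDA.__file__)
```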
It fails for me with 60,000 as well.
Here's the output (still running):
seni@seni-MS-7A32:/work/fun/subword-repr$ CUDA_VISIBLE_DEVICES="0,1" py3 km.py data/repr_nums_small
2020-01-29 20:24:33.629 | INFO | __main__:kmeans:11 - Starting, data shape is (60000, 256).
arguments: 1 0x7ffd7a3368e4 0.001 0.10 0 60000 256 6000 3 0 0 2 0x7ff717f96010 0x7ff7179b9010 0x2e1bae0 0x7ffd7a33690c
reassignments threshold: 60
yinyang groups: 600
GPU #0 memory: used 318242816 bytes (2.7%), free 11403264000 bytes, total 11721506816 bytes
GPU #1 memory: used 318242816 bytes (2.7%), free 11401887744 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1875, 8), (8, 32)>>> 60000, 256, xyswap
Here's nvidia-smi:
seni@seni-MS-7A32:/work/fun/subword-repr$ nvidia-smi
Wed Jan 29 20:31:09 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 1030 Off | 00000000:08:00.0 On | N/A |
| 43% 46C P0 N/A / 30W | 391MiB / 2001MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 31% 55C P2 82W / 250W | 431MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:42:00.0 On | N/A |
| 21% 56C P2 88W / 250W | 431MiB / 11177MiB | 100% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1752 G /usr/lib/xorg/Xorg 222MiB |
| 0 2545 G compiz 148MiB |
| 0 3636 G ...quest-channel-token=9743865575637794024 18MiB |
| 1 116273 C python3.6 291MiB |
| 2 116273 C python3.6 291MiB |
+-----------------------------------------------------------------------------+
Maybe it's because of Exclusive Process?
With verbosity=3:
arguments: 1 0x7ffda37bcbf4 0.001 0.10 0 60000 256 6000 3 0 0 3 0x7fcd46527010 0x7fcd45f4a010 0x1f43050 0x7ffda37bcc1c
reassignments threshold: 60
yinyang groups: 600
[0] *dest: 0x7fcd18000000 - 0x7fcd1ba98000 (61440000)
[1] *dest: 0x7fccfc000000 - 0x7fccffa98000 (61440000)
[0] device_centroids: 0x7fcd07000000 - 0x7fcd075dc000 (6144000)
[1] device_centroids: 0x7fcd07600000 - 0x7fcd07bdc000 (6144000)
[0] device_assignments: 0x7fccffc00000 - 0x7fccffc3a980 (240000)
[1] device_assignments: 0x7fccffe00000 - 0x7fccffe3a980 (240000)
[0] device_assignments_prev: 0x7fccffc3aa00 - 0x7fccffc75380 (240000)
[1] device_assignments_prev: 0x7fccffe3aa00 - 0x7fccffe75380 (240000)
[0] device_ccounts: 0x7fccffc75400 - 0x7fccffc7b1c0 (24000)
[1] device_ccounts: 0x7fccffe75400 - 0x7fccffe7b1c0 (24000)
[0] device_assignments_yy: 0x7fccffc7b200 - 0x7fccffc80fc0 (24000)
[1] device_assignments_yy: 0x7fccffe7b200 - 0x7fccffe80fc0 (24000)
[0] device_bounds_yy: 0x7fccf6000000 - 0x7fccfa4c76c0 (72120000)
[1] device_bounds_yy: 0x7fccf0000000 - 0x7fccf44c76c0 (72120000)
[0] device_drifts_yy: 0x7fccf4600000 - 0x7fccf4be1dc0 (6168000)
[1] device_drifts_yy: 0x7fccf4c00000 - 0x7fccf51e1dc0 (6168000)
[0] device_passed_yy: 0x7fccffc81000 - 0x7fccffc9e4c0 (120000)
[1] device_passed_yy: 0x7fccffe81000 - 0x7fccffe9e4c0 (120000)
[0] device_centroids_yy: 0x7fccffc9e600 - 0x7fccffd34600 (614400)
[1] device_centroids_yy: 0x7fccffe9e600 - 0x7fccfff34600 (614400)
GPU #0 memory: used 318242816 bytes (2.7%), free 11403264000 bytes, total 11721506816 bytes
GPU #1 memory: used 318242816 bytes (2.7%), free 11401887744 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1875, 8), (8, 32)>>> 60000, 256, xyswap
100% GPU utilization means that the code has entered an infinite loop. It is at https://github.com/src-d/kmcuda/blob/master/src/transpose.cu#L30
CUDA 10+ has not been tested yet, so this must be a code-compatibility problem and not a "real" bug. Let's try three things:

1. Remove `volatile` in all the `*.cu` files, then recompile the .so. I observed some miscompilations on the newer CUDA versions and even reported them to NVIDIA, but never had time to convince them that there was a problem. Efficiency will likely degrade, but we are trying to make the code work at all.
2. Is there anything relevant printed in `dmesg`?
3. Run the code with `volatile` under `cuda-memcheck`. It will run ~20x slower, but we should see the problems reported, if any.
I did 1. with `sed -i 's/volatile//g' src/*cu`, recompiled, and ran again; it still hangs.
Sorry, I'm not that experienced with this sort of thing; what do you mean by "anything relevant printed in `dmesg`"? If I run `dmesg` I get lots of output; the only part that is probably related is:
[ 1208.838449] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090000040 flags=0x0020]
[ 1208.838460] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ac139068 flags=0x0020]
[ 1208.838467] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900007c0 flags=0x0020]
[ 1208.838474] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090001040 flags=0x0020]
[ 1208.838481] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090000f40 flags=0x0020]
[ 1208.838488] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900016c0 flags=0x0020]
[ 1208.838495] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090002040 flags=0x0020]
[ 1208.838502] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090001e40 flags=0x0020]
[ 1208.838508] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900025c0 flags=0x0020]
[ 1208.838515] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090003040 flags=0x0020]
[ 1208.838522] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090002d40 flags=0x0020]
[ 1208.838528] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900034c0 flags=0x0020]
[ 1208.838535] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090004040 flags=0x0020]
[ 1208.838542] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090003c40 flags=0x0020]
[ 1208.838548] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900043c0 flags=0x0020]
[ 1208.838554] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090005040 flags=0x0020]
[ 1208.838561] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090004b40 flags=0x0020]
[ 1208.838567] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900052c0 flags=0x0020]
[ 1208.838574] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090006040 flags=0x0020]
[ 1208.838580] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090005a40 flags=0x0020]
Note this is from while the program was hanging on the GPUs. Is that what I was supposed to do?
Also, I don't actually know whether running the Python program under cuda-memcheck will work (my worry is that it only works when invoking an actual binary).
The dmesg log is actually very insightful. `IO_PAGE_FAULT` signals nasty system-bus errors, and I also see `AMD-Vi`. AMD-Vi is AMD's IOMMU (virtualization) kernel driver, which is well known for causing problems with multi-GPU communication, along with Intel's VT-d. Two things to try:

1. Run nccl-tests; `all_reduce` should either fail or hang.
2. Boot with `amd_iommu=off` or `amd_iommu=soft` (add it to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and run `update-grub`, then reboot).
I ran `build/all_reduce_perf` and it worked, as did all the others.
Should I still try step 3?
OK, then let's try `cuda-memcheck`.
Not really getting much output...
seni@seni-MS-7A32:/work/fun/subword-repr$ cuda-memcheck --leak-check full --print-level info py3 km.py data/repr_nums_small
========= CUDA-MEMCHECK
2020-01-30 12:23:43.157 | INFO | __main__:kmeans:11 - Starting, data shape is (67673, 256).
arguments: 1 0x7ffdac337424 0.001 0.05 0 67673 256 33836 3 0 0 3 0x7fc777262010 0x7fc775156010 0x25bfc40 0x7ffdac33744c
reassignments threshold: 67
yinyang groups: 1691
[0] *dest: 0x7fc728000000 - 0x7fc72c216400 (69297152)
[1] *dest: 0x7fc755200000 - 0x7fc759416400 (69297152)
[0] device_centroids: 0x7fc72c400000 - 0x7fc72e50b000 (34648064)
[1] device_centroids: 0x7fc71c000000 - 0x7fc71e10b000 (34648064)
[0] device_assignments: 0x7fc71e200000 - 0x7fc71e242164 (270692)
[1] device_assignments: 0x7fc71e400000 - 0x7fc71e442164 (270692)
[0] device_assignments_prev: 0x7fc71e242200 - 0x7fc71e284364 (270692)
[1] device_assignments_prev: 0x7fc71e442200 - 0x7fc71e484364 (270692)
[0] device_ccounts: 0x7fc71e284400 - 0x7fc71e2a54b0 (135344)
[1] device_ccounts: 0x7fc71e484400 - 0x7fc71e4a54b0 (135344)
[0] device_assignments_yy: 0x7fc71e2a5600 - 0x7fc71e2c66b0 (135344)
[1] device_assignments_yy: 0x7fc71e4a5600 - 0x7fc71e4c66b0 (135344)
[0] device_bounds_yy: 0x7fc70e000000 - 0x7fc71ba665b0 (229008816)
[1] device_bounds_yy: 0x7fc700000000 - 0x7fc70da665b0 (229008816)
[0] device_drifts_yy: 0x7fc6fc000000 - 0x7fc6fe12c0b0 (34783408)
[1] device_drifts_yy: 0x7fc6f8000000 - 0x7fc6fa12c0b0 (34783408)
[0] device_passed_yy: 0x7fc71e2c6800 - 0x7fc71e2e931c (142108)
[1] device_passed_yy: 0x7fc71e4c6800 - 0x7fc71e4e931c (142108)
[0] device_centroids_yy: 0x7fc6fa200000 - 0x7fc6fa3a6c00 (1731584)
[1] device_centroids_yy: 0x7fc6fa400000 - 0x7fc6fa5a6c00 (1731584)
GPU #0 memory: used 544735232 bytes (4.6%), free 11176771584 bytes, total 11721506816 bytes
GPU #1 memory: used 544735232 bytes (4.6%), free 11175395328 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(2115, 8), (8, 32)>>> 67673, 256, xyswap
So there are no memory errors, except that there are memory errors according to `dmesg`.
Then try disabling the IOMMU.