RuABraun opened this issue 4 years ago:

I have two 1080 Tis. When I try to use kmcuda with both, by setting CUDA_VISIBLE_DEVICES to the GPUs I use for compute (I've tried this with device=0 and device=3), it gets stuck on `transposing the samples...`; utilization shows 100%, but power usage isn't high at all. The data has shape (69673, 256). I've waited a couple of minutes, and it usually takes half a minute on one GPU. If I Ctrl-C/Z it or kill the PID, the process becomes a zombie and I'm forced to restart to kill it (and use the GPUs again).
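A minimal sketch of this kind of two-GPU call, assuming the libKMCUDA Python API from the project README (the reporter's actual `km.py` is not shown in this thread; the cluster count mirrors the 60,000 → 6,000 run below):

```python
# Hypothetical reproduction, assuming the libKMCUDA Python API from the
# README; the reporter's actual km.py is not shown in this thread.
import numpy
from libKMCUDA import kmeans_cuda

samples = numpy.random.rand(69673, 256).astype(numpy.float32)
# device is a bitmask: 1 = first GPU, 2 = second, 3 = both, 0 = all GPUs
centroids, assignments = kmeans_cuda(samples, 6000, seed=3, device=3,
                                     verbosity=2)
```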
I wonder if this is a bad "magic" shape. Can you please try sizes from 69600 to 69700?
Also, can you please post the exact log at the maximum verbosity level? Are you using the Python, R, or native interface? Linux, Windows, or macOS? Which CUDA version? Did you compile it yourself or use the wheel?
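One hedged way to run that sweep, again assuming the libKMCUDA API sketched above; since the failure mode is a hang rather than an error, the loop should simply stop progressing at the first bad size:

```python
# Hypothetical sweep over the suspected "magic" sizes 69600..69700,
# assuming the libKMCUDA API; the loop is expected to hang (not raise)
# at the first bad size, so the last printed n is the culprit.
import numpy
from libKMCUDA import kmeans_cuda

data = numpy.random.rand(69700, 256).astype(numpy.float32)
for n in range(69600, 69701):
    print("trying n =", n, flush=True)
    # clusters = n // 10 mirrors the reporter's 60,000 -> 6,000 ratio
    kmeans_cuda(data[:n], n // 10, seed=3, device=3, verbosity=0)
```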
I'm using the Python interface. Ubuntu 16.04, CUDA 10.2.
I compiled it following the instructions in the README, then moved the .so to where I'm using it.
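Since the .so was moved by hand, a quick sanity check (assuming the extension is imported as `libKMCUDA`) is to confirm which copy Python actually loads:

```python
# Confirm which copy of the extension Python actually imports, in case
# a stale libKMCUDA.so shadows the freshly rebuilt one.
import libKMCUDA
print(libKMCUDA.__file__)
```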
It fails for me with 60,000 as well.
Here's the output (still running):
seni@seni-MS-7A32:/work/fun/subword-repr$ CUDA_VISIBLE_DEVICES="0,1" py3 km.py data/repr_nums_small
2020-01-29 20:24:33.629 | INFO | __main__:kmeans:11 - Starting, data shape is (60000, 256).
arguments: 1 0x7ffd7a3368e4 0.001 0.10 0 60000 256 6000 3 0 0 2 0x7ff717f96010 0x7ff7179b9010 0x2e1bae0 0x7ffd7a33690c
reassignments threshold: 60
yinyang groups: 600
GPU #0 memory: used 318242816 bytes (2.7%), free 11403264000 bytes, total 11721506816 bytes
GPU #1 memory: used 318242816 bytes (2.7%), free 11401887744 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1875, 8), (8, 32)>>> 60000, 256, xyswap
Here's nvidia-smi:
seni@seni-MS-7A32:/work/fun/subword-repr$ nvidia-smi
Wed Jan 29 20:31:09 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 1030 Off | 00000000:08:00.0 On | N/A |
| 43% 46C P0 N/A / 30W | 391MiB / 2001MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 31% 55C P2 82W / 250W | 431MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:42:00.0 On | N/A |
| 21% 56C P2 88W / 250W | 431MiB / 11177MiB | 100% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1752 G /usr/lib/xorg/Xorg 222MiB |
| 0 2545 G compiz 148MiB |
| 0 3636 G ...quest-channel-token=9743865575637794024 18MiB |
| 1 116273 C python3.6 291MiB |
| 2 116273 C python3.6 291MiB |
+-----------------------------------------------------------------------------+
Maybe it's because of Exclusive Process?
With verbosity=3:
arguments: 1 0x7ffda37bcbf4 0.001 0.10 0 60000 256 6000 3 0 0 3 0x7fcd46527010 0x7fcd45f4a010 0x1f43050 0x7ffda37bcc1c
reassignments threshold: 60
yinyang groups: 600
[0] *dest: 0x7fcd18000000 - 0x7fcd1ba98000 (61440000)
[1] *dest: 0x7fccfc000000 - 0x7fccffa98000 (61440000)
[0] device_centroids: 0x7fcd07000000 - 0x7fcd075dc000 (6144000)
[1] device_centroids: 0x7fcd07600000 - 0x7fcd07bdc000 (6144000)
[0] device_assignments: 0x7fccffc00000 - 0x7fccffc3a980 (240000)
[1] device_assignments: 0x7fccffe00000 - 0x7fccffe3a980 (240000)
[0] device_assignments_prev: 0x7fccffc3aa00 - 0x7fccffc75380 (240000)
[1] device_assignments_prev: 0x7fccffe3aa00 - 0x7fccffe75380 (240000)
[0] device_ccounts: 0x7fccffc75400 - 0x7fccffc7b1c0 (24000)
[1] device_ccounts: 0x7fccffe75400 - 0x7fccffe7b1c0 (24000)
[0] device_assignments_yy: 0x7fccffc7b200 - 0x7fccffc80fc0 (24000)
[1] device_assignments_yy: 0x7fccffe7b200 - 0x7fccffe80fc0 (24000)
[0] device_bounds_yy: 0x7fccf6000000 - 0x7fccfa4c76c0 (72120000)
[1] device_bounds_yy: 0x7fccf0000000 - 0x7fccf44c76c0 (72120000)
[0] device_drifts_yy: 0x7fccf4600000 - 0x7fccf4be1dc0 (6168000)
[1] device_drifts_yy: 0x7fccf4c00000 - 0x7fccf51e1dc0 (6168000)
[0] device_passed_yy: 0x7fccffc81000 - 0x7fccffc9e4c0 (120000)
[1] device_passed_yy: 0x7fccffe81000 - 0x7fccffe9e4c0 (120000)
[0] device_centroids_yy: 0x7fccffc9e600 - 0x7fccffd34600 (614400)
[1] device_centroids_yy: 0x7fccffe9e600 - 0x7fccfff34600 (614400)
GPU #0 memory: used 318242816 bytes (2.7%), free 11403264000 bytes, total 11721506816 bytes
GPU #1 memory: used 318242816 bytes (2.7%), free 11401887744 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1875, 8), (8, 32)>>> 60000, 256, xyswap
100% GPU utilization means that the code has entered an infinite loop. It is at https://github.com/src-d/kmcuda/blob/master/src/transpose.cu#L30
CUDA 10+ has not been tested yet, so this must be a code-compatibility problem and not a "real" bug. Let's try three things:

1. Remove `volatile` in all the `*.cu` files, then recompile the .so. I observed some miscompilations on the newer CUDA versions and even reported them to NVIDIA, but never had time to convince them that there was a problem. Efficiency will likely degrade, but we are trying to make the code work at all.
2. Is there anything relevant printed in `dmesg`?
3. Run the code with `volatile` under `cuda-memcheck`. It will run ~20x slower, but we should see the problems reported, if any.
I did 1. with `sed -i 's/volatile//g' src/*cu`, recompiled, and ran again; it still hangs.
Sorry, I'm not that experienced with this sort of thing; what do you mean by "anything relevant printed in `dmesg`"? If I run `dmesg` I get lots of output; the only part that is probably related is:
[ 1208.838449] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090000040 flags=0x0020]
[ 1208.838460] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ac139068 flags=0x0020]
[ 1208.838467] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900007c0 flags=0x0020]
[ 1208.838474] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090001040 flags=0x0020]
[ 1208.838481] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090000f40 flags=0x0020]
[ 1208.838488] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900016c0 flags=0x0020]
[ 1208.838495] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090002040 flags=0x0020]
[ 1208.838502] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090001e40 flags=0x0020]
[ 1208.838508] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900025c0 flags=0x0020]
[ 1208.838515] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090003040 flags=0x0020]
[ 1208.838522] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090002d40 flags=0x0020]
[ 1208.838528] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900034c0 flags=0x0020]
[ 1208.838535] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090004040 flags=0x0020]
[ 1208.838542] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090003c40 flags=0x0020]
[ 1208.838548] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900043c0 flags=0x0020]
[ 1208.838554] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090005040 flags=0x0020]
[ 1208.838561] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090004b40 flags=0x0020]
[ 1208.838567] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900052c0 flags=0x0020]
[ 1208.838574] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090006040 flags=0x0020]
[ 1208.838580] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090005a40 flags=0x0020]
Note this is from while the program was hanging on the GPUs. Is that what I was supposed to do?
Also, I don't actually know whether running the Python program under cuda-memcheck will work (my worry is that it only works when invoking an actual binary).
The dmesg log is actually very insightful. `IO_PAGE_FAULT` signals nasty system-bus errors, and I also see `AMD-Vi`. AMD-Vi is AMD's IOMMU (virtualization) kernel driver, which is well known for causing problems with multi-GPU communication, along with Intel's VT-d. Two things to try:

1. Run nccl-tests; `all_reduce` should either fail or hang.
2. Boot with `amd_iommu=off` or `amd_iommu=soft` (add it to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and run `update-grub`, then reboot).
I ran `build/all_reduce_perf` and it worked, as did all the others.
Should I still try step 3?
OK, then let's try `cuda-memcheck`.
Not really getting much output...
seni@seni-MS-7A32:/work/fun/subword-repr$ cuda-memcheck --leak-check full --print-level info py3 km.py data/repr_nums_small
========= CUDA-MEMCHECK
2020-01-30 12:23:43.157 | INFO | __main__:kmeans:11 - Starting, data shape is (67673, 256).
arguments: 1 0x7ffdac337424 0.001 0.05 0 67673 256 33836 3 0 0 3 0x7fc777262010 0x7fc775156010 0x25bfc40 0x7ffdac33744c
reassignments threshold: 67
yinyang groups: 1691
[0] *dest: 0x7fc728000000 - 0x7fc72c216400 (69297152)
[1] *dest: 0x7fc755200000 - 0x7fc759416400 (69297152)
[0] device_centroids: 0x7fc72c400000 - 0x7fc72e50b000 (34648064)
[1] device_centroids: 0x7fc71c000000 - 0x7fc71e10b000 (34648064)
[0] device_assignments: 0x7fc71e200000 - 0x7fc71e242164 (270692)
[1] device_assignments: 0x7fc71e400000 - 0x7fc71e442164 (270692)
[0] device_assignments_prev: 0x7fc71e242200 - 0x7fc71e284364 (270692)
[1] device_assignments_prev: 0x7fc71e442200 - 0x7fc71e484364 (270692)
[0] device_ccounts: 0x7fc71e284400 - 0x7fc71e2a54b0 (135344)
[1] device_ccounts: 0x7fc71e484400 - 0x7fc71e4a54b0 (135344)
[0] device_assignments_yy: 0x7fc71e2a5600 - 0x7fc71e2c66b0 (135344)
[1] device_assignments_yy: 0x7fc71e4a5600 - 0x7fc71e4c66b0 (135344)
[0] device_bounds_yy: 0x7fc70e000000 - 0x7fc71ba665b0 (229008816)
[1] device_bounds_yy: 0x7fc700000000 - 0x7fc70da665b0 (229008816)
[0] device_drifts_yy: 0x7fc6fc000000 - 0x7fc6fe12c0b0 (34783408)
[1] device_drifts_yy: 0x7fc6f8000000 - 0x7fc6fa12c0b0 (34783408)
[0] device_passed_yy: 0x7fc71e2c6800 - 0x7fc71e2e931c (142108)
[1] device_passed_yy: 0x7fc71e4c6800 - 0x7fc71e4e931c (142108)
[0] device_centroids_yy: 0x7fc6fa200000 - 0x7fc6fa3a6c00 (1731584)
[1] device_centroids_yy: 0x7fc6fa400000 - 0x7fc6fa5a6c00 (1731584)
GPU #0 memory: used 544735232 bytes (4.6%), free 11176771584 bytes, total 11721506816 bytes
GPU #1 memory: used 544735232 bytes (4.6%), free 11175395328 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(2115, 8), (8, 32)>>> 67673, 256, xyswap
So there are no memory errors, except that there are memory errors according to `dmesg`.
Then try disabling the IOMMU.