amun leaking onto other GPUs?

mjpost commented 7 years ago

Hi — I have amun running on a single GPU (device 1) with the following command:

$ ps aux | grep amun
mpost    105234 93.7  0.5 88712444 378592 ?     Sl   12:34   1:17 
amun -c validate/config.model.iter1020000.yml -d 1 --gpu-threads 1 --i data/validate.bpe.tr

Its usage, though, spills over to multiple GPUs, as can be seen here from the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    105234    C   amun       78MiB                                      |
|    1    105234    C   amun      749MiB                                      |
+-----------------------------------------------------------------------------+

This results in other jobs dying when they are handed GPU 0 by the cluster manager. Is this supposed to happen?

emjotde commented 7 years ago

That should not be happening and seems to be a regression. @hieuhoang ?

emjotde commented 7 years ago

@mjpost you can use s2s for now.

hieuhoang commented 7 years ago

what's your amun version (git log)? Can I have a look at your file validate/config.model.iter1020000.yml

mjpost commented 7 years ago

I'm pulling the latest version of the code every day. An update, though: this only happens

when I ask for a GPU > 0
when the lower-sequenced GPUs are not already taken.

Here, I asked for GPU 2. It seems to put a little bit on GPU 0, and then the rest where it's supposed to go:

$ nvidia-smi
Mon Jul 24 21:15:49 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 0000:03:00.0     Off |                    0 |
| N/A   29C    P0    63W / 235W |     93MiB / 11471MiB |      2%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 0000:04:00.0     Off |                    0 |
| N/A   29C    P8    20W / 235W |     11MiB / 11471MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          On   | 0000:83:00.0     Off |                    0 |
| N/A   34C    P0    66W / 235W |    525MiB / 11471MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          On   | 0000:84:00.0     Off |                    0 |
| N/A   32C    P8    20W / 235W |     11MiB / 11471MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     21952    C   /home/hltcoe/mpost/code/marian/build/amun       83MiB |
|    2     21952    C   /home/hltcoe/mpost/code/marian/build/amun      514MiB |
+-----------------------------------------------------------------------------+

So I wonder if there is some code somewhere that is just hardcoded to use 0, but doesn't fail if it's not available?

mjpost commented 7 years ago

@hieuhoang here's the config file:

# Paths are relative to config file location
relative-paths: yes
# performance settings
beam-size: 12
normalize: yes
# scorer configuration
scorers:
  F0:
    path: ../model/model.best.npz
    type: Nematus
# scorer weights
weights:
  F0: 1.0
# vocabularies
source-vocab: ../data/train.bpe.de.json
target-vocab: ../data/train.bpe.en.json
debpe: false

And here's the invocation:

Running /home/post/code/marian/build/amun -c test/config.model.best.yml -d 2 --gpu-threads 1 -i test/newstest2016.de-en.bpe.de > test/newstest2016.de-en.model.best.output...

hieuhoang commented 7 years ago

yes, default device = 0 for a thread unless you set it to something different.

Just realised my device 0 is the little GPU that nvidia-smi doesn't support so I haven't been able to see these leaks. I'll see what I can do

hieuhoang commented 7 years ago

does it happen at the start, throughout the run, or only at the end of the run?

hieuhoang commented 7 years ago

I don't seem to be able to replicate the problem. What OS, gcc, nvcc versions are you using? If you are able to share the model, I might have a chance. If you have a aws instance that I can log into, even better

emjotde commented 7 years ago

I wasn't able to reproduce it either on several machines with multiple GPUs.

mjpost commented 7 years ago

Environment:

CentOS Linux 7.3.1611
nvcc 8.0.44
gcc 4.9.3

It doesn't just happen with GPU 0; it seems that any time I skip a GPU, it happens. Here, GPU 2 gets skipped in assignment, and spillage occurs onto GPU 2 from PID 5870. It happens throughout the run.

$ nvidia-smi
Tue Jul 25 09:57:16 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 0000:03:00.0     Off |                    0 |
| N/A   49C    P0   151W / 235W |    776MiB / 11471MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 0000:04:00.0     Off |                    0 |
| N/A   50C    P0   151W / 235W |    776MiB / 11471MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          On   | 0000:83:00.0     Off |                    0 |
| N/A   36C    P0    66W / 235W |     98MiB / 11471MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          On   | 0000:84:00.0     Off |                    0 |
| N/A   53C    P0   156W / 235W |    776MiB / 11471MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5831    C   /home/hltcoe/mpost/code/marian/build/amun      765MiB |
|    1      5868    C   /home/hltcoe/mpost/code/marian/build/amun      765MiB |
|    2      5870    C   /home/hltcoe/mpost/code/marian/build/amun       87MiB |
|    3      5870    C   /home/hltcoe/mpost/code/marian/build/amun      765MiB |
+-----------------------------------------------------------------------------+
$ ps aux | grep amun
mpost     5831 99.7  0.1 193458152 242892 ?    Sl   09:54   3:27 /home/post/code/marian/build/amun -c validate/config.model.iter100000.yml -d 0 -i data/validate.bpe.en
mpost     5868 99.5  0.1 193458152 242512 ?    Sl   09:54   3:27 /home/post/code/marian/build/amun -c validate/config.model.iter160000.yml -d 1 -i data/validate.bpe.en
mpost     5870 99.6  0.3 193591404 399152 ?    Sl   09:54   3:27 /home/post/code/marian/build/amun -c validate/config.model.iter180000.yml -d 3 -i data/validate.bpe.en

I'll try to repeat this in another environment.

hieuhoang commented 7 years ago

If you want a safe way to run it, then set this env variable before running amun export CUDA_VISIBLE_DEVICES=2 Then when running amun, always set -d 0.

However, this bug should be fixed as it's likely to bite us in another place sooner or later. If there's any way you can do it on open data & on a machine & can get access to, it would be best as its likely to be hw & sw dependent

mjpost commented 7 years ago

Setting CUDA_VISIBLE_DEVICES to a single device (e.g., "2") and then using "-d 0" works as expected. However, if I set

CUDA_VISIBLE_DEVICES=0,2

and then run with "-d 1", it leaks onto GPU 0 as before.

emjotde commented 7 years ago

I am seeing this now on one of the WIPO machines, will take a closer looks as soon as I have time.

hieuhoang commented 6 years ago

@mjpost have you played around with amun recently and got any further with this problem?

mjpost commented 6 years ago

No, I haven't touched this. I just switched to marian-decoder.

oraveczcsaba commented 6 years ago

I can also confirm the above problem in a more or less similar 4 GPU environment.

hieuhoang commented 6 years ago

access to 4 GPU environment would be useful to debugging this

hieuhoang commented 6 years ago

I can't seem to reproduce this on a 4 GPU server (M60, cuda 9.2, nvidia 396.24 driver, ubuntu 16.04) with the latest code.

Perhaps the bug has gone away with code changes or driver or compiler updates

Closing this now but please re-open if you are able to reproduce it

marian-nmt / marian

amun leaking onto other GPUs? #102