Closed: mjpost closed this issue 6 years ago
That should not be happening and seems to be a regression. @hieuhoang?
@mjpost you can use s2s for now.
What's your amun version (git log)? Can I have a look at your file validate/config.model.iter1020000.yml?
I'm pulling the latest version of the code every day. An update, though: this only seems to happen with GPU 0.
Here, I asked for GPU 2. It seems to put a little bit on GPU 0, and then the rest where it's supposed to go:
$ nvidia-smi
Mon Jul 24 21:15:49 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20 Driver Version: 375.20 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m On | 0000:03:00.0 Off | 0 |
| N/A 29C P0 63W / 235W | 93MiB / 11471MiB | 2% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40m On | 0000:04:00.0 Off | 0 |
| N/A 29C P8 20W / 235W | 11MiB / 11471MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K40m On | 0000:83:00.0 Off | 0 |
| N/A 34C P0 66W / 235W | 525MiB / 11471MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K40m On | 0000:84:00.0 Off | 0 |
| N/A 32C P8 20W / 235W | 11MiB / 11471MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 21952 C /home/hltcoe/mpost/code/marian/build/amun 83MiB |
| 2 21952 C /home/hltcoe/mpost/code/marian/build/amun 514MiB |
+-----------------------------------------------------------------------------+
So I wonder if there is some code somewhere that is just hardcoded to use device 0, but doesn't fail when that device isn't among the ones requested?
@hieuhoang here's the config file:
# Paths are relative to config file location
relative-paths: yes
# performance settings
beam-size: 12
normalize: yes
# scorer configuration
scorers:
F0:
path: ../model/model.best.npz
type: Nematus
# scorer weights
weights:
F0: 1.0
# vocabularies
source-vocab: ../data/train.bpe.de.json
target-vocab: ../data/train.bpe.en.json
debpe: false
And here's the invocation:
Running /home/post/code/marian/build/amun -c test/config.model.best.yml -d 2 --gpu-threads 1 -i test/newstest2016.de-en.bpe.de > test/newstest2016.de-en.model.best.output...
Yes, the default device for a thread is 0 unless you set it to something different.
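For illustration, here is a minimal standalone CUDA sketch of that behaviour. It is not amun's code; the file name and build line are assumptions, and it assumes a machine with at least three GPUs. A thread that allocates before calling cudaSetDevice ends up creating its context and buffer on device 0, which is the kind of small footprint nvidia-smi shows on GPU 0 above.
// default_device_leak.cu -- minimal sketch (not amun's code) of the suspected failure mode:
// a thread that never calls cudaSetDevice() allocates on device 0.
// Build (assumed): nvcc -std=c++11 default_device_leak.cu -o default_device_leak
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

static void worker(int requestedDevice, bool setDevice) {
    if (setDevice) {
        // Correct behaviour: pin this thread to the requested GPU first.
        cudaSetDevice(requestedDevice);
    }
    // Without the call above, this allocation implicitly initialises a CUDA
    // context on device 0 and the memory lands there.
    void* buf = nullptr;
    cudaMalloc(&buf, 100 * 1024 * 1024);  // 100 MiB
    int dev = -1;
    cudaGetDevice(&dev);
    std::printf("asked for device %d, allocated on device %d\n", requestedDevice, dev);
    cudaFree(buf);
}

int main() {
    std::thread forgetful(worker, 2, false);  // prints "allocated on device 0"
    forgetful.join();
    std::thread correct(worker, 2, true);     // prints "allocated on device 2"
    correct.join();
    return 0;
}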
Just realised my device 0 is the little GPU that nvidia-smi doesn't support, so I haven't been able to see these leaks. I'll see what I can do.
Does it happen at the start, throughout the run, or only at the end of the run?
I don't seem to be able to replicate the problem. What OS, gcc, and nvcc versions are you using? If you are able to share the model, I might have a chance. If you have an AWS instance that I can log into, even better.
I wasn't able to reproduce it either on several machines with multiple GPUs.
Environment:
It doesn't just happen with GPU 0; it seems that any time I skip a GPU, it happens. Here, GPU 2 is the device skipped in the assignments, and PID 5870 (which was given -d 3) spills onto GPU 2. It happens throughout the run.
$ nvidia-smi
Tue Jul 25 09:57:16 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20 Driver Version: 375.20 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m On | 0000:03:00.0 Off | 0 |
| N/A 49C P0 151W / 235W | 776MiB / 11471MiB | 98% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40m On | 0000:04:00.0 Off | 0 |
| N/A 50C P0 151W / 235W | 776MiB / 11471MiB | 98% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K40m On | 0000:83:00.0 Off | 0 |
| N/A 36C P0 66W / 235W | 98MiB / 11471MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K40m On | 0000:84:00.0 Off | 0 |
| N/A 53C P0 156W / 235W | 776MiB / 11471MiB | 98% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 5831 C /home/hltcoe/mpost/code/marian/build/amun 765MiB |
| 1 5868 C /home/hltcoe/mpost/code/marian/build/amun 765MiB |
| 2 5870 C /home/hltcoe/mpost/code/marian/build/amun 87MiB |
| 3 5870 C /home/hltcoe/mpost/code/marian/build/amun 765MiB |
+-----------------------------------------------------------------------------+
$ ps aux | grep amun
mpost 5831 99.7 0.1 193458152 242892 ? Sl 09:54 3:27 /home/post/code/marian/build/amun -c validate/config.model.iter100000.yml -d 0 -i data/validate.bpe.en
mpost 5868 99.5 0.1 193458152 242512 ? Sl 09:54 3:27 /home/post/code/marian/build/amun -c validate/config.model.iter160000.yml -d 1 -i data/validate.bpe.en
mpost 5870 99.6 0.3 193591404 399152 ? Sl 09:54 3:27 /home/post/code/marian/build/amun -c validate/config.model.iter180000.yml -d 3 -i data/validate.bpe.en
I'll try to repeat this in another environment.
If you want a safe way to run it, set this environment variable before running amun:
export CUDA_VISIBLE_DEVICES=2
Then, when running amun, always set -d 0.
However, this bug should be fixed, as it's likely to bite us somewhere else sooner or later. If there's any way you can reproduce it on open data and on a machine I can get access to, that would be best, as it's likely to be hardware- and software-dependent.
Setting CUDA_VISIBLE_DEVICES to a single device (e.g., "2") and then using "-d 0" works as expected. However, if I set
CUDA_VISIBLE_DEVICES=0,2
and then run with "-d 1", it leaks onto GPU 0 as before.
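For reference, here is a small standalone sketch (again not part of amun; file name and build line are assumptions) that prints which devices the CUDA runtime can see under the current CUDA_VISIBLE_DEVICES mask. With CUDA_VISIBLE_DEVICES=0,2 it should report two devices, where logical 0 is physical GPU 0 and logical 1 is physical GPU 2, which is consistent with -d 1 running on the right card while a stray default-device allocation still lands on physical GPU 0.
// list_visible_devices.cu -- prints the devices the CUDA runtime can see.
// Build (assumed): nvcc list_visible_devices.cu -o list_visible_devices
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    std::printf("visible devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // pciBusID lets you match a logical index back to the Bus-Id
        // column in nvidia-smi.
        std::printf("logical %d -> %s, PCI bus %d\n", i, prop.name, prop.pciBusID);
    }
    return 0;
}
Running it once with the mask set and once without makes the remapping visible.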
I am seeing this now on one of the WIPO machines; I will take a closer look as soon as I have time.
@mjpost have you played around with amun recently and got any further with this problem?
No, I haven't touched this. I just switched to marian-decoder.
I can also confirm the above problem in a more or less similar 4-GPU environment.
Access to a 4-GPU environment would be useful for debugging this.
I can't seem to reproduce this on a 4-GPU server (M60, CUDA 9.2, NVIDIA 396.24 driver, Ubuntu 16.04) with the latest code.
Perhaps the bug has gone away with code changes or driver or compiler updates.
Closing this now, but please re-open if you are able to reproduce it.
Hi — I have amun running on a single GPU (device 1) with the following command:
Its usage, though, spills over onto multiple GPUs, as can be seen here from the output of nvidia-smi. This results in other jobs dying when they are handed GPU 0 by the cluster manager. Is this supposed to happen?