marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Training Optimization Question #992

Open woolz opened 1 year ago

woolz commented 1 year ago

I'm currently training a large model (114M sentences) on 2 GPUs, but `nvidia-smi` shows a problem with GPU parallelism during training.


The second GPU stays at 98-99% utilization the whole time, which is fine. The first GPU's utilization fluctuates, though: sometimes it is at 11%, other times at 45%, 95%, etc.

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060         Off| 00000000:01:00.0 Off |                  N/A |
| 39%   56C    P2               55W / 170W|   11343MiB / 12288MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060         Off| 00000000:02:00.0 Off |                  N/A |
| 37%   43C    P2               48W / 170W|   11343MiB / 12288MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
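The snapshot above captures only one instant, so it may help to sample utilization over a longer window. The following is a minimal sketch, not part of Marian: the helper names `parse_samples`, `summarize`, and `collect`, as well as the 30-second window, are assumptions chosen for illustration. It relies only on the `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits` query.

```python
import csv
import io
import statistics
import subprocess
import time
from collections import defaultdict

# Standard nvidia-smi query: one "index, utilization" CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_samples(text):
    """Parse 'index, utilization' CSV rows into {gpu_index: [util, ...]}."""
    samples = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        try:
            index = int(row[0].strip())
            util = float(row[1].strip())
        except ValueError:
            continue  # e.g. a value the driver reports as not available
        samples[index].append(util)
    return dict(samples)


def summarize(samples):
    """Return min/mean/max utilization per GPU."""
    return {
        gpu: {"min": min(vals), "mean": statistics.mean(vals), "max": max(vals)}
        for gpu, vals in samples.items()
    }


def collect(seconds=30, interval=1.0):
    """Hypothetical helper: poll nvidia-smi every `interval` seconds."""
    chunks = []
    for _ in range(int(seconds / interval)):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        chunks.append(result.stdout)
        time.sleep(interval)
    return "".join(chunks)


if __name__ == "__main__":
    stats = summarize(parse_samples(collect()))
    for gpu, s in sorted(stats.items()):
        print(f"GPU {gpu}: min={s['min']:.0f}% mean={s['mean']:.1f}% max={s['max']:.0f}%")
```

Running this while training is in progress would show whether GPU 0 is consistently underused on average or only dips briefly between batches.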

Is this normal, or is there a build/configuration problem? The full training log is attached: train-marian.txt

[2023-05-18 03:17:45] Using synchronous SGD
[2023-05-18 03:17:57] [training] Batches are processed as 1 process(es) x 2 devices/process