Open woolz opened 1 year ago
I'm currently training a large model (114M sentences) on 2 GPUs, but nvidia-smi shows what looks like a GPU parallelism problem during training.
The second GPU stays at 98-99% utilization the whole time, which is fine. But the first GPU's utilization fluctuates: sometimes 11%, other times 45%, 95%, etc.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060         Off| 00000000:01:00.0 Off |                  N/A |
| 39%   56C    P2               55W / 170W|  11343MiB / 12288MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060         Off| 00000000:02:00.0 Off |                  N/A |
| 37%   43C    P2               48W / 170W|  11343MiB / 12288MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
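A single nvidia-smi snapshot only shows one instant, so the fluctuation is easier to judge from a series of samples. The following is a minimal sketch, not part of the original report: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu=index,utilization.gpu --format=csv,noheader,nounits` query mode. The helper names `parse_samples`, `summarize`, and `sample` are hypothetical, chosen only for illustration.

```python
import statistics
import subprocess
import time
from collections import defaultdict

# Query mode of nvidia-smi: prints one line per GPU, e.g. "0, 11".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_samples(text):
    """Parse lines like '0, 11' into {gpu_index: [utilization, ...]}."""
    per_gpu = defaultdict(list)
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip blank or unexpected lines (for example "[N/A]" values).
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue
        per_gpu[int(parts[0])].append(int(parts[1]))
    return dict(per_gpu)


def summarize(per_gpu):
    """Return min, max, and mean utilization for each GPU."""
    return {
        gpu: {"min": min(v), "max": max(v), "mean": statistics.mean(v)}
        for gpu, v in per_gpu.items()
    }


def sample(seconds=30):
    """Hypothetical helper: poll nvidia-smi once per second."""
    chunks = []
    for _ in range(seconds):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        chunks.append(result.stdout)
        time.sleep(1)
    return "".join(chunks)


if __name__ == "__main__":
    for gpu, stats in sorted(summarize(parse_samples(sample())).items()):
        print(gpu, stats)
```

Running this while training is in progress would give a per-GPU min/max/mean over the sampling window, which is more informative to attach to a report than one snapshot.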
Is this normal, or is there a build/configuration problem? Log attached: train-marian.txt
[2023-05-18 03:17:45] Using synchronous SGD
[2023-05-18 03:17:57] [training] Batches are processed as 1 process(es) x 2 devices/process