The training stops after 10000 updates, and the process then runs for days without making any progress. With a validation frequency of 5000 I can see the output of the validation step, but afterwards training seems to halt, with the process fully using a single CPU core.

This is the output of top:
top - 14:41:49 up 4 days, 18:03,  1 user,  load average: 1.00, 1.02, 1.00
Tasks: 203 total,   1 running, 202 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.1 us,  0.2 sy,  0.0 ni, 74.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15718.1 total,    228.6 free,   8498.2 used,   6991.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   6880.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 302151 ubuntu    20   0   25.9g   7.8g 147460 S  99.7  51.0 130:38.85 marian
   1475 ubuntu    20   0   13744   9764   2764 S   0.7   0.1   7:59.81 tmux: server
 302792 root     -51   0       0      0      0 S   0.3   0.0   0:31.97 irq/42-nvidia
 302797 root      20   0       0      0      0 S   0.3   0.0   0:14.71 nv_queue
      1 root      20   0  169004  10380   5768 S   0.0   0.1   0:23.19 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.04 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblo+
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
I'm training a transformer model on a corpus of 30M sentences with the following command line parameters:

My configuration is:

I'm also attaching the train.log.

Any idea what I could be doing wrong? Thanks in advance for your help.