thu-spmi / CAT

A CRF-based ASR Toolkit
Apache License 2.0

How to rerun test step? #59

Closed by Sar-Dar 2 years ago

Sar-Dar commented 2 years ago

When I run this script https://github.com/thu-spmi/CAT/blob/15ed6f22b31f76f77c1349d32b824b92b1667629/egs/commonvoice/run_mc.sh#L235-L246 the training step succeeds, but the testing step does not:

Test: [5260/5412]       Time  0.573 ( 1.062)    Data  0.001 ( 0.551)    Loss_real 2.5580e+01 (2.1091e+01)
Test: [5270/5412]       Time  0.575 ( 1.061)    Data  0.002 ( 0.550)    Loss_real 2.1622e+01 (2.1092e+01)
Test: [5280/5412]       Time  0.581 ( 1.060)    Data  0.002 ( 0.549)    Loss_real 2.6419e+01 (2.1094e+01)
Test: [5290/5412]       Time  0.481 ( 1.059)    Data  0.001 ( 0.548)    Loss_real 2.5244e+01 (2.1089e+01)
Test: [5300/5412]       Time  0.542 ( 1.058)    Data  0.002 ( 0.547)    Loss_real 2.9388e+01 (2.1091e+01)
Test: [5310/5412]       Time  0.549 ( 1.057)    Data  0.001 ( 0.545)    Loss_real 1.4099e+01 (2.1093e+01)
Test: [5320/5412]       Time  0.510 ( 1.056)    Data  0.000 ( 0.544)    Loss_real 3.1891e+01 (2.1092e+01)
Test: [5330/5412]       Time  0.541 ( 1.055)    Data  0.000 ( 0.543)    Loss_real 1.4288e+01 (2.1090e+01)
Test: [5340/5412]       Time  0.508 ( 1.054)    Data  0.001 ( 0.542)    Loss_real 2.1985e+01 (2.1087e+01)
Test: [5350/5412]       Time  0.469 ( 1.052)    Data  0.001 ( 0.541)    Loss_real 2.3008e+01 (2.1091e+01)
Test: [5360/5412]       Time  0.436 ( 1.051)    Data  0.002 ( 0.540)    Loss_real 2.0166e+01 (2.1095e+01)
Test: [5370/5412]       Time  0.527 ( 1.050)    Data  0.001 ( 0.539)    Loss_real 2.4653e+01 (2.1095e+01)
Test: [5380/5412]       Time  0.434 ( 1.049)    Data  0.001 ( 0.538)    Loss_real 2.5790e+01 (2.1101e+01)
Test: [5390/5412]       Time  0.512 ( 1.048)    Data  0.002 ( 0.537)    Loss_real 1.9989e+01 (2.1098e+01)
Test: [5400/5412]       Time  0.558 ( 1.047)    Data  0.001 ( 0.536)    Loss_real 3.2615e+01 (2.1102e+01)
Test: [5410/5412]       Time  0.556 ( 1.046)    Data  0.018 ( 0.535)    Loss_real 1.0644e+01 (2.1104e+01)
Epoch: [2@0] | best=21.11 | current=21.11 | worse_count=0 | lr=1.00e-04
> Monitor figure saved at exp/mc_flatphone/monitor.png
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/ssdhome/sardar321/anaconda3/envs/torch/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/ssdhome/sardar321/anaconda3/envs/torch/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
MemoryError
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/ssdhome/sardar321/anaconda3/envs/torch/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/ssdhome/sardar321/anaconda3/envs/torch/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
MemoryError

So how can I rerun just the testing step?

maxwellzh commented 2 years ago

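According to the log, the monitor.png figure was saved, which means the current epoch finished. So this is probably not an error in testing, but something going wrong at the startup of the next epoch. The MemoryError in the traceback is raised inside multiprocessing/spawn.py, i.e. while a new process is being started.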

There should be a checkpoint xxx.pt in the folder exp/mc_flatphone/. You could add this argument to the python3 ctc-crf/train.py command

--resume=exp/mc_flatphone/xxx.pt

to continue from the checkpoint.
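
If it is unclear which checkpoint to pass, a small helper can pick one and assemble the command. The following is only a sketch under assumptions that are not confirmed in this thread: it treats the most recently modified .pt file in the experiment folder as the one to resume from, and the function names `latest_checkpoint` and `resume_command` are hypothetical. The remaining arguments of train.py must be copied from your own run_mc.sh.

```python
from pathlib import Path
from typing import List, Optional


def latest_checkpoint(exp_dir: str) -> Optional[Path]:
    """Return the most recently modified *.pt file in exp_dir, or None.

    Assumption: the newest file by modification time is the checkpoint
    to resume from. The actual checkpoint naming scheme is not given in
    this thread (it appears only as xxx.pt).
    """
    candidates = sorted(Path(exp_dir).glob("*.pt"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def resume_command(exp_dir: str, base_args: List[str]) -> List[str]:
    """Build the train.py command line with --resume appended.

    base_args are the arguments already used for training; they are
    passed through unchanged.
    """
    ckpt = latest_checkpoint(exp_dir)
    if ckpt is None:
        raise FileNotFoundError(f"no .pt checkpoint found in {exp_dir}")
    return ["python3", "ctc-crf/train.py", *base_args, f"--resume={ckpt}"]
```

For example, `resume_command("exp/mc_flatphone", original_args)` returns the argument list with `--resume=...` as its last element, where `original_args` stands for whatever train.py arguments your script already passes. The list can be printed for inspection or handed to `subprocess.run`.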

Sar-Dar commented 2 years ago

The problem is solved. Thanks for your kindness and support.