Closed dantmnf closed 2 years ago
That's a good question. Normally it is good that make deletes a target when it is interrupted. That is reasonable for partially generated files such as compiler or linker output. It would also be reasonable for a checkpoint file if the interrupt occurred while the file was being written, but in most cases a valid checkpoint file gets deleted, which is indeed undesirable.
It is the <model_base>_checkpoint file that gets deleted. This is the latest model file, along with the backup models to be used if training runs into divergence. Without it, training cannot be restarted.
Two recent cases:
Iteration 256706: GROUND TRUTH : ꦏꦢꦢꦺꦪꦤ꧀ ꦲꦶꦁ ꦔ꧀ꦒꦧꦸꦁ ꦏꦭꦶꦪꦤ꧀ = ꦗꦺꦤꦺꦁ ꦤꦺꦴꦩꦺꦂ ꦱꦶꦗꦶ ꦭꦤ꧀ ꦔꦤ꧀ꦠꦶ ꦛꦺ ꦌ ꦔ꧀ ꦤꦤ
Iteration 256706: ALIGNED TRUTH : ꦏꦢꦢꦺꦪꦤ꧀ ꦲꦶꦁꦔ꧀ꦒꦧꦸꦁ ꦏꦭꦶꦪꦤ꧀ = ꦗꦺꦤꦺꦁ ꦤꦺꦴꦩꦺꦂ ꦱꦶꦗꦶ ꦭꦤ꧀ ꦔꦤ꧀ꦠꦶ ꦛꦺ ꦌ ꦔ꧀ ꦤꦤ
Iteration 256706: BEST OCR TEXT : ꦏꦢꦢꦺꦪꦤ꧀ ꦲꦶꦁ ꦔ꧀ꦭꦧꦸꦁ ꦏꦭꦶꦪꦤ꧀ = ꦗꦺꦤꦺꦁ ꦤꦺꦴꦩꦺꦂ ꦱꦶꦗꦶ ꦭꦤ꧀ ꦔꦤ꧀ꦠꦶ ꦛꦺ ꦌ ꦥ꧀ ꦤꦤ
File /tmp/jav_java-2022-01-10ol1c1pg1/jav_java.CARAKAN_JAWA_Semi-Expanded.exp0.lstmf line 2011 :
Mean rms=0.569%, delta=1.751%, train=5.27%(17.62%), skip ratio=2.1%
Iteration 256707: GROUND TRUTH : NAME="COLORFUL > DHEWEKE NGALAHKE SALAH Rosta DIMUNGSUH. tembang JAM JILID
File /tmp/jav-2022-01-10rb8nsh5e/jav.Trebuchet_MS_Bold.exp0.lstmf line 1216 (Perfect):
make: *** Deleting file 'data/JAV/checkpoints/JAV_checkpoint'
make: *** [Makefile:297: data/JAV/checkpoints/JAV_checkpoint] Terminated
tail -f /home/ubuntu/tesstrain/plot/TXT2IMG.LOG
UpdateSubtrainer:Sub:At iteration 285716/1315900/1315902, Mean rms=0.154000%, delta=0.131000%, BCER train=0.383000%, BWER train=1.340000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 285726/1316000/1316002, Mean rms=0.151000%, delta=0.125000%, BCER train=0.374000%, BWER train=1.303000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 285741/131610
UpdateSubtrainer:Sub:At iteration 297221/1396000/1396002, Mean rms=0.153000%, delta=0.127000%, BCER train=0.412000%, BWER train=1.491000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297241/1396100/1396102, Mean rms=0.155000%, delta=0.132000%, BCER train=0.434000%, BWER train=1.571000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297256/1396200/1396202, Mean rms=0.154000%, delta=0.125000%, BCER train=0.414000%, BWER train=1.521000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297267/1396300/1396302, Mean rms=0.152000%, delta=0.119000%, BCER train=0.395000%, BWER train=1.402000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297281/1396400/1396402, Mean rms=0.152000%, delta=0.122000%, BCER train=0.406000%, BWER train=1.387000%, skip ratio=0.000000%,
make: *** Deleting file 'data/TXT2IMG/checkpoints/TXT2IMG_checkpoint'
Makefile:294: recipe for target 'data/TXT2IMG/checkpoints/TXT2IMG_checkpoint' failed
make: *** [data/TXT2IMG/checkpoints/TXT2IMG_checkpoint] Terminated
Appending || true to every lstmtraining command makes the build resilient to crashes, but not to interrupts. Any idea how to stop make from deleting the checkpoint?
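One thing that may be worth trying: GNU make has a special target, .PRECIOUS, whose prerequisites are not deleted when make is killed or interrupted while their recipe is running. Below is a minimal sketch of how that could look. The variable names MODEL_NAME, OUTPUT_DIR and CHECKPOINT are assumptions for illustration and are not copied from the tesstrain Makefile; only the path layout (data/JAV/checkpoints/JAV_checkpoint) comes from the logs above.

```makefile
# Hypothetical fragment, meant to sit next to the existing checkpoint rule.
# MODEL_NAME, OUTPUT_DIR and CHECKPOINT are assumed names, not taken from tesstrain.
MODEL_NAME ?= JAV
OUTPUT_DIR ?= data/$(MODEL_NAME)
CHECKPOINT := $(OUTPUT_DIR)/checkpoints/$(MODEL_NAME)_checkpoint

# Prerequisites of .PRECIOUS are kept when make is killed or interrupted
# during their recipe, instead of being deleted.
.PRECIOUS: $(CHECKPOINT)
```

The trade-off is the one mentioned earlier in this thread: if the interrupt arrives while the checkpoint is actually being written, a truncated file would be kept as well, so a backup copy may still be worth having.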