tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

make deletes checkpoint file on crash/interrupt #299

Closed dantmnf closed 2 years ago

dantmnf commented 2 years ago

At iteration 496/500/500, Mean rms=0.858000%, delta=4.249000%, BCER train=13.167000%, BWER train=98.083000%, skip ratio=0.000000%,  New worst BCER = 13.167000 wrote checkpoint.

At iteration 595/600/600, Mean rms=0.841000%, delta=4.030000%, BCER train=12.671000%, BWER train=98.153000%, skip ratio=0.000000%,  New worst BCER = 12.671000 wrote checkpoint.

^Cmake: *** Deleting file 'data/chi_sim.shs_medium/checkpoints/chi_sim.shs_medium_checkpoint'
make: *** [Makefile:278: data/chi_sim.shs_medium/checkpoints/chi_sim.shs_medium_checkpoint] Interrupt

Appending || true to every lstmtraining command makes the build resilient to crashes, but not to interrupts.

Any ideas on how to stop make from deleting the checkpoint?

stweil commented 2 years ago

That's a good question. Normally it is good that make deletes a target when it is interrupted. That is reasonable for partially generated files like compiler or linker output. It would also be reasonable for a checkpoint file if the interrupt occurred while the file was being written, but in most cases a valid checkpoint file is deleted, which is indeed undesirable.
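For reference, GNU make has a special target, .PRECIOUS, for this situation: a target listed as its prerequisite is not deleted when make is killed or interrupted while running that target's recipe. The fragment below is only a minimal sketch of how that could look. The variable names and the stand-in recipe are assumptions for illustration and are not taken from the tesstrain Makefile.

```makefile
# Hypothetical names, chosen for illustration; adapt them to the real Makefile.
MODEL_NAME ?= foo
OUTPUT_DIR ?= data/$(MODEL_NAME)
CHECKPOINT := $(OUTPUT_DIR)/checkpoints/$(MODEL_NAME)_checkpoint

# Prerequisites of .PRECIOUS are kept if make is killed or interrupted
# while their recipe is running.
.PRECIOUS: $(CHECKPOINT)

# Stand-in recipe: the real rule would run lstmtraining here.
# Interrupting this with Ctrl-C should leave the file in place.
$(CHECKPOINT):
	mkdir -p $(dir $@)
	touch $@
	sleep 60
```

This does not cover the case mentioned above where the interrupt arrives while the checkpoint is being written. A file kept by .PRECIOUS may then be incomplete.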

Shreeshrii commented 2 years ago

It is the <model_base>_checkpoint file that gets deleted. This file holds the latest model along with the backup models that are used if training diverges, so once it is deleted, training cannot be resumed.

Two recent cases:

Iteration 256706: GROUND  TRUTH : ꦏꦢꦢꦺꦪꦤ꧀ ꦲꦶꦁ ꦔ꧀ꦒꦧꦸꦁ ꦏꦭꦶꦪꦤ꧀ = ꦗꦺꦤꦺꦁ ꦤꦺꦴꦩꦺꦂ ꦱꦶꦗꦶ ꦭꦤ꧀ ꦔꦤ꧀ꦠꦶ ꦛꦺ ꦌ ꦔ꧀ ꦤꦤ
Iteration 256706: ALIGNED TRUTH : ꦏꦢꦢꦺꦪꦤ꧀ ꦲꦶꦁꦔ꧀ꦒꦧꦸꦁ ꦏꦭꦶꦪꦤ꧀ = ꦗꦺꦤꦺꦁ ꦤꦺꦴꦩꦺꦂ ꦱꦶꦗꦶ ꦭꦤ꧀ ꦔꦤ꧀ꦠꦶ ꦛꦺ ꦌ ꦔ꧀ ꦤꦤ
Iteration 256706: BEST OCR TEXT : ꦏꦢꦢꦺꦪꦤ꧀ ꦲꦶꦁ ꦔ꧀ꦭꦧꦸꦁ ꦏꦭꦶꦪꦤ꧀ = ꦗꦺꦤꦺꦁ ꦤꦺꦴꦩꦺꦂ ꦱꦶꦗꦶ ꦭꦤ꧀ ꦔꦤ꧀ꦠꦶ ꦛꦺ ꦌ ꦥ꧀ ꦤꦤ
File /tmp/jav_java-2022-01-10ol1c1pg1/jav_java.CARAKAN_JAWA_Semi-Expanded.exp0.lstmf line 2011 :
Mean rms=0.569%, delta=1.751%, train=5.27%(17.62%), skip ratio=2.1%
Iteration 256707: GROUND  TRUTH : NAME="COLORFUL > DHEWEKE NGALAHKE SALAH Rosta DIMUNGSUH. tembang JAM JILID
File /tmp/jav-2022-01-10rb8nsh5e/jav.Trebuchet_MS_Bold.exp0.lstmf line 1216 (Perfect):
make: *** Deleting file 'data/JAV/checkpoints/JAV_checkpoint'
make: *** [Makefile:297: data/JAV/checkpoints/JAV_checkpoint] Terminated

tail -f /home/ubuntu/tesstrain/plot/TXT2IMG.LOG
UpdateSubtrainer:Sub:At iteration 285716/1315900/1315902, Mean rms=0.154000%, delta=0.131000%, BCER train=0.383000%, BWER train=1.340000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 285726/1316000/1316002, Mean rms=0.151000%, delta=0.125000%, BCER train=0.374000%, BWER train=1.303000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 285741/131610UpdateSubtrainer:Sub:At iteration 297221/1396000/1396002, Mean rms=0.153000%, delta=0.127000%, BCER train=0.412000%, BWER train=1.491000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297241/1396100/1396102, Mean rms=0.155000%, delta=0.132000%, BCER train=0.434000%, BWER train=1.571000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297256/1396200/1396202, Mean rms=0.154000%, delta=0.125000%, BCER train=0.414000%, BWER train=1.521000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297267/1396300/1396302, Mean rms=0.152000%, delta=0.119000%, BCER train=0.395000%, BWER train=1.402000%, skip ratio=0.000000%,
UpdateSubtrainer:Sub:At iteration 297281/1396400/1396402, Mean rms=0.152000%, delta=0.122000%, BCER train=0.406000%, BWER train=1.387000%, skip ratio=0.000000%,
make: *** Deleting file 'data/TXT2IMG/checkpoints/TXT2IMG_checkpoint'
Makefile:294: recipe for target 'data/TXT2IMG/checkpoints/TXT2IMG_checkpoint' failed
make: *** [data/TXT2IMG/checkpoints/TXT2IMG_checkpoint] Terminated

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.