tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Move tesstrain.py from tesseract and modify for single lines #229

Closed Shreeshrii closed 3 years ago

Shreeshrii commented 3 years ago

Ref: https://github.com/tesseract-ocr/tesseract/issues/3197#issuecomment-764512848

The following modifications to tesstrain_utils.py create single-line files.

If the Python scripts are moved or copied here from the tesseract repo, I can make a PR.

20d19
< import io
28d26

>         f"--outputbase={outbase}",
343,361c343,351
<     with io.open(ctx.training_text, "r", encoding='utf-8') as gtText:
<         for count, line in enumerate(gtText):
<             if count > ctx.max_pages:
<                 break;
<             gtoutputbase=(str(outbase) + "." + str(count))
<             gtline=(str(gtoutputbase) + ".gt.txt")
<             gtFile = open(gtline, 'w', encoding='utf-8')
<             print(line, file=gtFile)
<             gtFile.close()
<             log.info(f"Running text2image on {gtline}")
<             run_command(
<                     "text2image",
<                     *common_args,
<                     f"--font={font}",
<                     f"--text={gtline}",
<                     f"--outputbase={gtoutputbase}",
<                     *ctx.text2image_extra_args,
<                 )
<             check_file_readable(str(gtoutputbase) + ".box", str(gtoutputbase) + ".tif")
---
>     run_command(
>         "text2image",
>         *common_args,
>         f"--font={font}",
>         f"--text={ctx.training_text}",
>         *ctx.text2image_extra_args,
>     )
> 
>     check_file_readable(str(outbase) + ".box", str(outbase) + ".tif")
421a412,416
>         # Check that each process was successful.
>         for font in ctx.fonts:
>             fontname = make_fontname(font)
>             outbase = make_outbase(ctx, fontname, exposure)
>             check_file_readable(str(outbase) + ".box", str(outbase) + ".tif")
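
For readers who want to try the per-line split outside tesstrain_utils.py, here is a minimal standalone sketch of the idea in the modified loop: write one `.gt.txt` file per line of the training text, so that text2image can then be run once per file. The function name `write_single_line_gt` and the `max_lines` parameter are hypothetical and not part of tesstrain_utils.py. Two details differ from the diff on purpose: `islice` stops after exactly `max_lines` lines, whereas the `count > ctx.max_pages` test lets one extra line through (the run with `--maxpages 10` in this comment renders files 0 through 10), and the line's own newline is stripped before writing so the file does not end in an empty line.

```python
from itertools import islice
from pathlib import Path


def write_single_line_gt(training_text, outbase, max_lines):
    """Write one <outbase>.<n>.gt.txt file per line; return the per-line output bases."""
    bases = []
    with open(training_text, "r", encoding="utf-8") as src:
        # islice yields at most max_lines lines, so no explicit break is needed.
        for count, line in enumerate(islice(src, max_lines)):
            base = Path(f"{outbase}.{count}")
            gt_path = Path(f"{base}.gt.txt")
            # Drop the newline read from the source and write exactly one back.
            gt_path.write_text(line.rstrip("\n") + "\n", encoding="utf-8")
            bases.append(base)
    return bases
```

Each returned base could then be passed to text2image as `--outputbase`, with the matching `.gt.txt` file as `--text`, as the modified loop does.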

TESSTRAIN_FONT=Arial
TESSTRAIN_LANG=eng
TESSTRAIN_MAX_PAGES=10
TESSTRAIN_MAX_ITERATIONS=100
TESSDATA_PREFIX=$HOME/tessdata_best
BASEDIR=$HOME
NEWLANG=engnew
NEWLANGTEXT=$BASEDIR/langdata/$TESSTRAIN_LANG/eng.training_text 

# cleanup previous training data and output
rm -rf $BASEDIR/train $BASEDIR/output
mkdir -p $BASEDIR/train $BASEDIR/output

# generate new training data
python ./tesstrain.py \
 --fonts_dir $BASEDIR/.fonts \
 --fontlist "$TESSTRAIN_FONT" \
 --lang $TESSTRAIN_LANG \
 --linedata_only \
 --noextract_font_properties \
 --exposures "0"    \
 --langdata_dir $BASEDIR/langdata_lstm \
 --training_text $NEWLANGTEXT \
 --tessdata_dir $TESSDATA_PREFIX \
 --save_box_tiff \
 --maxpages $TESSTRAIN_MAX_PAGES \
 --output_dir $BASEDIR/train
 python ./tesstrain.py --fonts_dir /home/ubuntu/.fonts --fontlist Arial --lang eng --linedata_only --noextract_font_properties --exposures 0 --langdata_dir /home/ubuntu/langdata_lstm --training_text /home/ubuntu/langdata/eng/eng.training_text --tessdata_dir /home/ubuntu/tessdata_best --save_box_tiff --maxpages 10 --output_dir /home/ubuntu/train
[09:58:18] INFO - Log file location: /tmp/eng-2021-01-21ocyx6n6r/tesstrain.log
[09:58:18] INFO - === Starting training for language eng
[09:58:18] INFO - Testing font: Arial
[09:58:31] INFO - === Phase I: Generating training images ===
  0%|          | 0/1 [00:00<?, ?it/s]
[09:58:31] INFO - Rendering using Arial
[09:58:31] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.0.gt.txt
[09:58:36] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.1.gt.txt
[09:58:42] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.2.gt.txt
[09:58:47] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.3.gt.txt
[09:58:53] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.4.gt.txt
[09:58:58] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.5.gt.txt
[09:59:04] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.6.gt.txt
[09:59:10] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.7.gt.txt
[09:59:15] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.8.gt.txt
[09:59:21] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.9.gt.txt
[09:59:26] INFO - Running text2image on /tmp/eng-2021-01-21ocyx6n6r/eng.Arial.exp0.10.gt.txt
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:01<00:00, 61.40s/it]
[09:59:32] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[09:59:39] INFO - === Phase E: Generating lstmf files ===
[09:59:39] INFO - Using TESSDATA_PREFIX=/home/ubuntu/tessdata_best
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:27<00:00,  2.47s/it]
[10:00:06] INFO - === Constructing LSTM training data ===
[10:01:57] INFO - === Saving box/tiff pairs for training data ===
[10:01:57] INFO - === Moving lstmf files for training data ===
[10:01:57] INFO - All done!
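
After a run like this one, it can be useful to confirm that every rendered line has all of its companion files, similar in spirit to the `check_file_readable` calls in the diff. The following is a rough sketch only: the function name `find_incomplete_lines` is hypothetical, and the default suffix set (`.gt.txt`, `.box`, `.tif`) is an assumption based on the file names in the diff and the log, so it may need adjusting for the directory being checked.

```python
from pathlib import Path


def find_incomplete_lines(directory, suffixes=(".gt.txt", ".box", ".tif")):
    """Return {base name: [missing suffixes]} for samples lacking a companion file."""
    found = {}
    for path in Path(directory).iterdir():
        for suffix in suffixes:
            if path.name.endswith(suffix):
                # Group files by their name without the recognised suffix.
                found.setdefault(path.name[: -len(suffix)], set()).add(suffix)
    return {
        base: [s for s in suffixes if s not in present]
        for base, present in sorted(found.items())
        if len(present) < len(suffixes)
    }
```

An empty result means every base name found in the directory has all of the listed files; files with other extensions are ignored.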

wrznr commented 3 years ago

@Shreeshrii @stweil That's a great plan. How can we proceed?

stweil commented 3 years ago

I have now copied tesseract/src/training/*.py to tesstrain/src/training/*.py (including the commit history).

The export was done in the tesseract repository using this command:

git log --pretty=email --patch-with-stat --reverse --full-index --binary -- src/training/*.py >training_script.patch

Now the next steps can follow:

  1. Remove those files from the tesseract repository.
  2. Optionally move the files to a new place in the tesstrain repository.
  3. Add new modifications.
  4. Update documentation.
  5. Remove training scripts from the tesseract repository.

Shreeshrii commented 3 years ago

Thanks @stweil.

> Optionally move the files to a new place in the tesstrain repository.

Should these be moved to the root directory, like the other scripts?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.