tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Training with python: run training step ? #351

Open forzagreen opened 9 months ago

forzagreen commented 9 months ago

As mentioned by @stefan6419846 in https://github.com/madmaze/pytesseract/issues/508, there is a Python wrapper for training in tesstrain/src/, which unfortunately is not documented in the tesseract, tessdoc, or tesstrain repositories.

From my understanding (please correct me if I'm wrong):

  1. It only generates lstmf files and does not perform any training. Of the steps listed in Overview of Training Process, it stops at step 5; steps 6 and 7 must be done separately. Is that correct?

  2. How do I perform steps 6 and 7? With Makefile commands? If you give me some pointers, I can help add these steps to the Python script.

  3. The Python script takes a TEXTFILE and generates (for each font) box/tif/lstmf files for the whole text, not line by line. So, to generate them line by line, must we run the script once for each one-line file?

Thanks in advance!

Cc: @stefan6419846

stefan6419846 commented 9 months ago

The tesstrain Python module basically creates artificial training data, for example for finetuning with a specific font. You might find some existing examples that use the old tesstrain.sh script, which should be roughly equivalent to the Python module. The Makefile approach is for "real" data only.

Rough steps for the Python module:

  1. Extract LSTM file: combine_tessdata -e tessdata/eng.traineddata eng.lstm
  2. Generate files:

    import tesstrain

    # font_name, maximum_pages and the *_directory values are placeholders to set yourself.
    tesstrain.run(
        fonts_directory=fonts_directory,
        fonts=[font_name],
        language_code='eng',
        linedata_only=True,
        langdata_directory=language_data_directory,
        tessdata_directory=tessdata_directory,
        save_box_tiff=True,
        maximum_pages=maximum_pages,
        output_directory=output_directory
    )
  3. Finetune: lstmtraining --continue_from eng.lstm --model_output font_name --traineddata tessdata/eng.traineddata --train_listfile eng.training_files.txt --max_iterations 10
  4. Convert to .traineddata file: lstmtraining --stop_training --continue_from font_name_checkpoint --traineddata tessdata/eng.traineddata --model_output target_path
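
The command-line steps above (1, 3 and 4) could be driven from Python roughly as follows. This is only a sketch: the helper names (extract_lstm_cmd, finetune_cmd, convert_cmd, build_commands, run_steps) are made up for illustration and are not part of tesstrain, and the paths are placeholders. It uses only the flags shown in the steps above and assumes combine_tessdata and lstmtraining are on the PATH. Step 2 (tesstrain.run) would go between the first and second command.

```python
import shlex
import subprocess


def extract_lstm_cmd(traineddata, lstm_path):
    # Step 1: extract the LSTM model from an existing .traineddata file.
    return ["combine_tessdata", "-e", traineddata, lstm_path]


def finetune_cmd(lstm_path, traineddata, listfile, model_output, max_iterations=10):
    # Step 3: continue training from the extracted model on the generated lstmf files.
    return [
        "lstmtraining",
        "--continue_from", lstm_path,
        "--model_output", model_output,
        "--traineddata", traineddata,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]


def convert_cmd(checkpoint, traineddata, target_path):
    # Step 4: turn the training checkpoint into a .traineddata file.
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", target_path,
    ]


def build_commands(traineddata, listfile, font_name, target_path,
                   lstm_path="eng.lstm", max_iterations=10):
    # The checkpoint name follows the steps above: <model_output>_checkpoint.
    return [
        extract_lstm_cmd(traineddata, lstm_path),
        finetune_cmd(lstm_path, traineddata, listfile, font_name, max_iterations),
        convert_cmd(f"{font_name}_checkpoint", traineddata, target_path),
    ]


def run_steps(commands, dry_run=True):
    # Print each command; execute it only when dry_run is False.
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run_steps(build_commands("tessdata/eng.traineddata", "eng.training_files.txt", "font_name", "target_path")) only prints the three commands; pass dry_run=False to actually execute them once the lstmf files from step 2 exist.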