tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Info: tesstrain on Windows #212

Closed Shreeshrii closed 3 years ago

Shreeshrii commented 3 years ago

Came across a blog post detailing how to run tesstrain on Windows.

https://livezingy.com/train-tesseract-lstm-with-make-on-windows-2/

Related repo:

https://github.com/livezingy/tesstrain-win

The following relate to the use of tesstrain.sh:

https://livezingy.com/train-tesseract-lstm-with-tesstrain-sh-on-windows/

https://github.com/livezingy/tesstrainsh-win

Shreeshrii commented 3 years ago

Details about how the makefile works - you may need to use Google translate to convert page from chi_sim to English.

https://livezingy.com/how-the-makefile-in-tesstrain-win-work/

kba commented 3 years ago

This illustration is really helpful, maybe we can adapt it and add to README?

kba commented 3 years ago

https://livezingy.com/how-the-makefile-in-tesstrain-win-work/ is a fairly thorough explanation of how makefiles work in general and the tesstrain Makefile in particular. Would be really helpful to have this in English.

Shreeshrii commented 3 years ago

Google Translate version copied below for reference and improvement:

tesstrain-win can train Tesseract LSTM with make under Windows, using line images and their corresponding transcription text. It is derived from tesseract-ocr/tesstrain, with some changes to the Makefile and the file structure. This article takes the Makefile in tesstrain-win as an example to record the training process and working principle of Train Tesseract LSTM with make.

Note: I only came into contact with Makefiles recently, in order to train Tesseract LSTM with make under Windows, so I am a beginner. This article is a personal study note and record of my experience; please read it critically. I would be very grateful to anyone willing to point out errors or misunderstandings in it.

References:

GNU make

Make command tutorial

Basic principles of the Makefile

Turning code into an executable file is called compiling; deciding to compile this file first and that file afterwards (that is, arranging the order of compilation) is called building. Make is the most commonly used build tool. It was born in 1977 and is mainly used for C-language projects, but in fact any project in which some files must be rebuilt whenever certain other files change can be built with Make.

As the saying goes, nothing can be accomplished without rules: the build rules of make are written in the Makefile.

The Makefile file consists of a series of rules (rules). The form of each rule is as follows.

<target> : <prerequisites>
[tab] <commands>

The part before the colon in the first line is called the "target", and the part after the colon is called the "prerequisites"; the second line must start with a tab character, followed by the "commands".

The "target" is required and cannot be omitted; the "prerequisites" and "commands" are both optional, but at least one of the two must be present.

The pound sign (#) indicates a comment in the Makefile.

Each rule states two things: what the prerequisites for building a target are, and how to build it.

The prerequisites are usually a set of file names separated by spaces. They specify the criterion for deciding whether the "target" should be rebuilt: whenever a prerequisite file does not exist or has been updated (its last-modification timestamp is newer than the target's timestamp), the "target" needs to be rebuilt.

One or more target files depend on the files listed in the prerequisites, and the generation rules are defined in the commands. If any file in the prerequisites is newer than the target file, the commands are executed. This is the rule of the Makefile, and it is its core content.
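The timestamp rule described above can be sketched in plain shell. This is an illustrative sketch, not part of the tesstrain Makefile: the file names and the rebuild_if_needed helper are made up, and the shell's -nt ("newer than") test stands in for make's timestamp comparison.

```shell
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/input.txt"   # the prerequisite

rebuild_if_needed() {
  # Rebuild target ($1) from prerequisite ($2) when the target is
  # missing or older than the prerequisite, as a Makefile rule would.
  if [ ! -e "$1" ] || [ "$2" -nt "$1" ]; then
    cp "$2" "$1"
    echo "rebuilt $(basename "$1")"
  else
    echo "$(basename "$1") is up to date"
  fi
}

first=$(rebuild_if_needed "$workdir/output.txt" "$workdir/input.txt")
second=$(rebuild_if_needed "$workdir/output.txt" "$workdir/input.txt")
echo "$first; $second"
```

Running the check twice mirrors what make does: the first invocation rebuilds the missing target, the second finds nothing newer and leaves it alone.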

How the makefile in tesstrain-win works

According to GNU make, "--trace" prints each command being executed, together with its result, on the command line. From this we can roughly understand the order in which Train Tesseract LSTM with make executes its commands.

1. make training --trace

In Train Tesseract LSTM with make, after making the preparations, we run the following command on the command line:

make training --trace

According to the basic rules of makefile, the first target to be built is "training",

training: $(OUTPUT_DIR).traineddata

It has no commands of its own, and its prerequisite, data/foo.traineddata, does not exist yet. So execution continues further down.

2. $(ALL_GT): $(wildcard $(GROUND_TRUTH_DIR)/*.gt.txt)

$(ALL_GT): $(wildcard $(GROUND_TRUTH_DIR)/*.gt.txt)
    # Create the data/foo directory if it does not exist yet
    @mkdir -p $(OUTPUT_DIR)
    # Find all *.gt.txt files under GROUND_TRUTH_DIR, concatenate their
    # contents into data/foo/all-gt in sorted order, dropping duplicates
    find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs cat | sort | uniq > "$@"

$(ALL_GT) = data/foo/all-gt. This file does not exist yet, but its prerequisites do: the prepared training data set in data/foo-ground-truth contains the .gt.txt files.

According to the basic rules of the Makefile, the prerequisites' timestamps are newer than the (missing) target's, so the commands are executed to rebuild the target.

What the commands accomplish: find all files with the suffix .gt.txt under the GROUND_TRUTH_DIR path, concatenate their contents into the file data/foo/all-gt in sorted order, and remove duplicate lines.
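The effect of this pipeline can be reproduced on a toy data set; the temporary directory and the file names a.gt.txt / b.gt.txt below are made up for the illustration.

```shell
gtdir=$(mktemp -d)
printf 'world\nhello\n' > "$gtdir/a.gt.txt"
printf 'hello\n'        > "$gtdir/b.gt.txt"

# Same pipeline as in the Makefile rule: concatenate, sort, deduplicate
result=$(find "$gtdir" -name '*.gt.txt' | xargs cat | sort | uniq)
echo "$result"
```

The duplicate "hello" line survives only once. Note that `sort | uniq` could equivalently be written `sort -u`.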

3. ifdef START_MODEL

Since $(ALL_GT) is the prerequisite of the two targets inside this conditional, execution continues here next.

ifdef START_MODEL
$(OUTPUT_DIR)/unicharset: $(ALL_GT)
     @mkdir -p data/$(START_MODEL)
     combine_tessdata -u $(TESSDATA)/$(START_MODEL).traineddata  data/$(START_MODEL)/$(MODEL_NAME)
     unicharset_extractor --output_unicharset "$(OUTPUT_DIR)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_GT)"
     merge_unicharsets data/$(START_MODEL)/$(MODEL_NAME).lstm-unicharset $(OUTPUT_DIR)/my.unicharset "$@"
else
$(OUTPUT_DIR)/unicharset: $(ALL_GT)
     @mkdir -p $(OUTPUT_DIR)
     unicharset_extractor --output_unicharset "$@" --norm_mode $(NORM_MODE) "$(ALL_GT)"
endif 

If START_MODEL has been set, the combine_tessdata, unicharset_extractor and merge_unicharsets commands are executed in turn; if START_MODEL is not set, only unicharset_extractor is executed.

combine_tessdata: used to combine/extract/overwrite/list/compress the tessdata components in [lang].traineddata files. With the -u option, all components can be unpacked to the specified path. These components include .lstm, .lstm-number-dawg, .lstm-punc-dawg, .lstm-recoder, .lstm-unicharset, .lstm-word-dawg and .version.

unicharset_extractor: according to the given parameters, extracts the character set (unicharset) from .box or plain text files.

merge_unicharsets: combines the two character sets given as parameters into one and stores it in data/foo/unicharset.

4. .PRECIOUS: %.box

After step 3 completes, its targets do not trigger any further rule through prerequisites, so execution continues sequentially with this statement, from the point reached in step 2.

Targets named as prerequisites of .PRECIOUS receive special handling: if make is killed or interrupted while their recipe is running, the target is not deleted; and if the target is an intermediate file, it is not deleted once it is no longer needed.

.PRECIOUS: %.box
%.box: %.png %.gt.txt
     PYTHONIOENCODING=utf-8 python generate_line_box.py -i "$*.png" -t "$*.gt.txt" > "$@"
%.box: %.bin.png %.gt.txt
     PYTHONIOENCODING=utf-8 python generate_line_box.py -i "$*.bin.png" -t "$*.gt.txt" > "$@"
%.box: %.nrm.png %.gt.txt
     PYTHONIOENCODING=utf-8 python generate_line_box.py -i "$*.nrm.png" -t "$*.gt.txt" > "$@"
%.box: %.tif %.gt.txt
     PYTHONIOENCODING=utf-8 python $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@" 

This block of code handles the different image formats differently. Our test data consists of .tif and .gt.txt files, so the code finally executed is:

PYTHONIOENCODING=utf-8 python generate_line_box.py -i "data/foo-ground-truth/FILENAME.tif" -t "data/foo-ground-truth/FILENAME.gt.txt" > "data/foo-ground-truth/FILENAME.box"

Its purpose is to call generate_line_box.py to generate the corresponding .box file from the .tif and .gt.txt files. Compared with the jTessBoxEditor + lstmtraining approach, this code replaces the step of manually adjusting and generating boxes in jTessBoxEditor. Note, however, that the .box file generated here does not contain the position of each individual character, but the position of the entire line of text.

5. %.lstmf: %.box

The prerequisite %.box of the target %.lstmf has been updated, so this rule is executed next.

@if test -f "$*.png"; then \
   image="$*.png"; \
elif test -f "$*.bin.png"; then \
   image="$*.bin.png"; \
elif test -f "$*.nrm.png"; then \
   image="$*.nrm.png"; \
else \
   image="$*.tif"; \
fi; \
set -x; \
tesseract "$${image}" $* --psm $(PSM) lstm.train

Here, the variable image is first assigned according to which image format is present in the test data set.

set -x; makes the shell print each executed command and its arguments from this point on.

Then the tesseract training command of tesseract-ocr is executed with the given parameters, using the .tif and .box files to generate the .lstmf files for LSTM training.

After this command completes, execution goes back to step 4 to process the next image, and so on, looping until all images in data/foo-ground-truth have their corresponding .box and .lstmf files.
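The image-selection chain at the top of this recipe can be run as a standalone shell sketch; the temporary directory and the page.* file names below are hypothetical. Only page.bin.png exists here, so it is the one picked.

```shell
imgdir=$(mktemp -d)
: > "$imgdir/page.bin.png"   # create only the binarized variant
stem="$imgdir/page"

# Same fall-through order as the Makefile recipe: .png, .bin.png,
# .nrm.png, and finally .tif as the default
if test -f "$stem.png"; then
  image="$stem.png"
elif test -f "$stem.bin.png"; then
  image="$stem.bin.png"
elif test -f "$stem.nrm.png"; then
  image="$stem.nrm.png"
else
  image="$stem.tif"
fi
echo "$image"
```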

6. $(ALL_LSTMF)

$(ALL_LSTMF): $(patsubst %.gt.txt,%.lstmf,$(wildcard $(GROUND_TRUTH_DIR)/*.gt.txt))
    @mkdir -p $(OUTPUT_DIR)
    find $(GROUND_TRUTH_DIR) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

%.lstmf appears among the prerequisites of this rule. After the loops of steps 4 and 5 have finished, execution ends up here. Before explaining this line of code, let's first learn:

$(patsubst pattern,replacement,text)

patsubst means: check whether each word in text (words are separated by spaces, tabs or newlines) matches pattern, and if it matches, replace it with replacement. Here, pattern can include the wildcard "%", which matches a string of any length. If replacement also contains "%", then the "%" in replacement stands for the string matched by the "%" in pattern.

The function returns the replaced string.

Example:

$(patsubst %.c,%.o, a.c b.c)

Replace the words in the string "a.c b.c" that match the pattern [%.c] with [%.o]; the result returned is "a.o b.o".

$(patsubst %.gt.txt,%.lstmf,$(wildcard $(GROUND_TRUTH_DIR)/*.gt.txt))

According to the above explanation, the patsubst call in the Makefile means: take the file names of the .gt.txt files as the text, and replace each name matching the pattern [%.gt.txt] with [%.lstmf].
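patsubst is a make function, so it cannot be invoked from a plain shell, but the same %.gt.txt to %.lstmf rewrite can be sketched with POSIX parameter expansion; the paths below are examples only.

```shell
# Hypothetical ground-truth file names, as patsubst would receive them
names="data/foo-ground-truth/page1.gt.txt data/foo-ground-truth/page2.gt.txt"
out=""
for f in $names; do
  # ${f%.gt.txt} strips the old suffix; then append the new one
  out="$out${out:+ }${f%.gt.txt}.lstmf"
done
echo "$out"
```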

find $(GROUND_TRUTH_DIR) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

The actual execution code of the above code is:

find data/foo-ground-truth -name '*.lstmf' -exec echo {} \; | sort -R -o "data/foo/all-lstmf"

Find all files with the suffix ".lstmf" under the path data/foo-ground-truth, and write their file names into data/foo/all-lstmf in random order (sort -R). After this line of code is executed, part of the content of all-lstmf is:

data/ground-truth/alexis_ruhe01_1852_0018_022.lstmf
data/ground-truth/alexis_ruhe01_1852_0035_019.lstmf
data/ground-truth/alexis_ruhe01_1852_0087_027.lstmf 
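That sort -R only shuffles the list, without adding or losing names, can be checked with a small sketch using empty stand-in .lstmf files (the directory and names are made up); -R is available in GNU and BSD sort.

```shell
listdir=$(mktemp -d)
for i in 1 2 3 4 5; do : > "$listdir/sample$i.lstmf"; done

# Same command shape as the Makefile rule: list, shuffle, write to file
find "$listdir" -name '*.lstmf' -exec echo {} \; | sort -R -o "$listdir/all-lstmf"

count=$(wc -l < "$listdir/all-lstmf" | tr -d ' ')
echo "$count"
```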

7. $(OUTPUT_DIR)/list.train: $(ALL_LSTMF)

According to the relationship between targets and prerequisites, this code is executed next. In this block, the two targets share a common prerequisite:

$(OUTPUT_DIR)/list.eval \
$(OUTPUT_DIR)/list.train: $(ALL_LSTMF)
    @mkdir -p $(OUTPUT_DIR)
    @total=$$(wc -l < $(ALL_LSTMF)); \
    train=$$(echo "$$total * $(RATIO_TRAIN) / 1" | bc); \
    test "$$train" = "0" && \
      echo "Error: missing ground truth for training" && exit 1; \
    eval=$$(echo "$$total - $$train" | bc); \
    test "$$eval" = "0" && \
      echo "Error: missing ground truth for evaluation" && exit 1; \
    set -x; \
    head -n "$$train" $(ALL_LSTMF) > "$(OUTPUT_DIR)/list.train"; \
    tail -n "$$eval" $(ALL_LSTMF) > "$(OUTPUT_DIR)/list.eval"

In the above code segment, the purpose of wc -l is to count the number of lines in a file.

Therefore, the general idea of this code is: count the number of lines in ALL_LSTMF and, according to the value of the variable RATIO_TRAIN, allocate the amount of training data (train) and the amount of evaluation data (eval). If either train or eval is 0, the Makefile terminates with an error.

If train and eval meet the conditions, the first train lines of ALL_LSTMF are written into the file data/foo/list.train, and the last eval lines into the file data/foo/list.eval.
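The split can be re-traced on a toy list of ten names. This sketch assumes RATIO_TRAIN=0.90 and uses awk in place of bc for the truncating arithmetic; the file names are made up.

```shell
splitdir=$(mktemp -d)
for i in 0 1 2 3 4 5 6 7 8 9; do echo "line$i.lstmf"; done > "$splitdir/all-lstmf"

total=$(wc -l < "$splitdir/all-lstmf")
# printf "%d" truncates, matching bc's integer division by 1
train=$(awk -v t="$total" 'BEGIN { printf "%d", t * 0.90 }')
eval_n=$((total - train))

# First $train lines go to list.train, last $eval_n lines to list.eval
head -n "$train"  "$splitdir/all-lstmf" > "$splitdir/list.train"
tail -n "$eval_n" "$splitdir/all-lstmf" > "$splitdir/list.eval"
```

With ten lines and a 0.90 ratio this yields nine training lines and one evaluation line; because head and tail slice the same file, every line lands in exactly one of the two lists.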

8. $(PROTO_MODEL)

The list.train and list.eval generated in step 7 are not used as prerequisites of other targets, so execution continues sequentially from the end of step 5.

However, I am not sure whether lines 206~225 are executed, when they are executed, and what they are for; that part of the code is not printed by make training --trace. If you know, please feel free to enlighten me.

So here we go straight to line 228.

proto-model: $(PROTO_MODEL)
$(PROTO_MODEL): $(OUTPUT_DIR)/unicharset data/radical-stroke.txt
    combine_lang_model \
      --input_unicharset $(OUTPUT_DIR)/unicharset \
      --script_dir data \
      --numbers $(NUMBERS_FILE) \
      --puncs $(PUNC_FILE) \
      --words $(WORDLIST_FILE) \
      --output_dir data \
      $(RECODER) \
      --lang $(MODEL_NAME)

This piece of code generates a starter traineddata file (.traineddata), which can then be used to train an LSTM-based neural network model. It takes a unicharset and an optional set of word lists as input.

$(PROTO_MODEL): $(OUTPUT_DIR)/unicharset data/radical-stroke.txt: the Makefile looks for the radical-stroke.txt file under the data path; if it is missing, it is downloaded automatically via wget. To make training go smoothly, it is recommended to download the file in advance and place it under the execution path.

script_dir data: should point to the directory containing the *.unicharset files. For English and other Latin-based scripts, the file is Latin.unicharset. It is recommended to download it in advance and place it in the specified path.

After training this way, some users may see the following warning:

Failed to load any lstm-specific dictionaries for lang led!!

This warning is related to the numbers/puncs/words lines in this code segment. For related discussion, see Failed to load any lstm-specific dictionaries for lang xxx.

I personally think there is a bug in the Makefile of tesseract-ocr/tesstrain. Although the files behind the numbers/puncs/words parameters are optional, if they are missing, the model produced by training still recognizes text normally, but it reports the "Failed to load any lstm-specific dictionaries for lang led!!" warning. In the Makefile, if START_MODEL is not set, the Makefile does not generate any of the related files automatically; if it is set, the Makefile extracts files with the suffixes .lstm-number-dawg/.lstm-punc-dawg/.lstm-word-dawg, but these are not the files training needs; further commands are required to convert them into the correct files.

In tesstrain-win I did not change this; instead, you can prepare the numbers/puncs/words files of the corresponding base model before training and put them in the paths specified by the tesstrain-win Makefile to avoid this bug.

9. lstmtraining

The file generated in step 8 is a prerequisite of the following code segment:

$(LAST_CHECKPOINT): unicharset lists $(PROTO_MODEL)
    @mkdir -p $(OUTPUT_DIR)/checkpoints
    lstmtraining \
      --debug_interval $(DEBUG_INTERVAL) \
      --traineddata $(PROTO_MODEL) \
      --old_traineddata $(TESSDATA)/$(START_MODEL).traineddata \
      --continue_from data/$(START_MODEL)/$(MODEL_NAME).lstm \
      --model_output $(OUTPUT_DIR)/checkpoints/$(MODEL_NAME) \
      --train_listfile $(OUTPUT_DIR)/list.train \
      --eval_listfile $(OUTPUT_DIR)/list.eval \
      --max_iterations $(MAX_ITERATIONS)

This piece of code officially starts training the LSTM; the actual statement executed differs slightly depending on whether START_MODEL is set. The code above is what is executed when START_MODEL is set. Parameters that need attention:

--traineddata: the path of the starter traineddata file generated by combine_lang_model;

--old_traineddata: the path of the base model file; in this article it is tessdata/eng.traineddata.

If START_MODEL is set, training starts from an existing base model, which is fine-tuning; if START_MODEL is not set, it is training from scratch.

Comparing the two approaches, the differences are as follows:

The two parameters unique to fine-tuning are the path of the base model and the starting point of training; training starts from the .lstm file of the base model.

--old_traineddata $(TESSDATA)/$(START_MODEL).traineddata \
--continue_from data/$(START_MODEL)/$(MODEL_NAME).lstm \ 

The two parameters unique to training from scratch set the structure of the neural network and the learning rate used during training.

--net_spec "$(subst c###,c`head -n1 $(OUTPUT_DIR)/unicharset`,$(NET_SPEC))" \
--learning_rate 20e-4 \ 

10. stop_training

After the training is complete, the checkpoint file and the .traineddata file are merged into a new .traineddata file, and the training is finished.

lstmtraining \
  --stop_training \
  --continue_from $(LAST_CHECKPOINT) \
  --traineddata $(PROTO_MODEL) \
  --model_output $@

This concludes this article, thank you for reading, thank you for your support.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
