tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Dropout layers for Tesseract #4252

Open yaofuzhou opened 1 month ago

yaofuzhou commented 1 month ago

Your Feature Request

I am trying to implement dropout layers for Tesseract. For now, the hope is to enable something like "Dr0.2" in the VGSL specs syntax. I have implemented some of the code, but have encountered a few issues, and I figure this may be the place for discussion.
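
As a concrete reference for the intended syntax, here is a minimal, self-contained sketch of how a "Dr<rate>" element could be pulled out of a VGSL spec string. It is not the actual NetworkBuilder code; the function name and error handling are purely illustrative:

    #include <cstdio>
    #include <cstdlib>

    // Illustrative parser for a hypothetical "Dr<rate>" VGSL element, e.g. "Dr0.2".
    // Advances *str past the element and returns the dropout rate, or -1 on error.
    static double ParseDropoutSpec(const char **str) {
      if ((*str)[0] != 'D' || (*str)[1] != 'r') return -1.0;
      char *end = nullptr;
      double rate = std::strtod(*str + 2, &end);
      if (end == *str + 2 || rate < 0.0 || rate >= 1.0) return -1.0;  // malformed or out of range
      *str = end;
      return rate;
    }

    int main() {
      const char *spec = "Dr0.2";
      double rate = ParseDropoutSpec(&spec);
      std::printf("dropout rate = %g\n", rate);  // prints: dropout rate = 0.2
      return 0;
    }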

  1. The files I have edited are
 Changes to be committed:
   (use "git restore --staged <file>..." to unstage)
    new file:   ../src/lstm/dropout.cpp
    new file:   ../src/lstm/dropout.h

 Changes not staged for commit:
   (use "git add <file>..." to update what will be committed)
   (use "git restore <file>..." to discard changes in working directory)
    modified:   ../Makefile.am
    modified:   ../configure.ac (for my own environment and irrelevant to the new dropout feature)
    modified:   ../src/lstm/fullyconnected.cpp
    modified:   ../src/lstm/network.cpp
    modified:   ../src/lstm/network.h
    modified:   ../src/training/common/networkbuilder.cpp
    modified:   ../src/training/common/networkbuilder.h
  2. The code compiles but does not run:
  ~/Documents/OCR/tesstrain_units_6 (main*) » make training
  make[1]: Entering directory '~/Documents/OCR/tesstrain_units_6'
  ~/Documents/OCR/tesseract_dr/build/combine_lang_model \
    --input_unicharset data/units/unicharset \
    --script_dir data/langdata \
    --numbers data/units/units.numbers \
    --puncs data/units/units.punc \
    --words data/units/units.wordlist \
    --output_dir data \
     \
    --lang units
  dyld[91402]: symbol not found in flat namespace '__ZN9tesseract7Network11DeSerializeEPNS_5TFileE'
  make[1]: *** [dr_training.mk:40: data/units/units.traineddata] Abort trap: 6
  make[1]: Leaving directory '~/Documents/OCR/tesstrain_units_6'
  make: *** [Makefile:17: training] Error 2

This is not surprising, as I am sure additional, essential modifications are needed in other parts of the codebase; see the sketch after this list for my current guess at one of them.

  3. Obviously, I need to be able to disable the dropout feature in deployed .traineddata models, for which I may need to further modify network.cpp. I would like to ask the community about the best practice for adding a new flag or switch for this purpose.

  4. Ideally, when continuing training from a checkpoint, I want to be able to adjust the dropout rate(s), including setting them to 0 (perhaps once training is converging). There is probably more than one way to do this, but I want to ask the community for the best practice.

  5. Let me know when you want to go over my already implemented modifications (that do not work yet).
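
My current guess at one of the extra places a new layer type needs to be wired in: as far as I can tell, deserialization goes through a factory (Network::CreateFromFile in src/lstm/network.cpp) that switches on the serialized NetworkType to construct the right subclass before calling DeSerialize on it, so a new dropout type would presumably need its own case there. The self-contained sketch below only mirrors that factory pattern; every name in it is illustrative rather than an upstream Tesseract API:

    #include <cstdio>
    #include <memory>

    // Self-contained analogue of the factory pattern used for deserialization:
    // a serialized type tag decides which subclass to construct before its
    // DeSerialize() would be called. All names here are illustrative.
    enum LayerType { kFullyConnected, kDropout };

    struct Layer {
      virtual ~Layer() = default;
      virtual const char *Name() const = 0;
    };

    struct FullyConnected : Layer {
      const char *Name() const override { return "FullyConnected"; }
    };

    struct Dropout : Layer {
      explicit Dropout(float rate) : rate(rate) {}
      const char *Name() const override { return "Dropout"; }
      float rate;
    };

    // Without a case for the new type, a saved model containing a dropout
    // layer could never be reconstructed -- the analogue of the extra wiring
    // a new Network subclass needs.
    std::unique_ptr<Layer> CreateFromType(LayerType type) {
      switch (type) {
        case kFullyConnected:
          return std::make_unique<FullyConnected>();
        case kDropout:
          return std::make_unique<Dropout>(0.0f);  // rate would be read later during deserialization
      }
      return nullptr;
    }

    int main() {
      std::unique_ptr<Layer> layer = CreateFromType(kDropout);
      std::printf("constructed: %s\n", layer->Name());
      return 0;
    }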

amitdo commented 1 month ago

Let me know when you want to go over my already implemented modifications (that do not work yet).

I suggest putting it in a feature branch in your GitHub fork of Tesseract, so other people can see it.

amitdo commented 1 month ago

I reformatted your comment.

amitdo commented 1 month ago

CC @bertsky,

Maybe you can help @yaofuzhou with this new feature.

stweil commented 1 month ago

I just pushed my own unfinished efforts: https://github.com/stweil/tesseract/tree/dropout.

yaofuzhou commented 1 month ago

[Edited]

This is my implementation of the dropout feature so far: https://github.com/yaofuzhou/tesseract. I have gone over @stweil's code, and it seems that we are approaching this in a very similar way.

There are aspects of @stweil's code that I can learn from, and I will try to incorporate them into my code, giving full credit to @stweil in the process.

My original description remains the same, namely -

  1. My code compiles but does not run. Specifically, the lstmtraining and tesseract binaries yield the error messages

    dyld[2292]: symbol not found in flat namespace '__ZN9tesseract7Network11DeSerializeEPNS_5TFileE'
    [1]    2292 abort      ./lstmtraining
    dyld[2292]: symbol not found in flat namespace '__ZN9tesseract7Network11DeSerializeEPNS_5TFileE'
    [1]    2329 abort      ./tesseract

    respectively, which means I am probably missing something elsewhere in the Tesseract codebase. I tried searching for convolve and maxpool to see where those parallel components show up, but have not found the solution. This is probably where I need help the most.

  2. I need to implement a flag/switch somewhere so that the dropout mechanism is only activated during training (running the lstmtraining binary) and not during normal usage (running the tesseract binary); see the first sketch after this list.

  3. Ideally, I need a mechanism to adjust the dropout_rate of each dropout layer when the lstmtraining binary continues from a checkpoint, as it may be desirable to turn dropout off once training converges to a good finish; see the second sketch after this list.
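
For point 2, my working assumption is that the training/inference distinction can hang off the training state that, as far as I can tell, the base Network class already tracks, so the dropout layer becomes a plain pass-through in the tesseract binary. Below is a minimal, self-contained sketch of that behaviour using inverted dropout, so inference needs no rescaling; the class and flag names are mine, not Tesseract's:

    #include <cstdio>
    #include <random>
    #include <vector>

    // Illustrative sketch, not the actual Network API: inverted dropout that
    // only fires in training mode, so recognition output is untouched and no
    // rescaling of activations is needed at inference time.
    class DropoutSketch {
     public:
      DropoutSketch(float rate, bool training)
          : rate_(rate), training_(training), rng_(std::random_device{}()) {}

      void Forward(std::vector<float> *activations) {
        if (!training_ || rate_ <= 0.0f) return;  // plain pass-through at inference
        std::bernoulli_distribution keep(1.0 - rate_);
        const float scale = 1.0f / (1.0f - rate_);
        for (float &a : *activations) {
          a = keep(rng_) ? a * scale : 0.0f;  // inverted dropout
        }
      }

     private:
      float rate_;
      bool training_;
      std::mt19937 rng_;
    };

    int main() {
      std::vector<float> activations = {0.5f, -1.0f, 2.0f, 0.25f};
      DropoutSketch layer(/*rate=*/0.2f, /*training=*/true);
      layer.Forward(&activations);
      for (float a : activations) std::printf("%g ", a);
      std::printf("\n");
      return 0;
    }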
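
For point 3, one option is to serialize the rate with the layer so checkpoints round-trip it, and expose a setter that a new lstmtraining option could call after the checkpoint is loaded (e.g. to drop the rate to 0 near the end of training). A self-contained sketch of that idea follows; the buffer-based Serialize/DeSerialize and every name here are stand-ins, not Tesseract's TFile API:

    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Illustrative only: the dropout rate is saved and restored with the layer,
    // and SetRate() is the hook a (hypothetical) lstmtraining option would use
    // to override it after a checkpoint has been loaded.
    class DropoutState {
     public:
      explicit DropoutState(float rate) : rate_(rate) {}

      // Append the rate to a byte buffer (stand-in for Serialize(TFile*)).
      void Serialize(std::vector<char> *out) const {
        const char *p = reinterpret_cast<const char *>(&rate_);
        out->insert(out->end(), p, p + sizeof(rate_));
      }

      // Read the rate back (stand-in for DeSerialize(TFile*)).
      bool DeSerialize(const std::vector<char> &in, std::size_t *offset) {
        if (*offset + sizeof(rate_) > in.size()) return false;
        std::memcpy(&rate_, in.data() + *offset, sizeof(rate_));
        *offset += sizeof(rate_);
        return true;
      }

      // Hook for overriding the stored rate when training resumes.
      void SetRate(float rate) { rate_ = rate; }
      float rate() const { return rate_; }

     private:
      float rate_;
    };

    int main() {
      DropoutState saved(0.2f);
      std::vector<char> checkpoint;
      saved.Serialize(&checkpoint);

      DropoutState resumed(0.0f);
      std::size_t offset = 0;
      resumed.DeSerialize(checkpoint, &offset);
      std::printf("restored rate = %g\n", resumed.rate());   // 0.2

      resumed.SetRate(0.0f);  // e.g. requested via a new command-line option
      std::printf("overridden rate = %g\n", resumed.rate()); // 0
      return 0;
    }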