tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Documenting the bugs and issues trying to rebuild eng.traineddata from scratch from langdata_lstm #292

Closed james-evy closed 2 years ago

james-evy commented 2 years ago

Hi, I'm sure this is not the right place to put this, but I will anyway. I'm trying to do the ill-advised "rebuild the eng.traineddata from scratch". I am learning as I go. So far, I have done a rebuild from scratch with just 20 or so lines of English text. I got a very overfitted neural net that ran without error but produced interesting, though wrong, OCR results. I'm now trying to use the https://github.com/tesseract-ocr/langdata_lstm data to recreate the traineddata file from scratch. I'm doing this on a MacBook Pro, using VMware with a current Ubuntu VM. I'm going to break this up by replying to this issue multiple times with the struggles I resolved along the way.

james-evy commented 2 years ago

Loading the fonts. The texgyre font is not put into /usr/share/fonts, so I created a symbolic link.

sudo apt install fonts-deva fonts-dejavu gsfonts ttf-mscorefonts-installer fonts-ebgaramond fonts-gfs-didot fonts-gfs-didot-classic fonts-junicode

sudo apt install fonts-noto-cjk fonts-japanese-mincho.ttf fonts-takao-gothic fonts-vlgothic

sudo apt install fonts-noto-cjk fonts-takao-gothic fonts-vlgothic
sudo apt install fonts-dejavu gsfonts ttf-mscorefonts-installer
sudo apt install fonts-ebgaramond fonts-gfs-didot fonts-gfs-didot-classic fonts-junicode

sudo apt install fonts-texgyre
sudo ln -s /usr/share/texmf/fonts/opentype/public/tex-gyre /usr/share/fonts/opentype/tex-gyre

james-evy commented 2 years ago

I created my workspace in my home directory, grabbed a copy of this site:

mkdir workspace_tesseract
cd workspace_tesseract

wget https://github.com/tesseract-ocr/tesstrain/archive/refs/heads/main.zip
unzip main.zip
rm -rf main.zip
mv tesstrain-main tesstrain

cd tesstrain

mkdir -p data
mkdir -p usr/share/tessdata

james-evy commented 2 years ago

text2image is not well behaved when running more than one concurrently. I have not debugged it, but it seems there is a fight when cleaning out the fontconfig directory that is passed in with the "--fontconfig_tmpdir=/tmp/font_tmp8jw5kl5d". tesstrain.py should call text2image with a separate "--fontconfig_tmpdir" for each instance it fires up. Since each text2image is being called with a different font anyway, there is no sense having a common cache.

Created a new issue #298, with a patch.
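A minimal sketch of the idea, not the patch from #298: build each text2image command with its own freshly created cache directory. The helper name, the default fonts directory, and the output naming are assumptions for illustration; the text2image flags used are the existing --fontconfig_tmpdir, --fonts_dir, --font, --text and --outputbase.

```python
import shlex
import tempfile
from pathlib import Path


def build_text2image_commands(fonts, text_file, out_dir, fonts_dir="/usr/share/fonts"):
    """Return one text2image command per font, each with a private fontconfig cache dir.

    Hypothetical helper: tesstrain.py does not contain a function with this name.
    """
    commands = []
    for font in fonts:
        # A fresh directory per invocation, so concurrent runs never
        # clean up each other's fontconfig cache.
        cache_dir = tempfile.mkdtemp(prefix="font_tmp")
        outputbase = Path(out_dir) / font.replace(" ", "_")
        commands.append([
            "text2image",
            f"--fontconfig_tmpdir={cache_dir}",
            f"--fonts_dir={fonts_dir}",
            f"--font={font}",
            f"--text={text_file}",
            f"--outputbase={outputbase}",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; the font names are examples only.
    for cmd in build_text2image_commands(["Arial", "DejaVu Sans"], "eng.training_text", "out"):
        print(shlex.join(cmd))
```

Each command can then be run in its own worker. The temporary directories are left for the caller to delete once all workers have finished.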

james-evy commented 2 years ago

requirements.txt is missing a dependency. I lost my notes, but could figure it out if I rebuild my Ubuntu VM from scratch. I will probably do that later.

james-evy commented 2 years ago

The sample text in langdata_lstm is too large to convert to lstmf files efficiently. When it's being processed by the tesseract binary, each process takes 2 GB of memory. I allocated 6 processors to my VM and wanted to use all of them. Even after giving the VM 12 GB of memory, it was hitting swap and crawling. After running for 3 days, Linux would panic.

There is no reason to create such large tiff and box files; there is no advantage. So I modified the Makefile to break the text file up into chunks of 10K lines and create multiple ground-truth directories. I'm running it now. Out of 18 chunks, 15 went through; the other 4 crashed because of the text2image concurrency bug mentioned above. I then added a new entry to my Makefile to rerun them. They are running as I type this. So far so good. The next entry will be my current Makefile, along with the "ground-truth-fix" entry to rerun these 4 chunks.
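The chunking step can be sketched in Python as follows. This is an illustration of the approach, not the contents of the attached Makefile; the function name and the ground-truth-NN directory naming are assumptions.

```python
from itertools import islice
from pathlib import Path


def split_training_text(src, out_root, lines_per_chunk=10_000):
    """Split src into numbered chunk directories under out_root.

    Returns the list of chunk file paths. The directory naming scheme
    (ground-truth-00, ground-truth-01, ...) is hypothetical.
    """
    out_root = Path(out_root)
    chunk_paths = []
    with open(src, encoding="utf-8") as fh:
        index = 0
        while True:
            # Read at most lines_per_chunk lines without loading the whole file.
            lines = list(islice(fh, lines_per_chunk))
            if not lines:
                break
            chunk_dir = out_root / f"ground-truth-{index:02d}"
            chunk_dir.mkdir(parents=True, exist_ok=True)
            chunk_path = chunk_dir / "eng.training_text"
            chunk_path.write_text("".join(lines), encoding="utf-8")
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Because each chunk is rendered and converted on its own, a crashed chunk can be rerun without redoing the others.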

I will have to modify the $(GROUND_TRUTH_DIR)-done: entry to combine all the eng.training_files.txt files into one list.train file.
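A rough sketch of that combining step, written in Python rather than as a Makefile rule. It assumes each chunk directory sits directly under one root and holds an eng.training_files.txt with one .lstmf path per line; the function name and that layout are assumptions.

```python
from pathlib import Path


def combine_training_lists(root, pattern="*/eng.training_files.txt", out_name="list.train"):
    """Merge the per-chunk file lists under root into a single list file.

    Duplicate entries are dropped and first-seen order is kept.
    Returns the merged entries.
    """
    root = Path(root)
    seen = set()
    entries = []
    for list_file in sorted(root.glob(pattern)):
        for line in list_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                entries.append(line)
    # Write one path per line, ending with a newline when there is any content.
    (root / out_name).write_text("".join(e + "\n" for e in entries), encoding="utf-8")
    return entries
```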

james-evy commented 2 years ago

Makefile.txt

james-evy commented 2 years ago

Also to mention.

I did not compile tesseract on this machine. I just did an apt install tesseract-ocr. I call the Makefile like this:
make --debug=v -f ~/Makefile ground-truth-eval MODEL_NAME=eng
make --debug=v -f ~/Makefile ground-truth MODEL_NAME=eng
After fixing up the Makefile to combine the list of all *.lstmf files into the list.train file, I will call:
make --debug=v -f ~/Makefile ground-truth-done MODEL_NAME=eng

Then if all is well, I will actually train the neural network with: make --debug=v training MODEL_NAME=eng

james-evy commented 2 years ago

If anyone with knowledge knows what is really in the published traineddata files, let me know.

james-evy commented 2 years ago

View of "top" when creating the lstmf files. 100% of the 6 CPUs allocated to the VM. Note: I changed "max_workers=2" to "max_workers=6" in tesstrain_utils.py. Memory use per process doesn't get past 3% because of the smaller files.

15342 james 20 0 305972 265224 11128 R 100.0 2.2 8:39.54 tesseract
15344 james 20 0 318372 252988 11084 R 100.0 2.1 8:26.62 tesseract
15350 james 20 0 247188 208504 11352 R 100.0 1.7 8:01.98 tesseract
15353 james 20 0 239524 200516 11296 R 100.0 1.6 6:53.79 tesseract
15343 james 20 0 245364 206412 11204 R 99.7 1.7 8:37.83 tesseract
15345 james 20 0 252596 213712 11156 R 98.7 1.7 8:22.47 tesseract

james-evy commented 2 years ago

Another note: it takes 75 minutes to process 10,000 lines with 32 fonts into .lstmf files, using 6 CPUs. At 18 chunks, that is roughly 22.5 hours, so about a day. I have no idea how long it will take to train.

Also, I didn't mention it, but the modified Makefile will create a ground-truth-eng-eval folder for evaluation. I used 10% of the text data for this, instead of taking 10% of the .lstmf files, since originally there would be one .lstmf file per font. It seems better to train with all fonts and evaluate with all fonts.
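Splitting at the text level, before any font is rendered, could look like the sketch below. How the attached Makefile actually selects the evaluation lines is not shown in this thread, so taking every tenth line is an assumption, as is the function name.

```python
def split_train_eval(lines, eval_every=10):
    """Send every eval_every-th line to the evaluation set, the rest to training.

    With eval_every=10 this gives a 90:10 split of the text itself, so both
    sets can later be rendered with every font.
    """
    train, evaluation = [], []
    for i, line in enumerate(lines):
        if i % eval_every == eval_every - 1:
            evaluation.append(line)
        else:
            train.append(line)
    return train, evaluation
```

Each of the two resulting text files can then be rendered with the full font list, which is what makes the evaluation set cover all fonts.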

james-evy commented 2 years ago

Ran the training. Got down to 8.949 char error before iterations ran out. As it was exiting, the async tester thread spewed much stuff as it wasn't finished when the lstmtrainer cleaned up.

Finished! Error rate = 8.949
Empty truth string!
Can't encode transcription: '(null)' in language '(null)'
....
Empty truth string!
Can't encode transcription: '(null)' in language '(null)'
libpng error: Not a PNG file
Error in pixReadMemPng: internal png error
Error in pixReadMem: png: no pix returned
src_pix != nullptr:Error:Assert failed:in file imagedata.cpp, line 234

See this line in lstmtraining.cpp:
tester_callback = NewPermanentTessCallback(&tester, &tesseract::LSTMTester::RunEvalAsync);
Maybe call something like this in lstmtester.cpp, in a loop while exiting:
bool LSTMTester::LockIfNotRunning()
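The general pattern being suggested, waiting for the background evaluation to finish before tearing anything down, can be illustrated with a toy Python example. This is not Tesseract code; the class and method names are made up for the illustration, and the sleep stands in for real evaluation work.

```python
import threading
import time


class AsyncEvaluator:
    """Toy stand-in for an evaluator that runs on a background thread."""

    def __init__(self):
        self._thread = None
        self.results = []

    def run_eval_async(self, iteration):
        # Start a new evaluation only if the previous one has finished.
        if self._thread is not None and self._thread.is_alive():
            return False
        self._thread = threading.Thread(target=self._evaluate, args=(iteration,))
        self._thread.start()
        return True

    def _evaluate(self, iteration):
        time.sleep(0.05)  # placeholder for the real evaluation work
        self.results.append(iteration)

    def wait_until_idle(self):
        # Block until any in-flight evaluation completes, so teardown is safe.
        if self._thread is not None:
            self._thread.join()


if __name__ == "__main__":
    evaluator = AsyncEvaluator()
    evaluator.run_eval_async(1)
    evaluator.wait_until_idle()  # call this before freeing shared data
    print(evaluator.results)
```

The point is only the ordering: the trainer calls the wait step before releasing the data the evaluation thread is still reading.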

Shreeshrii commented 2 years ago

As it was exiting, the async tester thread spewed much stuff as it wasn't finished when the lstmtrainer cleaned up.

Please open a new issue in tesseract repo with details regarding this error. Will help in fixing training related code.

Shreeshrii commented 2 years ago

I'm trying to do the ill advised "rebuild the eng.traineddata from scratch"

Don't think anyone has had success in rebuilding eng.traineddata from scratch, though people do train for their own specific use cases for a single font.

use the https://github.com/tesseract-ocr/langdata_lstm

It seems to me that you are using the larger training text from the repo but not the larger font list. For LSTM training, Ray is supposed to have used many many more fonts. See https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/okfonts.txt

The tesstrain.py script has a much smaller list of fonts, which were used for Tesseract 3 (AFAIK).

Shreeshrii commented 2 years ago

it seems there is a fight when cleaning out the fontconfig directory that is passed in with the "--fontconfig_tmpdir=/tmp/font_tmp8jw5kl5d". tesstrain.py should call text2image with a separate "--fontconfig_tmpdir" for each instance it fires up. Since each text2image is being called with a different font anyway, there is no sense having a common cache. I didn't fix this bug. I just rerun the fonts that failed.

It would be useful if you created a PR to address this bug.

Shreeshrii commented 2 years ago

the modified make file will create a ground-truth-eng-eval folder for evaluation. I used 10% of the text data to do this, instead of taking 10% of the .lstmf file, since originally, there would be one .lstmf file per font. Seems better to train with all fonts, and eval with all fonts.

Please create a separate PR for this feature also.

Edit: The 90:10 split works when the training is done using single-line images, which is what the tesstrain Makefile is intended to work with. tesstrain.py is the Python version of the old tesstrain.sh training method from the Tesseract 3 days, which was modified for Tesseract 4.

Shreeshrii commented 2 years ago

I have no idea how long it will take to train.

Probably days or weeks. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR and other pages in the wiki.

james-evy commented 2 years ago

it seems there is a fight when cleaning out the fontconfig directory that is passed in with the "--fontconfig_tmpdir=/tmp/font_tmp8jw5kl5d". tesstrain.py should call text2image with a separate "--fontconfig_tmpdir" for each instance it fires up. Since each text2image is being called with a different font anyway, there is no sense having a common cache. I didn't fix this bug. I just rerun the fonts that failed.

It would be useful if you created a PR to address this bug.

#298

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.