Closed: james-evy closed this issue 2 years ago
Loading the fonts. The tex-gyre font is not put into /usr/share/fonts, so I created a symbolic link:

```shell
sudo apt install fonts-deva fonts-dejavu gsfonts ttf-mscorefonts-installer fonts-ebgaramond fonts-gfs-didot fonts-gfs-didot-classic fonts-junicode
sudo apt install fonts-noto-cjk fonts-takao-gothic fonts-vlgothic
sudo apt install fonts-texgyre
sudo ln -s /usr/share/texmf/fonts/opentype/public/tex-gyre /usr/share/fonts/opentype/tex-gyre
```
I created my workspace in my home directory and grabbed a copy of this repo:

```shell
mkdir workspace_tesseract
cd workspace_tesseract
wget https://github.com/tesseract-ocr/tesstrain/archive/refs/heads/main.zip
unzip main.zip
rm -rf main.zip
mv tesstrain-main tesstrain
cd tesstrain
mkdir -p data
mkdir -p usr/share/tessdata
```
text2image is not well behaved when more than one instance runs concurrently. I have not debugged it, but there seems to be a race when cleaning out the fontconfig directory that is passed in with `--fontconfig_tmpdir=/tmp/font_tmp8jw5kl5d`. tesstrain.py should pass a separate `--fontconfig_tmpdir` to each text2image instance it fires up. Since each text2image instance is called with a different font anyway, there is no point in sharing a common cache.
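A minimal sketch of the suggested fix, assuming tesstrain.py shells out to text2image once per font. The helper name and argument list here are my own illustration, not actual tesstrain.py code:

```python
import tempfile

def build_text2image_cmd(font, extra_args=()):
    """Build a text2image command line with a private fontconfig cache dir.

    tempfile.mkdtemp() returns a unique directory per call, so two
    concurrent invocations can never race while cleaning the same cache.
    """
    fontconfig_dir = tempfile.mkdtemp(prefix="font_tmp")
    return [
        "text2image",
        f"--font={font}",
        f"--fontconfig_tmpdir={fontconfig_dir}",
        *extra_args,
    ]

# Each worker would then run its own command,
# e.g. subprocess.run(build_text2image_cmd("DejaVu Sans"), check=True)
```

Because the temp directory is created fresh for every invocation, no two workers ever share (or fight over cleaning) a cache.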
Created a new issue #298 with a patch.
requirements.txt is missing a dependency. I lost my notes, but could figure it out if I rebuild my Ubuntu from scratch. I will probably do that later.
The sample text in langdata_lstm is too large to efficiently convert to .lstmf files. When it is processed by the tesseract binary, each process takes 2 GB of memory. I allocated 6 processors to my VM and wanted to utilize each of them. Even with 12 GB of memory for the VM, it was hitting swap and crawling. After running for 3 days, Linux would panic.
There is no reason to create such large tiff and box files; there is no advantage. So I modified the make file to break the text file up into 10K-line chunks and create multiple ground-truth directories. I'm currently running it now. Out of 18 chunks, 15 went through; the other 4 crashed because of the text2image concurrency bug mentioned above. I then added a new entry to my make file to rerun them. They are running as I type this, and so far so good. My next comment will be my current Makefile along with the "ground-truth-fix" entry to rerun these 4 chunks.
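The chunking idea can be sketched like this (a simple illustration of the splitting step only, not the actual Makefile logic):

```python
def split_into_chunks(lines, chunk_size=10_000):
    """Yield successive chunks of at most chunk_size lines each."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

# 180,000 lines of training text would become 18 chunks of 10,000 lines,
# each written to its own ground-truth directory for rendering/boxing.
```

Smaller chunks keep each tesseract process's memory footprint small, which is what lets all 6 CPUs stay busy without hitting swap.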
I will have to modify the `$(GROUND_TRUTH_DIR)-done:` entry to combine all the eng.training_files.txt files into one list.train file.
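That combining step might look like the following sketch. The file names follow the ones mentioned above, but the helper itself is hypothetical, not code from the repo:

```python
from pathlib import Path

def combine_training_lists(chunk_list_files, out_path="list.train"):
    """Concatenate per-chunk eng.training_files.txt lists into one file."""
    lines = []
    for list_file in chunk_list_files:
        # each list file holds one .lstmf path per line
        lines.extend(Path(list_file).read_text().splitlines())
    Path(out_path).write_text("\n".join(lines) + "\n")
```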
Also worth mentioning: I did not compile tesseract on this machine; I just did `apt install tesseract-ocr`. I call the make file like this:

```shell
make --debug=v -f ~/Makefile ground-truth-eval MODEL_NAME=eng
make --debug=v -f ~/Makefile ground-truth MODEL_NAME=eng
```

After fixing up the make file to combine the list of all *.lstmf files into the list.train file, I will call:

```shell
make --debug=v -f ~/Makefile ground-truth-done MODEL_NAME=eng
```
Then, if all is well, I will actually train the neural network with:

```shell
make --debug=v training MODEL_NAME=eng
```
If anyone knows what is actually in the published traineddata files, let me know.
View of `top` when creating the .lstmf files, with 100% of the 6 CPUs allocated to the VM in use. Note: I changed `max_workers=2` to `max_workers=6` in tesstrain_utils.py. Memory doesn't get past 3% because of the smaller files.

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15342 james     20   0  305972 265224  11128 R 100.0  2.2   8:39.54 tesseract
15344 james     20   0  318372 252988  11084 R 100.0  2.1   8:26.62 tesseract
15350 james     20   0  247188 208504  11352 R 100.0  1.7   8:01.98 tesseract
15353 james     20   0  239524 200516  11296 R 100.0  1.6   6:53.79 tesseract
15343 james     20   0  245364 206412  11204 R  99.7  1.7   8:37.83 tesseract
15345 james     20   0  252596 213712  11156 R  98.7  1.7   8:22.47 tesseract
```
Another note: it takes 75 minutes to convert 10,000 lines with 32 fonts into .lstmf files using 6 CPUs, so all 18 chunks take about a day (18 × 75 minutes ≈ 22.5 hours). I have no idea how long the training itself will take.
Also, I didn't mention it before, but the modified make file will create a ground-truth-eng-eval folder for evaluation. I used 10% of the text data for this, instead of taking 10% of the .lstmf files, since originally there would be one .lstmf file per font. It seems better to train with all fonts and eval with all fonts.
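The 90:10 split by text lines can be sketched as follows (illustration only; the real split lives in the modified make file):

```python
def split_train_eval(lines, eval_fraction=0.10):
    """Send the first 90% of text lines to training, the last 10% to eval."""
    cut = int(len(lines) * (1 - eval_fraction))
    return lines[:cut], lines[cut:]

# Every font then renders both sets, so training and evaluation each
# see all fonts, rather than holding out entire fonts for evaluation.
```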
Ran the training. Got down to an 8.949 character error rate before the iterations ran out. As it was exiting, the async tester thread spewed a lot of output because it wasn't finished when the lstmtrainer cleaned up:

```
Finished! Error rate = 8.949
Empty truth string!
Can't encode transcription: '(null)' in language '(null)'
....
Empty truth string!
Can't encode transcription: '(null)' in language '(null)'
libpng error: Not a PNG file
Error in pixReadMemPng: internal png error
Error in pixReadMem: png: no pix returned
src_pix != nullptr:Error:Assert failed:in file imagedata.cpp, line 234
```
See this line in lstmtraining.cpp:

```cpp
tester_callback = NewPermanentTessCallback(&tester, &tesseract::LSTMTester::RunEvalAsync);
```

Maybe call something like `bool LSTMTester::LockIfNotRunning()` in lstmtester.cpp, in a loop while exiting.
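One way to picture that suggestion, sketched generically in Python rather than Tesseract's C++ (the class and method names here only mirror the `LockIfNotRunning()` idea and are not real Tesseract APIs): have the trainer loop on a non-blocking probe of the tester's "running" lock before tearing down shared state.

```python
import threading
import time

class AsyncTester:
    """Toy model of an async evaluator guarded by a 'running' lock."""

    def __init__(self):
        self._running = threading.Lock()

    def run_eval_async(self, work):
        self._running.acquire()          # mark busy before the thread starts
        def worker():
            try:
                work()                   # the long-running evaluation
            finally:
                self._running.release()  # mark idle when done
        threading.Thread(target=worker).start()

    def lock_if_not_running(self):
        # Non-blocking probe, mirroring a LockIfNotRunning()-style check.
        return self._running.acquire(blocking=False)

    def wait_until_idle(self):
        # Loop until the probe succeeds, i.e. no evaluation is in flight;
        # only then is it safe for the trainer to clean up shared state.
        while not self.lock_if_not_running():
            time.sleep(0.01)
        self._running.release()
```

With a `wait_until_idle()`-style call before cleanup, the trainer would no longer destroy the image data out from under the still-running eval thread.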
> As it was exiting, the async tester thread spewed much stuff as it wasn't finished when the lstmtrainer cleaned up.
Please open a new issue in the tesseract repo with details regarding this error. It will help in fixing the training related code.
> I'm trying to do the ill advised "rebuild the eng.traineddata from scratch"
I don't think anyone has had success in rebuilding eng.traineddata from scratch, though people do train for their own specific use cases with a single font.
It seems to me that you are using the larger training text from the repo but not the larger font list. For LSTM training, Ray is supposed to have used many, many more fonts. See https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/okfonts.txt
The tesstrain.py script has a much smaller list of fonts, which were used for Tesseract 3 (AFAIK).
> it seems there is a fight when cleaning out the fontconfig directory that is passed in with the "--fontconfig_tmpdir=/tmp/font_tmp8jw5kl5d". tesstrain.py should call text2image with a separate "--fontconfig_tmpdir" for each instance it fires up. Since each text2image is being called with a different font anyway, there is no sense having a common cache.

I didn't fix this bug. I just reran the fonts that failed.
It would be useful if you create a PR to address this bug.
> the modified make file will create a ground-truth-eng-eval folder for evaluation. I used 10% of the text data to do this, instead of taking 10% of the .lstmf files, since originally there would be one .lstmf file per font. Seems better to train with all fonts, and eval with all fonts.
Please create a separate PR for this feature also.
Edit: The 90:10 split works when the training is done using single-line images, which is what the tesstrain makefile is intended to work with. tesstrain.py is the Python version of the old tesstrain.sh training method from the Tesseract 3 days, which was modified for Tesseract 4.
> I have no idea how long it will take to train.
Probably days or weeks. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR and other pages in the wiki.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I'm sure this is not the right place to put this, but I will anyway. I'm trying to do the ill-advised "rebuild the eng.traineddata from scratch" and am learning as I go. So far, I did do a rebuild from scratch with just 20 or so lines of English text. I got a very overfitted neural net that ran without error but produced interesting, yet wrong, OCR results. I'm now trying to use the https://github.com/tesseract-ocr/langdata_lstm data to recreate the traineddata file from scratch. I'm doing this on a MacBook Pro, using VMware with a current Ubuntu VM. I'm going to break this up by replying to this issue multiple times with the struggles I resolved along the way.