tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

lstmeval not matching up with what I see when running Tesseract command line #283

Closed. hartjac23 closed this issue 2 years ago

hartjac23 commented 2 years ago

I'm trying to train Tesseract to handle a custom font and color scheme. eng.traineddata works pretty well for me out of the box but misses a few crucial cases that I need to parse correctly, so I prepared my training data as png and txt files. When I run "make training MODEL_NAME=light-model START_MODEL=eng TESSDATA=eng_tessdata_best PSM=7 MAX_ITERATIONS=20000", I see it start with a surprisingly high error rate (100%), considering that eng.traineddata works very well for me. I think I've pinpointed the problem to a discrepancy between tesseract output and lstmeval output. When I go into my tesseract repo and run "tesseract test.png test_output -psm 7", the image parses properly, but when I use eng.traineddata and lstmeval on that same image, I get a nonsensical output string. What am I doing wrong? It seems like LSTM training has no correlation with the output I'm seeing from tesseract.

Shreeshrii commented 2 years ago

Duplicate of issue https://github.com/tesseract-ocr/tesstrain/issues/110