tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

LSTMF files are not getting generated for some part of the dataset #287

Closed shirish100 closed 2 years ago

shirish100 commented 2 years ago

I've already trained an OCR model, and its recognition seems to be far better than that of its base model, eng. However, that model was trained with a PSM value of 13. When the same data is trained with PSM 6, the lstmf files are not generated for almost 5% of the data. Can you tell me what causes this?

TheFattestTony commented 2 years ago

Hi. Maybe PSM 6 has more difficulty than PSM 13 in identifying characters in the images, so the boxes are not built properly.

wrznr commented 2 years ago

PSM 13 uses raw, unoptimized data, while PSM 6 makes use of Tesseract-internal functionality to preprocess the line images. Maybe something is odd with those 5% that causes Tesseract to fail in preprocessing? Hard to guess without examples.
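
To collect such examples, one option is to list the line images that ended up without an lstmf file after a training run. The following is only a minimal sketch, not part of tesstrain: the function name `find_missing_lstmf` is hypothetical, and it assumes that each line image (`.tif` or `.png`) and its lstmf file sit in the same ground-truth directory and share the same base name (for example `line_001.tif` and `line_001.lstmf`). Images with double extensions such as `.bin.png` would need extra handling.

```python
from pathlib import Path

# Assumption: line images use one of these extensions.
IMAGE_SUFFIXES = {".tif", ".png"}


def find_missing_lstmf(ground_truth_dir):
    """Return the line images that have no sibling .lstmf file.

    Assumes image and lstmf file share a base name in the same directory.
    """
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        # Skip transcriptions, box files, and anything else that is not an image.
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # line_001.tif -> line_001.lstmf
        if not image.with_suffix(".lstmf").exists():
            missing.append(image)
    return missing
```

Calling `find_missing_lstmf` with the path of the ground-truth directory returns the affected images, which can then be inspected by hand or attached to the issue.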

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.