tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

My trained model can't recognize well some lines #3932

Open josef821 opened 2 years ago

josef821 commented 2 years ago

Environment

Current Behavior:

I trained my own traineddata from scratch with 800K lines. On some images like this one, it fails to recognize the lines inside the columns (with psm 6 or 4). It works well when I select each line separately. Ray's Arabic (ara) and Farsi (fas) traineddata work well on the same image (they recognize the lines correctly).

[Images: my incorrect result; original image for test (100)]

Expected Behavior:

I assumed Tesseract finds the lines by image processing and then recognizes each line as it would with psm 7. But I see that the traineddata affects how lines inside columns are detected. What should I do in the training phase to solve this problem?
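As an illustration of the "find lines first, then recognize each line" idea described above, here is a minimal sketch of line splitting by horizontal projection profile. This is a swapped-in textbook technique, not Tesseract's actual layout analysis, and it assumes an already binarized, deskewed, single-column image. The function name `find_text_lines` and the toy `page` array are hypothetical.

```python
from typing import List, Tuple


def find_text_lines(
    binary: List[List[int]], min_ink: int = 1, min_height: int = 1
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    `binary` is a 2-D list where 1 is a dark (text) pixel and 0 is background.
    A row counts as text when it has at least `min_ink` dark pixels.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the text band currently being tracked
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new band of text rows begins
        elif not has_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))  # the band ended on the previous row
            start = None
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(binary) - start >= min_height:
        lines.append((start, len(binary)))
    return lines


# Toy 6x4 "page": two text bands separated by blank rows.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(find_text_lines(page))
```

Each returned row range could then be cropped and passed to Tesseract with psm 7, which is the manual workaround reported to work in this issue.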

Suggested Fix:

I tried adjusting some parameters and found the ones below. Setting any of them makes the output better, but how does tessdata_best work without these parameters? textord_min_xheight=0 textord_really_old_xheight=1 textord_old_xheight=1

I also tried adjusting the x-height in my training data, but that did not solve the problem. Files: My example ground truth: fas-ground-truth.zip (all digits are English). My traineddata: MyTrainedData.zip
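For anyone who wants to reproduce the parameter test above, here is a minimal sketch that builds a `tesseract` command line with the three overrides passed through `-c`. The image name `page.png`, the output base `out`, and the model name `mymodel` are placeholders, not files from this issue, and the sketch assumes these parameters can be set at run time with `-c`.

```python
import shlex
from typing import Dict, List, Union


def build_tesseract_cmd(
    image: str,
    output_base: str,
    lang: str,
    psm: int,
    params: Dict[str, Union[int, str]],
) -> List[str]:
    """Build an argument list: tesseract IMAGE OUTPUTBASE -l LANG --psm N -c k=v ..."""
    cmd = ["tesseract", image, output_base, "-l", lang, "--psm", str(psm)]
    for name, value in params.items():
        cmd += ["-c", f"{name}={value}"]  # one -c flag per parameter override
    return cmd


# The three overrides reported in this issue.
overrides = {
    "textord_min_xheight": 0,
    "textord_really_old_xheight": 1,
    "textord_old_xheight": 1,
}

# "page.png", "out" and "mymodel" are hypothetical names.
cmd = build_tesseract_cmd("page.png", "out", "mymodel", 6, overrides)
print(shlex.join(cmd))

# To actually run it (requires tesseract on PATH and the model installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Running the same command with and without the `-c` flags, for both the custom model and `fas`, gives a direct comparison of line detection on the same image.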

amitdo commented 2 years ago

Use combine_tessdata to extract a traineddata file. Compare your ara/fas config file to the official one.
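A rough sketch of that comparison: unpack each traineddata with `combine_tessdata -u`, then diff the resulting config files as `name value` pairs. The helper names and the sample config text below are made up for illustration and are not the contents of the official `ara` config.

```python
from typing import Dict, List, Optional, Tuple


def unpack_cmd(traineddata: str, out_prefix: str) -> List[str]:
    """Argument list for unpacking all components, e.g. to out_prefix + 'config'."""
    return ["combine_tessdata", "-u", traineddata, out_prefix]


def parse_config(text: str) -> Dict[str, str]:
    """Parse 'name value' lines, skipping blank lines and '#' comments."""
    params: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def diff_configs(
    mine: str, official: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (my_value, official_value)} for every parameter that differs."""
    a, b = parse_config(mine), parse_config(official)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Hypothetical paths and made-up config contents, for illustration only.
print(unpack_cmd("tessdata/ara.traineddata", "unpacked/ara."))
my_config = "# custom model\ntextord_min_xheight 0\n"
official_config = "textord_min_xheight 10\nsome_other_param 1\n"
print(diff_configs(my_config, official_config))
```

A value of `None` on one side means the parameter is missing from that config file.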

josef821 commented 2 years ago

I did that. fas has no config file and works well. ara has a config file, but it works well even without it. I extracted all the files and then rebuilt the traineddata with only lstm, lstm-recoder and lstm-unicharset, removing the other files (such as version, wordlist, etc.), and it still works well. I think it's all about the training data. Do I need to go back to a previous version such as 4.00.00 and use that version, as was done for the official models?

amitdo commented 2 years ago

If the official model works well without a config file and your custom model does not, I don't know what's causing this issue and how it can be solved.

josef821 commented 2 years ago

Are the official best files done by Ray? Do you know how the images were produced? With text2image, or with a custom line image generator?

amitdo commented 2 years ago

Are the official best files done by Ray?

Yes.

For your other questions, I don't know.

ramdhan1989 commented 2 years ago

Would you mind sharing how to train Tesseract using a custom dataset?