Open josef821 opened 2 years ago
Use combine_tessdata to extract a traineddata file. Compare your ara/fas config file to the official one.
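A minimal sketch of that comparison, assuming `combine_tessdata` is on your PATH and that the output directory already exists. The helper names and the paths in the usage note are hypothetical; only the `-u` (unpack) flag comes from the tool itself.

```python
import difflib
from pathlib import Path


def unpack_command(traineddata, out_prefix):
    """Build the combine_tessdata call that unpacks every component.

    -u writes each component as <out_prefix><component>, e.g. "parts/ara.config".
    The directory part of out_prefix must exist before the command is run.
    """
    return ["combine_tessdata", "-u", traineddata, out_prefix]


def diff_configs(official, custom):
    """Return a unified diff of two extracted config files.

    A missing file is treated as empty, since some models (fas, for
    example) ship without a config component.
    """
    def read_lines(path):
        p = Path(path)
        return p.read_text(encoding="utf-8").splitlines() if p.exists() else []

    return list(
        difflib.unified_diff(
            read_lines(official), read_lines(custom), official, custom, lineterm=""
        )
    )
```

To use it, pass the result of `unpack_command("ara.traineddata", "official_parts/ara.")` to `subprocess.run(..., check=True)`, do the same for your own model, then print the lines returned by `diff_configs`.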
I did that. fas has no config file and works well. ara has a config file, but it works well even without it. I extracted all the files and then rebuilt the traineddata with only lstm, lstm-recoder and lstm-unicharset, removing the other files (such as version, wordlist, etc.), and it still works well. I imagine it's all about the training data. Do I need to go back to a previous version such as 4.00.00 and train with that version, as was done for the official models?
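For anyone who wants to repeat that stripped-down experiment, here is a rough sketch. It assumes the model was already unpacked into a directory with `combine_tessdata -u`, so the components are named `<lang>.lstm`, `<lang>.lstm-recoder` and `<lang>.lstm-unicharset`. The function names and directory layout are hypothetical.

```python
import shutil
from pathlib import Path

# The three components kept in the experiment described above.
LSTM_COMPONENTS = ("lstm", "lstm-recoder", "lstm-unicharset")


def copy_lstm_components(src_dir, dst_dir, lang):
    """Copy only the LSTM components of an unpacked model into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for component in LSTM_COMPONENTS:
        name = f"{lang}.{component}"
        shutil.copy2(src / name, dst / name)  # raises if a component is missing
        copied.append(name)
    return copied


def combine_command(dst_dir, lang):
    """Build the combine_tessdata call that packs <dst_dir>/<lang>.* into
    <dst_dir>/<lang>.traineddata (the argument is a path prefix ending in '.')."""
    return ["combine_tessdata", f"{Path(dst_dir) / lang}."]
```

Running the command returned by `combine_command` (for example with `subprocess.run`) should produce a traineddata file containing only the copied components.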
If the official model works well without a config file and your custom model does not, I don't know what's causing this issue and how it can be solved.
Were the official best files made by Ray? Do you know how the training images were produced: with text2image, or with his own line-image generator?
Are the official best files done by Ray?
Yes.
For your other questions, I don't know.
Would you mind sharing how to train Tesseract using a custom dataset?
Environment
Current Behavior:
I trained my own traineddata from scratch with 800K lines. On some images like this one, it can't recognize the lines in the columns (psm 6 or 4). It works well when I select each line separately. Ray's Arabic (ara) and Farsi (fas) traineddata work well (they recognize the lines correctly).
Original Image for test :
Expected Behavior:
I assumed Tesseract finds the lines through image processing and then recognizes each line as it would with psm 7. But I see that the traineddata affects the recognition of lines in columns. What should I do in the training phase to solve this problem?
Suggested Fix:
I tried adjusting some parameters and found the ones below. Setting any one of them improves the output, but how does tessdata_best work without these parameters? textord_min_xheight=0 textord_really_old_xheight=1 textord_old_xheight=1
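For reference, a small sketch of how such overrides can be passed per run instead of being stored in the model's config file, using the `-c var=value` option of the `tesseract` command line. The helper name, image name and output base are hypothetical; the parameter names are the ones listed above.

```python
def tesseract_command(image, out_base, lang, psm, overrides):
    """Build a tesseract CLI call with per-run parameter overrides.

    Each entry in `overrides` becomes one `-c name=value` pair.
    """
    cmd = ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]
    for name, value in overrides.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Try one parameter at a time to see which of them changes the line detection.
XHEIGHT_OVERRIDES = {
    "textord_min_xheight": 0,
    "textord_really_old_xheight": 1,
    "textord_old_xheight": 1,
}
```

For example, `tesseract_command("page.png", "out", "fas", 6, {"textord_old_xheight": 1})` can be handed to `subprocess.run(..., check=True)` to test a single override with psm 6.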
I tried adjusting the x-height values to suit my training data, but that did not solve the problem. Files: My example ground truth: fas-ground-truth.zip (all numbers are English digits). My traineddata: MyTrainedData.zip