tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Fine tuning Training related questions from forum #91

Closed Shreeshrii closed 4 years ago

Shreeshrii commented 5 years ago

https://groups.google.com/d/msg/tesseract-ocr/be4-rjvY2tQ/1bvuGMF5BwAJ by @AyushP123

Here is the link to the images for which no lstmf files were generated: https://drive.google.com/drive/folders/1VDBPB_k-oOXbWUI3zIlB3ljuyIlOkoMK?usp=sharing

Here is the Makefile that I used for generating the lstmf files: https://drive.google.com/open?id=15vvRMM03AOqoHKecEIx8NRTeU0y_kREy. I used Lorenzo's suggestion to create another target, "train-lists", to avoid creating the training and eval lists again and again.

Tesseract version: 4.1.0. I am using https://github.com/tesseract-ocr/tesstrain/blob/master/generate_line_box.py to generate the .box files. My images are in .tif format, and I am saving them using OpenCV imwrite.

I have a few questions:

The link provided by Shree, https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#tesseract-fails-to-create-lstm-files, says that .lstmf files are not generated for some images if you use the default list.train settings. Using PSM=13 helps build those lstmf files, whereas using PSM=6 or 7 ignores them. Any clues as to why that is the case? Tesseract does give me output text for these images with PSM values 6, 7 and 13.

If I use PSM 13 for generating the lstmf files used for training, will it be okay to use PSM values 6 and 7 while testing?

How can I check the contents of lstmf files to see if they contain the ground truth text info and the image data correctly?

Side questions:

lstmtraining saves the checkpoints in the following format: loss_iteration. It saves the checkpoints for a few iterations with the best loss (apart from eng_checkpoint, which contains the metadata, I guess). Is the loss calculated on the training data or the evaluation data? Is there a way to save all checkpoints?

Does lstmeval use the PSM value with which the lstmf file was generated for evaluation?

I know it's a lot of questions and doubts. Thank you for your time in helping me out.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wrznr commented 4 years ago

@AyushP123 I will try to answer some of your questions:

If I use PSM 13 for generating the lstmf files used for training, will it be okay to use PSM values 6 and 7 while testing?

From my experience, yes! PSM is a preprocessing parameter. Given the output of tesseract --help-extra,

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

it does not make much sense to use PSM=6 for training, since training is done at line level. The quality of your input images is really low. The problems with PSM=7 might stem from failures during the Tesseract-internal preprocessing. PSM=13 completely bypasses this processing step, effectively feeding the raw images to the lstmf file generation.
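To find out which line images were skipped and retry them with a different PSM, something like the following Python sketch may help. It assumes the tesstrain layout (each .tif has its .box file next to it, and the .lstmf is written alongside under the same base name) and that tesseract is on the PATH; the function names are hypothetical and not part of tesstrain.

```python
import subprocess
from pathlib import Path
from typing import List


def build_lstmf_command(image: Path, psm: int = 13) -> List[str]:
    """Build the tesseract call that writes <image base name>.lstmf."""
    output_base = image.with_suffix("")  # tesseract appends ".lstmf" itself
    return ["tesseract", str(image), str(output_base),
            "--psm", str(psm), "lstm.train"]


def missing_lstmf(image_dir: Path, pattern: str = "*.tif") -> List[Path]:
    """List line images that have no matching .lstmf file."""
    return sorted(p for p in image_dir.glob(pattern)
                  if not p.with_suffix(".lstmf").exists())


def regenerate_missing(image_dir: Path, psm: int = 13) -> List[Path]:
    """Retry lstmf generation with the given PSM; return images that still fail."""
    for image in missing_lstmf(image_dir):
        # check=False: a failing image should not abort the whole run
        subprocess.run(build_lstmf_command(image, psm), check=False)
    return missing_lstmf(image_dir)
```

Calling regenerate_missing(Path("data/ground-truth"), psm=13) (the directory name is only an example) would return the images for which even PSM=13 produced no lstmf file, which are the ones worth inspecting by hand.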

How can I check the contents of lstmf files to see if they contain the ground truth text info and the image data correctly?

Not possible as far as I know. @stweil?

Is the loss calculated on the training data or the evaluation data?

I think the iteration-wise loss is calculated on the training data. @bertsky Can you confirm?

Is there a way to save all checkpoints?

From https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#iterations-and-checkpoints and lstmtraining --help, I do not see a parameter to adjust the “checkpoint output rate”.
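A possible workaround, not a Tesseract feature, is to copy the checkpoint file from outside while training runs. The Python sketch below assumes that lstmtraining rewrites a file such as eng_checkpoint in place from time to time; the function name and backup layout are hypothetical, and a copy taken while the file is being written could be incomplete.

```python
import shutil
from pathlib import Path
from typing import Optional, Set


def snapshot_checkpoint(checkpoint: Path, backup_dir: Path,
                        seen: Set[int]) -> Optional[Path]:
    """Copy the checkpoint into backup_dir if its modification time is new."""
    if not checkpoint.exists():
        return None
    mtime = checkpoint.stat().st_mtime_ns
    if mtime in seen:
        return None  # this version was already saved
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Suffix the copy with the modification time so versions do not collide
    target = backup_dir / f"{checkpoint.name}.{mtime}"
    shutil.copy2(checkpoint, target)
    seen.add(mtime)
    return target
```

Called in a loop with a short sleep (for example once a minute) and a shared seen set, this keeps one copy per observed version of the checkpoint; versions that are overwritten between two polls are still lost.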

Does lstmeval use the PSM value with which the lstmf file was generated for evaluation?

I do not know. Maybe @egorpugin or @stweil do?

stweil commented 4 years ago

How can I check the contents of lstmf files to see if they contain the ground truth text info and the image data correctly?

It is currently not supported, but there exists a related feature request for Tesseract: https://github.com/tesseract-ocr/tesseract/issues/2669.
