tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Is it possible to train a model for multiple types of fonts? #332

Open gabriel-fsa opened 1 year ago

gabriel-fsa commented 1 year ago

I would like to know how the default model is trained. Is it trained on many real images (and if so, what order of magnitude), or are images generated automatically with different fonts?

I want to train my own model starting from the default one, using my own images, but it seems that the default model reads low-resolution images better, even as I keep adding more data. I'm training with images of varying DPI containing characters, words and phrases. Should I be doing it differently?

stweil commented 1 year ago

We don't know exactly how the standard models were trained because that was done by Google. Only some hints are available.

gabriel-fsa commented 1 year ago

But have you ever trained, or do you know of any case where, using a dataset of real images, the accuracy turned out to be greater than or equal to the default model's? This kind of information is very scarce; I would like a rough idea of the dataset size needed to get a reasonably functional model.

stweil commented 1 year ago

Yes, we have trained lots of models in the meantime. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for an example.

gabriel-fsa commented 1 year ago

> Yes, we have trained lots of models in the meantime. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for an example.

Wow, that's great! How have I not seen this before?

But I still have a few questions:

1 - I saw that you use XML in the dataset. Is the XML only used to extract the words as .png and .gt.txt files, or is the whole page image used together with the XML?

2 - What is the order of magnitude of the dataset that you guys usually use (100k, 1M, 10M)?

3 - Do you do a lot of data augmentation to improve reading?

stweil commented 1 year ago

  1. The lines must be extracted from the PAGE XML files, and the same must be done for the page images; see the example with extracted lines, and the sketch after this list. For other GT data you still have to do this extraction yourself.
  2. That depends. reichsanzeiger-gt, for example, has 119,435 lines and GT4HistOCR has 313,173 lines, but there are also some smaller data sets.
  3. No data augmentation.
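
For readers wondering what step 1 looks like in practice, here is a minimal sketch of such an extraction, assuming PAGE XML whose TextLine elements carry Coords and TextEquiv/Unicode, using lxml and Pillow. The script name, namespace URL and output naming are illustrative only and not part of tesstrain; dedicated PAGE tooling can also do this job. The output is the kind of ground truth tesstrain consumes: line images paired with .gt.txt transcriptions.

```python
#!/usr/bin/env python3
"""Hypothetical helper: cut line images and .gt.txt files out of a PAGE XML
file plus its page image, producing tesstrain-style ground truth pairs."""
import sys
from pathlib import Path

from lxml import etree
from PIL import Image

# The namespace depends on the PAGE schema version; adjust it to your files.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}


def extract_lines(page_xml: Path, page_image: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    tree = etree.parse(str(page_xml))
    image = Image.open(page_image)

    for i, line in enumerate(tree.iter(f"{{{PAGE_NS}}}TextLine")):
        coords = line.find("pc:Coords", NS)
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        if coords is None or not text.strip():
            continue  # skip lines without geometry or transcription

        # Crop the bounding box of the line polygon from the page image.
        points = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
        xs, ys = zip(*points)
        crop = image.crop((min(xs), min(ys), max(xs), max(ys)))

        # Write the line image and its transcription side by side.
        stem = out_dir / f"{page_xml.stem}_{i:04d}"
        crop.save(stem.with_suffix(".png"))
        stem.with_suffix(".gt.txt").write_text(text.strip() + "\n", encoding="utf-8")


if __name__ == "__main__":
    extract_lines(Path(sys.argv[1]), Path(sys.argv[2]), Path(sys.argv[3]))
```

The resulting directory of *.png / *.gt.txt pairs can then serve as the ground-truth directory for tesstrain's make-based training.
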
gabriel-fsa commented 1 year ago

The last question, I swear.

Does training a model with multiple fonts have much impact on accuracy? That is, many images with different fonts, always keeping the proportions between them balanced, of course.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.