tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Migrating from tesstrain.sh #307

Closed stefan6419846 closed 1 year ago

stefan6419846 commented 2 years ago

In the past I have used the tesstrain.sh approach (including tesstrain_utils.sh and language-specific.sh) to fine-tune an existing model for a specific font. Since that approach is deprecated and the corresponding Bash scripts have been removed from the tesseract repository, I wanted to try the new approach, which redirects to this repository.

Looking at this repository, it seems to provide four different types of training support:

Given this situation, how do the four different "types" interact with each other? What is the correct approach to use for training, given that I want to avoid another deprecation after a short time?

Background: As a Python developer, I considered using the approach from src/training, since it would also require less migration effort. But as this does not seem to be documented in the README, I am not sure whether this makes sense.

As an additional question: The module in src/training does not seem to be available as a regular Python package on PyPI, although it seems like it could be. Are there any plans to convert it into a library (leaving tesstrain as an entry point for standalone execution), making it easier to use in one's own code without maintaining local copies?

stweil commented 2 years ago

Initially this repository contained the Makefile and a few Python scripts which were used by the Makefile. Its main purpose was training from scanned text lines with transcriptions ("ground truth"), either from scratch or finetuning of existing models.

The tesseract repository contained a shell script (plus helper scripts) for training with generated images. All standard models for Tesseract were trained with such artificial data. Initially that shell script supported training of new models for the "legacy" (Tesseract 3) recognizer. Later it was enhanced to support training for the LSTM (Tesseract 4) recognizer. Even later it was replaced by Python code which provided the same command line interface but never implemented the "legacy" training. The shell scripts were removed in newer releases, and the Python code was moved to tesstrain.

Making a Python package which is published on PyPI is a good idea. It only has to be done ...

stefan6419846 commented 2 years ago

Thanks for explaining that the Makefile (and the files inside the root directory) is intended for "real-life" training, while src/tesstrain keeps the artificial approach alive.

In my opinion (as mostly a Tesseract user rather than a Tesseract developer), these aspects should be reflected in the directory structure as well. This might be achieved by moving all the sources for "real-life" training into their own subdirectory and maybe improving the name of the directory for the artificial approach. Then each directory could have a dedicated README, while the global README provides some basic explanations.

Regarding the Python package, I am going to open a new issue.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stefan6419846 commented 2 years ago

Is there any interest in actually cleaning up the directory structure and improving the corresponding documentation? If yes, does it make sense to track it in this issue, or should this rather be a new one?
