madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

[Feature Request] Wrapper around training #508

Closed by forzagreen 1 year ago

forzagreen commented 1 year ago

It would be great if pytesseract offered a wrapper around the training functionality of Tesseract (https://github.com/tesseract-ocr/tesstrain). Since training is not something most users do often, the option could be added as a package extra, e.g. installed with pip install pytesseract[training]
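To illustrate the idea, here is a minimal sketch of how code behind such an optional extra might be guarded at runtime. Everything in it is an assumption: pytesseract does not currently ship a training extra, and require_extra is a hypothetical helper name, not part of any existing API.

```python
import importlib
import importlib.util


def require_extra(module_name: str, extra: str, package: str = "pytesseract"):
    """Import an optional top-level dependency, or explain which extra provides it.

    Hypothetical helper: neither this function nor a `training` extra
    exists in pytesseract today.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is not installed; try: pip install {package}[{extra}]"
        )
    return importlib.import_module(module_name)
```

With this pattern, the base install stays small and users who never train a model only see the hint if they call a training function.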

stefan6419846 commented 1 year ago

What exactly are you looking for?

For training with artificial data, there is already a Python package (https://github.com/tesseract-ocr/tesstrain/tree/main/src). It is published as tesstrain on PyPI with some smaller modifications, from a fork of the original code that I currently maintain.

For training with real data, there is currently mostly just a Makefile. If I remember the discussions in some PRs correctly, one collaborator plans to move everything to Python and provide it in a single package, but nothing has come of this so far.

That being said, I see no real value in pytesseract adding functionality like this.
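For reference, driving the Makefile-based training from Python needs very little code, which is part of why a dedicated wrapper adds little. The following is only a sketch under stated assumptions: it assumes a local checkout of tesstrain and uses the Makefile's documented training target and MODEL_NAME variable. The function names are hypothetical, and any extra variables (such as START_MODEL or MAX_ITERATIONS) should be checked against the tesstrain README.

```python
import shutil
import subprocess
from typing import Dict, List, Optional


def build_training_command(
    model_name: str, variables: Optional[Dict[str, str]] = None
) -> List[str]:
    """Build the argument list for tesstrain's `make training` target.

    Hypothetical helper: `variables` are passed through as NAME=value
    pairs without validation.
    """
    if not model_name:
        raise ValueError("model_name must not be empty")
    command = ["make", "training", f"MODEL_NAME={model_name}"]
    # Sort for a deterministic command line regardless of dict order.
    for key, value in sorted((variables or {}).items()):
        command.append(f"{key}={value}")
    return command


def run_training(
    tesstrain_dir: str, model_name: str, variables: Optional[Dict[str, str]] = None
) -> int:
    """Run the training target inside a tesstrain checkout; return the exit code."""
    if shutil.which("make") is None:
        raise RuntimeError("make was not found on PATH")
    command = build_training_command(model_name, variables)
    # No shell is involved, so arguments are passed to make as-is.
    completed = subprocess.run(command, cwd=tesstrain_dir, check=False)
    return completed.returncode
```

A call such as run_training("path/to/tesstrain", "foo", {"START_MODEL": "eng"}) would then be equivalent to running make training MODEL_NAME=foo START_MODEL=eng in that directory.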

forzagreen commented 1 year ago

Hi @stefan6419846, thank you for sharing this information. I didn't know about this PyPI package and the Python code behind it.

The training documentation is confusing and scattered across three repos (tesseract, tessdoc and tesstrain). It documents only the Makefiles. It's worth documenting the Python options as well.

Thanks again. Closing this issue.

stefan6419846 commented 1 year ago

tessdoc (https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html) documents the training process with the Python package in a basic manner, without any actual references to tesstrain (or to the tesstrain.sh script, which was the old way). And yes, I have pointed out the unclear docs in the past, but fixing them does not seem to be a high priority, and my own experience is largely limited to training with artificial data.