Training Tesseract OCR for a specific document

tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Apache License 2.0


Training Tesseract OCR for a specific document #360

Open mumarsyal opened 8 months ago

mumarsyal commented 8 months ago

I have recently started learning and experimenting with Tesseract OCR. I have already trained a model for a new font using tesstrain.

Now my use case is that I want to train Tesseract 5 for a specific document attached below.

[Attached image: Ptcl_bill_0000]

I have found some articles and tutorials about training for a new font or a new language, but I couldn't find anything about training for a custom document.

Is it possible to train Tesseract 5 for my document? If yes, please give me some guidelines on how to proceed, and tell me whether I need any tools other than Tesseract itself to prepare the training data.

I have Tesseract 5 installed on Ubuntu 22.04.

stefan6419846 commented 8 months ago

Could you please elaborate on what you are trying to achieve by training a specific document (type)? What do you expect to change compared to using the existing models?

mumarsyal commented 8 months ago

Thank you for your response, @stefan6419846.

I ran the default Tesseract English model on this image and the output is very bad. So I want to train Tesseract specifically for this document to improve the output, but I don't know how to generate the training dataset (line images, *.gt.txt and box files) from these images. If you could suggest some tools for creating the dataset from these images, that would be wonderful.
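
For context, a minimal sketch of what the dataset layout looks like on disk, assuming the convention described in the tesstrain README: each line image (.png or .tif) in data/MODEL_NAME-ground-truth sits next to a single-line transcription with the same base name and the extension .gt.txt. As far as I understand, tesstrain derives the box files from these pairs itself, so only the images and transcriptions need to be prepared by hand. The helper names below (gt_path_for, write_transcription, find_unpaired) are made up for illustration and are not part of tesstrain.

```python
from pathlib import Path
from typing import List

# Line-image extensions; longer suffixes first so ".bin.png" wins over ".png".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def gt_path_for(image_path: Path) -> Path:
    """Return the .gt.txt path that belongs to a line image (hypothetical helper)."""
    name = image_path.name
    for suffix in IMAGE_SUFFIXES:
        if name.endswith(suffix):
            return image_path.with_name(name[: -len(suffix)] + ".gt.txt")
    raise ValueError(f"not a supported line image: {image_path}")


def write_transcription(image_path: Path, text: str) -> Path:
    """Write the transcription for one line image as a single line of text."""
    # Collapse all whitespace (including newlines) so the file holds one line.
    line = " ".join(text.split())
    gt_path = gt_path_for(image_path)
    gt_path.write_text(line + "\n", encoding="utf-8")
    return gt_path


def find_unpaired(ground_truth_dir: Path) -> List[Path]:
    """List line images that do not have a matching .gt.txt file yet."""
    missing = []
    for entry in sorted(ground_truth_dir.iterdir()):
        if entry.name.endswith(IMAGE_SUFFIXES) and not gt_path_for(entry).exists():
            missing.append(entry)
    return missing
```

Running find_unpaired on the ground-truth folder before training is a cheap way to spot line images that still need a transcription.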

stefan6419846 commented 8 months ago

I have not tried it, but I would argue that better preprocessing on your side might be easier and sufficient: for example, feeding Tesseract specific regions of interest (ROIs), each with appropriate preprocessing, instead of the whole page.
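
A rough sketch of the ROI idea, untested against the bill in this issue. It assumes Pillow and pytesseract are installed, that the ROI coordinates are known in advance for this document layout (any values passed in are placeholders to be measured on the real scan), and that grayscale conversion plus autocontrast is a reasonable first preprocessing step. The function names clamp_roi and ocr_rois are invented for this example.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels, the box format used by Pillow's crop().
Box = Tuple[int, int, int, int]


def clamp_roi(roi: Box, width: int, height: int) -> Box:
    """Clip an ROI to the page bounds and reject empty regions."""
    left, top, right, bottom = roi
    left, right = max(0, min(left, width)), max(0, min(right, width))
    top, bottom = max(0, min(top, height)), max(0, min(bottom, height))
    if right <= left or bottom <= top:
        raise ValueError(f"empty ROI after clamping: {roi}")
    return (left, top, right, bottom)


def ocr_rois(image_path: str, rois: List[Box], psm: int = 6) -> List[str]:
    """Run Tesseract on each ROI separately instead of on the whole page."""
    # Imported lazily so clamp_roi stays usable without these packages.
    from PIL import Image, ImageOps
    import pytesseract

    page = Image.open(image_path).convert("L")  # grayscale
    texts = []
    for roi in rois:
        crop = page.crop(clamp_roi(roi, *page.size))
        crop = ImageOps.autocontrast(crop)  # stretch contrast per region
        # --psm 6 treats the crop as a single uniform block of text;
        # --psm 7 (single text line) may suit one-line fields better.
        text = pytesseract.image_to_string(crop, lang="eng", config=f"--psm {psm}")
        texts.append(text.strip())
    return texts
```

Because each region is recognised on its own, the page segmentation mode and the preprocessing can be tuned per field (for example, a table cell versus an address block) without retraining the model.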

linxyu1 commented 7 months ago

Hello, maybe you can use jTessBoxEditor, but it is a heavy workload.