tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Question on handwriting OCR #276

Open Archilegt opened 3 years ago

Archilegt commented 3 years ago

Hello! I wanted to ask whether it is possible to train Tesseract to recognize the handwriting of one person. I have a collection of old handwritten letters by a single writer. My idea was to take some of those letters and:

A) build libraries with several examples of 1) individual characters (letters, symbols), 2) whole words, and 3) repetitive phrases;
B) feed those libraries into Tesseract and train the OCR with the source documents;
C) validate the trained model against another set of letters by the same person.

Is a workflow like the one described above already possible with Tesseract? If so, could someone please direct me to some user guide or documentation?

bertsky commented 3 years ago

Hi @Archilegt, sure, if you have suitable ground truth (i.e. training data, pairs of image and text for individual lines), you can do HTR with Tesseract, too. Modern OCR engines (based on recurrent neural networks) do not work on individual letters, but complete lines, though. See this explanation.
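Concretely, tesstrain expects ground truth as pairs of a single-line image and a plain-text transcription sharing the same basename (`*.gt.txt` next to `*.png` or `*.tif`). A minimal sketch of the layout (the file names and sample text below are made-up placeholders; the line images themselves would come from your scans):

```shell
# One directory per dataset; each text line gets a transcription file
# whose basename matches its line image (e.g. letter01_line001.png).
mkdir -p data/verhoeff-ground-truth
printf 'Sehr geehrter Herr Kollege,\n' > data/verhoeff-ground-truth/letter01_line001.gt.txt
printf 'ich danke Ihnen für Ihren Brief.\n' > data/verhoeff-ground-truth/letter01_line002.gt.txt
# list the transcriptions created so far
ls data/verhoeff-ground-truth
```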

For a workflow, have a look at https://github.com/tesseract-ocr/tesstrain/wiki/German-Konzilsprotokolle

stweil commented 3 years ago

Yes, training of handwritten text lines is possible. But Tesseract's layout recognition is still unable to recognize and separate lines of handwriting, so you will need additional tools for that or do the line separation manually.

Archilegt commented 3 years ago

Hi @bertsky and @stweil, many thanks for your help! The link to the German Konzilsprotokolle workflow is especially useful. The case I am working on is the Kurrentschrift / Sütterlinschrift of the German scientist Karl Wilhelm Verhoeff. I have scanned his letters across a few institutions and transcribed some. I have been thinking that it would be helpful to use what I have already transcribed to train Tesseract, and then use the trained model to transcribe the rest. I am happy to see that this is at least partially feasible! I will try to read more, starting with the links that you provided. Kind regards, Carlos

bertsky commented 3 years ago

We are doing something very similar currently – see here for details (in German).

Basically, if you want to follow above OCR-D based workflow (or variants of it with different preprocessing), you first need to create the PAGE-XML ground truth. There are various editing/annotation tools for that, I'd recommend LAREX right now. It assists you in editing page/line segmentation and line text. (As @stweil pointed out, it's difficult to get a good automatic line segmentation for handwriting currently – even with OCR-D means. So the GT editing will necessarily entail correcting both line segmentation and line text.)

Roughly (assuming you have an OCR-D installation):

1. Preprocessing

cd page-images
ocrd-import # creates a METS-XML, importing all images into a fileGrp OCR-D-IMG
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl wolf # or a different algorithm...
ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW -P level-of-operation page # if your images might be skewed
ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP # if your images are not cropped already
ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG -P level-of-operation page # or a segmenter better suited for handwriting like ocrd-kraken-segment with blla model
ocrd-dummy -I OCR-D-SEG -O OCR-D-GT # just copy for manual editing

2. GT transcription

docker pull bertsky/larex:dev
docker run -p 8080:8080 --name larex -v $PWD:/data bertsky/larex:dev &
chromium-browser http://localhost:8080/Larex

Now, in LAREX's library, open your image/workspace directory (it should be detected as type METS, not flat), select the last fileGrp OCR-D-GT, and edit page by page (correcting segments → lines → text).

3. Post-processing and training

ocrd-segment-extract-lines -I OCR-D-GT -O OCR-D-GT-LINES
cp -r OCR-D-GT-LINES /path/to/tesstrain/data/verhoeff-ground-truth
cd /path/to/tesstrain
make training MODEL_NAME=verhoeff 

The latter command will train from scratch, which is only expected to work well if you have transcribed thousands of lines. If not, then either mix in the data from Konzilsprotokolle, or train Konzilsprotokolle as a separate model (say htr), and subsequently use it as a pretrained start model for your small dataset (by adding START_MODEL=htr to the above line).
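For the fine-tuning variant, the invocation could look like the following sketch (the model name htr and the iteration count are placeholders; START_MODEL requires the corresponding htr.traineddata to be present in the directory given by TESSDATA):

```shell
cd /path/to/tesstrain
# Fine-tune from an existing model instead of training from scratch;
# "htr" stands in for a model trained on the Konzilsprotokolle data.
make training MODEL_NAME=verhoeff START_MODEL=htr \
     TESSDATA=/path/to/tessdata MAX_ITERATIONS=10000
```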

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Archilegt commented 2 years ago

Hi @bertsky and @stweil, maybe we could be project partners for the third phase of the Specialised Information Service Biodiversity Research (BIOfid) (https://www.biofid.de/en/). In yesterday's meeting, our text-technology partners confirmed that they are not working on text recognition. We are currently developing the ideas for the third project phase, as we will meet with the advisory board at the end of November. If you are interested, please leave a link to your institutional profile so that I can contact you via email. Kind regards, Carlos

stweil commented 2 years ago

Hi Carlos, we already work in DFG-funded projects that are part of OCR-D. See our website for contact information.
