tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Issue with Tesseract 5 Training: Multiple Lines or Drop Caps Handling #404

Open 4F2E4A2E opened 1 week ago

4F2E4A2E commented 1 week ago

How can Tesseract recognize drop caps?

I am trying to train Tesseract to recognize drop caps in paragraphs. However, Tesseract 5 training does not appear to support multiline samples, and a drop cap spans several lines of body text. How can I achieve this?

Drop caps examples: https://support.microsoft.com/en-us/office/insert-a-drop-cap-817fd19f-40fe-4b73-95e8-f3c0f5e01278

Drop caps dataset examples: drop_caps_data_set_example.zip

tesseract --version 
tesseract 5.5.0-1-g43b8d
 leptonica-1.85.1
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
 Found NEON
 Found OpenMP 201511
 Found libcurl/7.74.0 OpenSSL/1.1.1w zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3
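
One possible workaround, offered only as a sketch and not as a confirmed solution: since tesstrain pairs each single-line image with a `.gt.txt` transcription of the same base name, a paragraph with a drop cap could perhaps be split into separate single-line samples, with the drop cap treated as its own one-character line. The Python below only prepares the transcription files. The function names, the `page1_par1` stem, and the `_000` numbering scheme are hypothetical, and the matching image crops (for example `page1_par1_000.png`) would still have to be produced separately. Whether a model trained this way handles drop caps well has not been tested here.

```python
from pathlib import Path


def split_into_line_samples(drop_cap: str, body_lines: list[str]) -> list[str]:
    """Return one transcription per single-line training sample.

    The drop cap becomes its own one-character sample; each non-blank
    body line follows, stripped of surrounding whitespace.
    """
    samples = [drop_cap.strip()]
    samples += [line.strip() for line in body_lines if line.strip()]
    return samples


def write_ground_truth(samples: list[str], out_dir: Path, stem: str) -> list[Path]:
    """Write each transcription to <stem>_<index>.gt.txt inside out_dir.

    The naming scheme is an assumption for illustration; the only
    requirement relied on is that each image and its .gt.txt file
    share the same base name.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(samples):
        path = out_dir / f"{stem}_{index:03d}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For a paragraph whose drop cap is "O" and whose body lines are "nce upon a time there" and "was a king.", `split_into_line_samples("O", [...])` yields three samples, and `write_ground_truth(samples, Path("data/dropcap-ground-truth"), "page1_par1")` would write `page1_par1_000.gt.txt` through `page1_par1_002.gt.txt`. The directory name here is also just an example.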