tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Entire lines of text missing. Different missing when psm = 3, 6, 11 #1339

Open zc813 opened 6 years ago

zc813 commented 6 years ago

Environment

Current Behavior:

Brief description:

  1. One or more entire lines are missing when recognizing Tibetan.
  2. Different lines are missing when psm = 3, 6, or 11.
  3. If the image is slightly rotated or cropped, the missing line might come back.
  4. When compiling from source after the latest commit #1264 yesterday, missing lines remain the same, while recognized lines are more complete.
  5. When using a specially trained model, the lines that are missing might differ.
  6. Similar issues:
     6.1. #538 psm 3 and psm 6 skip different parts of text based on font size
     6.2. #681 LSTM: Words dropped during recognition (tried the solution, does not fix this problem)
     6.3. #1319 Page Layout Issues

Test image:

https://user-images.githubusercontent.com/15245190/36480676-2820ca12-1748-11e8-9964-7c45a86426a5.png

Recognized with tessdata_best/bod.traineddata. First 3 lines:

PSM==6
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ། ༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ༈
02 (2nd line missing)
03 (3rd line missing)

PSM==11
All lines are present, but some are fragmented and less accurate.

PSM==3
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ལམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 (2nd line missing)
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུངྱེ་ཤྲཱིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་པོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུའངྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་

PSM==3, same image but slightly rotated and cropped https://user-images.githubusercontent.com/15245190/36482692-13cdb550-174f-11e8-9378-b8617342594c.png

01 ༄༅།། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡

Another test image with its fourth line missing: https://user-images.githubusercontent.com/15245190/36481051-87d49898-1749-11e8-9fb0-cfa4334d2445.png

Do you have any idea what causes this, or any suggestion for what I should do? Thanks a lot! @Shreeshrii @amitdo

zc813 commented 6 years ago

Supplement: On the first image, the 2nd line is still ignored even if I mask the 1st or the 3rd line (without cropping or resizing the image).

Shreeshrii commented 6 years ago

Tibetan1-Line2.txt tibetan1-line2

གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་

Shreeshrii commented 6 years ago

I think the page segmentation is not working because the text lines are too close and the diacritics are merging with previous/next line.

There is a config variable which can be tried for this -

# extra space to allow for diacritics above and below the characters
textord_min_linesize 2.5

It works well with your cropped first image - I think the slight white border around the image helps too.

I used the following command:

tesseract Tibetan1.png Tibetan -l bod --psm 6 -c textord_min_linesize=2.5

Here is the output:

༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་ཕོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་
བཅད་པ་བརྒྱད་པ་འཆད་པ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཁྱིམ་པ་རྗེས་སུ་འཛིན་པ་སོ་སོར་ཐར་པའི་སྡོམ་པའི་ཆོ་ག་ཐར་ལམ་རབ་གསལ་འགྱུར་མེད་རྡོ་རྗེའི་གསུང་ལྡེབ།༡ ༢ འཕགས་པ་
གཞི་ཐམས་ཅད་ཡོད་པར་སྨྲ་བའི་དགེ་ཚུལ་གྱི་ཚིག་ལེའུར་བྱས་པ་ལྡེབ།° དགེ་སློང་ཕའི་སོ་སོར་ཐར་པའི་མདོ་རྩ་བ་ལྡེབ།༣༦ རབ་བྱུང་གི་གཞིའི་ཆོ་ག་རིན་ཆེན་ཐེམ་སྐས་དྷརྨ་ཤྲིའི་
གསུང་ལྡེབ།༤༣ ལོ་མིང་རེའུ་མིག་ལྡེབ།༡ སྡོམ་པ་འབུལ་ཆོག་ལྡེབ།༢ བསླབ་པ་ཡོངས་སུ་སྦྱོང་བ་གཞི་གསུམ་གྱི་ཆོ་ག་ཐར་གླིང་དུ་བགྲོད་པའི་གྲུ་ཆེན་དུས་བརྗོད་རེའུ་མིག་བཅས་


If most of your text is like this, the variable should be added to the bod.config file; otherwise, just pass the config variable as part of the command.

zc813 commented 6 years ago

Hi @Shreeshrii, thanks for your kind reply! I tried your solution. Actually, the cropped picture worked even without this configuration. With the uncropped picture, setting this config variable worked only when textord_min_linesize is exactly 0.82. Neither 0.81 nor 0.83 works. And this value depends on the picture. Do you have any idea? Thanks very much!

zc813 commented 6 years ago

For this picture, textord_min_linesize has to be set to a number between 0.96 and 0.99. Neither smaller nor greater values work:

1_page_075

Greater values cause incomplete results, while smaller values lead to wrong recognition.

Again, thanks a lot!
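
Since the working value appears to depend on the image, one way to explore it is to sweep textord_min_linesize over a range and compare how many non-empty lines each run returns. The following is only a minimal sketch, assuming tesseract is on PATH; the helper names (sweep_values, build_command, count_lines, run_sweep) and the default range are made up for illustration and are not part of Tesseract.

```python
import subprocess


def sweep_values(start, stop, step):
    """Return evenly spaced values from start to stop (inclusive), rounded to 2 places."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def build_command(image, lang, psm, min_linesize):
    """Build a tesseract command line that prints the OCR text to stdout."""
    return [
        "tesseract", image, "stdout",
        "-l", lang,
        "--psm", str(psm),
        "-c", f"textord_min_linesize={min_linesize}",
    ]


def count_lines(text):
    """Count non-empty lines in OCR output."""
    return sum(1 for line in text.splitlines() if line.strip())


def run_sweep(image, lang="bod", psm=6, values=None):
    """Run tesseract once per value and map each value to its recognized line count."""
    results = {}
    for value in values or sweep_values(0.8, 2.5, 0.1):
        proc = subprocess.run(
            build_command(image, lang, psm, value),
            capture_output=True, text=True, check=False,
        )
        results[value] = count_lines(proc.stdout)
    return results
```

Calling run_sweep("Tibetan1.png") would give a rough picture of which values keep all lines for one image; a line count alone says nothing about recognition accuracy, so the text still has to be checked by eye.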

Shreeshrii commented 6 years ago

And this value depends on the picture. Do you have any idea?

Sorry. Don't know how the page layout analysis works.

Shreeshrii commented 6 years ago

@zdenop Label with

4.0x Accuracy

yurytch commented 6 years ago

I don't know if this is the right place, but I get missing words, and even several words at once, in German text when processing with tessdata_best. I can provide the scan in question if necessary; it's in the public domain.

Shreeshrii commented 6 years ago

@yurytch Yes, please provide the image so that we can test with the latest version.

yurytch commented 6 years ago

Fine, only I can't find where to attach files here, so I've put the image and the text OCR'ed from it on the cloud. Tesseract was built from source from a git checkout of 2018-01-06 and used with tessdata_best. The '***' in the .TXT were added by hand, to mark where letters or complete words were dropped without any indication from tesseract. https://yadi.sk/d/G2scDhj53TsU52 https://yadi.sk/i/JTh7Ixnv3TsTxY

amitdo commented 6 years ago

Please try with the latest commit.

Shreeshrii commented 6 years ago

@yurytch The image is a 6MB+ jp2 file, yet the image is not very clear. I converted it to png for testing, since I haven't built leptonica with jp2 support.

@amitdo Tried with latest commit from yesterday. OCRed files attached.

ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_best-deu-1.txt

ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_fast-deu-1.txt

yurytch commented 6 years ago

@Shreeshrii Yes, thank you very much. Have completed the test run with the today's git C/O right now (with the JP2-enabled leptonica, as before). Those drop-outs are gone now. 'My' results are different from 'yours' (was that to be expected?), not always to the good. For the reference: https://yadi.sk/i/bAykSKIW3TsjBS

Shreeshrii commented 6 years ago

'My' results are different from 'yours' (was that to be expected?), not always to the good.

Possible, because I used a different version of image. Though there are too many differences.

I will install jp2 library and try again.

Shreeshrii commented 6 years ago

@yurytch I am attaching the png version that I used. zeitschriftfuerhistorischewaffenkunde5_0122

Shreeshrii commented 6 years ago

I will install jp2 library and try again.

Not successful in building leptonica with jp2. So trying to install from ppa ...

@AlexanderP Your ppa has both openjpeg2 and leptonlib. I installed liblept and libleptonica-dev from there as well as libopenjp2-7 libopenjp2-7-dev.

But tesseract is not showing jp2 support. What else do I need to do for it?


tesseract 4.0.0-beta.1-59-g2cc4
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0

 Found AVX
 Found SSE

Shreeshrii commented 6 years ago

@stweil Please see https://github.com/tesseract-ocr/tesseract/issues/1339#issuecomment-377206627

Is it possible to get different results from same traineddata and image?

stweil commented 6 years ago

I would not say no as I can imagine reasons why the same Tesseract version with same traineddata could give different results for the same image.

If we can confirm such differences, that is clearly something which needs to get fixed. Results must be reproducible.

Shreeshrii commented 6 years ago

If you have leptonica with jp2 support please try with the image linked in https://github.com/tesseract-ocr/tesseract/issues/1339#comment-377181900

And compare your result to

https://yadi.sk/i/bAykSKIW3TsjBS

I had converted the image to png, so it is not the exact same image. Those results with deu from best and fast, as well as the image, are also in this thread.

stweil commented 6 years ago

png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later.

AlexanderP commented 6 years ago

@Shreeshrii As I understand it, openjpeg version 2.3 or higher is needed.

yurytch commented 6 years ago

Hey guys, the tesseract versions on MY side WERE different. I was following the initial advice by @Shreeshrii. The 1st text I posted here was generated in 2018-01-06_git tesseract, the 2nd one - in today's_git tesseract. Leptonica 1.74.1, same version in both cases.

yurytch commented 6 years ago

Oh, I see. I've posted only the results with today's git and JP2. I'm now posting the results for today's git and @Shreeshrii's PNG. https://yadi.sk/i/Xf0LOw2g3TtJwP

Shreeshrii commented 6 years ago

@yurytch Please confirm which tessdata you used. Tessdata_fast?

Also was it with default psm?

Shreeshrii commented 6 years ago

Alex,

Please see https://github.com/DanBloomberg/leptonica/blob/master/src/environ.h

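/*-------------------------------------------------------------------------*
 * Leptonica supports OpenJPEG 2.0+.  If you have a version of openjpeg    *
 * (HAVE_LIBJP2K == 1) that is >= 2.0, set the path to the openjpeg.h      *
 * header in angle brackets here.                                          *
 *-------------------------------------------------------------------------*/
#define  LIBJP2K_HEADER   <openjpeg-2.3/openjpeg.h>

Though setting HAVE_LIBJP2K to 1 in environ.h did not work for me when I tried to build with it.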

Shreeshrii commented 6 years ago

Sorry, I see in the change log now

Modified jpeg2000 header to use openjpeg 2.3.

On Thu 29 Mar, 2018, 9:13 PM ShreeDevi Kumar, shreeshrii@gmail.com wrote:

Alex,

Please see https://github.com/DanBloomberg/leptonica/blob/master/src/environ.h

/*-------------------------------------------------------------------------*
 * Leptonica supports OpenJPEG 2.0+.  If you have a version of openjpeg    *
 * (HAVE_LIBJP2K == 1) that is >= 2.0, set the path to the openjpeg.h      *
 * header in angle brackets here.                                          *
 *-------------------------------------------------------------------------*/
#define  LIBJP2K_HEADER   <openjpeg-2.3/openjpeg.h>

Though setting HAVE_LIBJP2K to 1 in environ.h did not work for me when I tried to build with it.

yurytch commented 6 years ago

@Shreeshrii yes, right, the default PSM and tessdata_best. I get poor-ish results with tessdata_fast, so don't even keep it on disk. Linux 64 bit, FWIW.

AlexanderP commented 6 years ago

@Shreeshrii I compiled leptonica with cmake. jpeg2000 is not supported, although according to the build log it is picked up.

Shreeshrii commented 6 years ago

@AlexanderP Thank you for following up.

I went back to autotools because the cmake version was too slow on my pc (I run WSL on windows 10).

I built openjpeg from source and leptonica build was able to find it.

tesseract -v
tesseract 4.0.0-beta.1-64-gd284
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.3.0
 Found AVX
 Found SSE

I am finding quite a bit of difference between the recognized text on my PC and the results from @yurytch, using the same traineddata and the same images with the same tesseract code. However, the hardware, OS, and leptonica version may be different. The locale may also be different.

I am hoping that @stweil will be able to investigate and figure it out.

AlexanderP commented 6 years ago

Does it make sense to add openjpeg-2.3 to the PPA?

Shreeshrii commented 6 years ago

Yes, I think it will be helpful to add openjpeg-2.3 to PPA. Thanks!

Shreeshrii commented 6 years ago

Opened a new issue for the recent discussion on this thread.

Original issue of complete lines being dropped during recognition still exists.

stweil commented 6 years ago

png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later.

Sorry that it took some time. Now I have done some tests with those images.

Neither the jp2 nor the png image includes resolution information. That explains why earlier versions of Tesseract (which assumed 70 DPI before 2017-09-08) get different results than newer versions (which estimate a resolution of 179 DPI). Neither 70 DPI nor 179 DPI is correct for the test image, so I expect that the result could be better with the right resolution.

I get the same results from the original jp2 image and from a png image made from the jp2 by using convert. That confirms my earlier statement that the image format should not make a difference when both formats are lossless.

@Shreeshrii, your png image differs from mine:

$ ls -l *png
-rw-r--r-- 1 stweil stweil  812518 Apr 28 10:29 ZeitschriftFuerHistorischeWaffenkunde5_0122.jp2.png
-rw-r--r-- 1 stweil stweil 1576022 Apr 28 08:55 ZeitschriftFuerHistorischeWaffenkunde5_0122.png
$ file *png
ZeitschriftFuerHistorischeWaffenkunde5_0122.jp2.png: PNG image data, 1335 x 1602, 8-bit grayscale, non-interlaced
ZeitschriftFuerHistorischeWaffenkunde5_0122.png:     PNG image data, 1335 x 1602, 8-bit/color RGB, non-interlaced

I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting result is that Tesseract evidently gets different results from grayscale and color images, even when both look exactly the same. This needs more investigation.
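
The two properties discussed here (grayscale versus RGB, and whether the file carries resolution metadata) can be read straight from the PNG header without any imaging library. The sketch below is an illustration only: the function names iter_chunks and describe_png are made up, chunk CRCs are not verified, and it handles PNG only, not jp2.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOR_TYPES = {0: "grayscale", 2: "RGB", 3: "palette",
               4: "grayscale+alpha", 6: "RGB+alpha"}


def iter_chunks(data):
    """Yield (chunk_type, payload) pairs from raw PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC


def describe_png(data):
    """Return size, bit depth, color type and DPI (None without a metric pHYs chunk)."""
    info = {"dpi": None}
    for ctype, payload in iter_chunks(data):
        if ctype == b"IHDR":
            width, height, depth, color = struct.unpack(">IIBB", payload[:10])
            info.update(width=width, height=height, bit_depth=depth,
                        color_type=COLOR_TYPES.get(color, "unknown"))
        elif ctype == b"pHYs":
            ppu_x, ppu_y, unit = struct.unpack(">IIB", payload[:9])
            if unit == 1:  # unit 1 means pixels per metre
                info["dpi"] = (round(ppu_x * 0.0254), round(ppu_y * 0.0254))
        elif ctype == b"IDAT":
            break  # IHDR and pHYs always precede the image data
    return info
```

For a file on disk, describe_png(open(path, "rb").read()) returns a dict; a "dpi" of None corresponds to the case described above, where Tesseract has to fall back to estimating the resolution.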

@yurytch, I cannot reproduce your results. Could you try the latest Debian packages for tesseract-ocr? In my tests, those Debian packages and latest Git master show the same results.

Shreeshrii commented 6 years ago

@stweil Could different results be because of different versions of leptonica library?

https://github.com/tesseract-ocr/tesseract/issues/1339#issuecomment-377267899 @yurytch is using 1.74.1 , I am using more recent versions.

stweil commented 6 years ago

I used 1.75.3-4 from Debian testing, but can repeat the test with an older version.

yurytch commented 6 years ago

Answering several days' worth of msgs at once, the results of OCR'ing the initial JP2: https://yadi.sk/d/G2scDhj53TsU52

with fresh versions: Leptonica 1.75.3, Tesseract Open Source OCR Engine v4.0.0-beta.1-203-g45bb

are here: https://yadi.sk/i/7JBcJVzY3Uxw7L

There are no obvious dropouts here, however dropouts (1-3 letters at a time) still happen, with that version, too. Is it possible to make tesseract output some kind of placeholder or tag into OCR'ed text?

Same for the bogus letters introduced into the text, like in the example, 'güuülden' or 'fruünflich'. Could tesseract be made to output anything not recognised reliably enough, as some kind of 'empty glyph' or tag?
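
As a partial approximation of such a tag, Tesseract's TSV output (the tsv config, e.g. tesseract image.png out tsv) carries a per-word conf column, so low-confidence words can be replaced after the fact. This is a sketch under assumptions: mark_uncertain, the threshold of 60 and the "[?]" placeholder are arbitrary illustrative choices, and letters that were dropped entirely leave no row behind, so they cannot be flagged this way.

```python
import csv
import io


def mark_uncertain(tsv_text, threshold=60.0, placeholder="[?]"):
    """Rebuild text lines from Tesseract TSV, replacing low-confidence words."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = {}
    for row in reader:
        if row["level"] != "5":  # level 5 rows describe single words
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        confident = float(row["conf"]) >= threshold
        word = (row["text"] or "") if confident else placeholder
        lines.setdefault(key, []).append(word)
    # Keys are unique, so sorting orders lines by page, block, paragraph, line.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

The function takes the contents of the .tsv file as a string and returns one string per recognized text line.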

Format is irrelevant for the result, and DPI setting seems to be ignored by tesseract. I've put the density field into TIFF file by ImageMagick's 'convert', and it is there, verified by 'identify' tool, but tesseract still goes 'estimating the resolution'.
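
If the embedded density is not picked up, released Tesseract 4.x and later versions accept the resolution on the command line via --dpi; whether the development build used in this comment already had that option is not confirmed here. A minimal sketch of passing it explicitly follows; ocr_command and run_ocr are hypothetical helper names, and 300 is only an example value, not the known resolution of this scan.

```python
import subprocess


def ocr_command(image, lang="deu", dpi=None, psm=None):
    """Build a tesseract command; --dpi is added only when a resolution is given."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if dpi is not None:
        if dpi <= 0:
            raise ValueError("dpi must be positive")
        cmd += ["--dpi", str(int(dpi))]
    return cmd


def run_ocr(image, **kwargs):
    """Run tesseract and return the recognized text (requires tesseract on PATH)."""
    proc = subprocess.run(ocr_command(image, **kwargs),
                          capture_output=True, text=True, check=True)
    return proc.stdout
```

Comparing run_ocr("page.tif") with run_ocr("page.tif", dpi=300) would show whether the resolution hint changes the output for a given scan.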

Shreeshrii commented 6 years ago

See report regarding missing line in Persian

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/dG3hcHQrlr8/cTksOnoxCgAJ

bhaveshvyas007 commented 4 years ago

This is very annoying... For the same image document format, a few lines are sometimes missing, at irregular intervals. Mine is simple English. Can anyone tell me what causes the missing lines?

zdenop commented 4 years ago

@bhaveshvyas007: no, we cannot - we do not have crystal balls to determine your image, tesseract version, the command you run, the OS you use and other required information.

bhaveshvyas007 commented 4 years ago

@zdenop @amitdo I can't share the image, but below are the details: (Since this is an open issue, I didn't worry about sharing the version info, sorry)

Ubuntu version : Ubuntu 18.04.3 LTS

Tesseract Version

tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

Command I run : tesseract public/uploads/1579701431490/sample.pdf.jpg public/uploads/1579701431490/result -l engfast --psm 6 hocr

Btw, I even tried -l eng, and psm 3 or 4, but the same line is always missing.

Btw, that missing line is just an address line with city, state and zip : image

image

bhaveshvyas007 commented 4 years ago

@Shreeshrii @zdenop

Samples : I have added a few dummy samples containing the jpeg files and the hocr result here

Command : tesseract ./1582890068747/Barack,_Obama.pdf.jpg ./1582890068747/result -l engfast --psm 6 hocr

Tesseract Version Details :

tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

OS Version : Ubuntu 18.04.3 LTS

Issue : If you take a look at the Patient Address line, it is missing from the hocr output, i.e.: image

Sometimes the line which says Account #: is also missing, i.e.: image

Btw, these issues happen only with a few samples; it works fine for most of them.

What I tried : I tried using language eng and eng-best, but with no success. I can't change the psm mode because --psm 6 is working fine for most of the samples, and I don't want to change the parser code.

Question : Can anyone figure out why such big lines are totally missing from the output? Any solution?

kbrajwani commented 3 years ago

Hey guys, have you found any solution? I am facing the same issue. System configuration:

tesseract 5.0.0-alpha-20210401-71-g2be89
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

You can see the image attached below for reference. https://user-images.githubusercontent.com/29722986/116217982-b67c4680-a767-11eb-8005-f7bb0c8f55c3.png

If I use psm 6, the first line (date prepared) is missing, and the policy numbers are not correct. If I use psm 11, the PO box line is missing, and spaces appear inside words, for example:

AMERICAN AGENCY === ER ICAN AGENCY
94-23 JAMAICA === 94-23 JAMA ICA

Can you guys tell me how I can solve this issue?

amitdo commented 3 years ago

@stweil,

I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting result is that obviously Tesseract gets different results from grayscale and color images, even when both look exactly the same. This needs more investigations.

Most likely this is caused by Tesseract's Otsu thresholding.

amitdo commented 3 years ago

@kbrajwani

Let's say if i am using psm 6

From the command line help:

6 Assume a single uniform block of text.

It does not make sense to use psm 6 on your image, which has multiple blocks. You are misleading Tesseract.

Did you try to use psm 3?

kbrajwani commented 3 years ago

Hey @amitdo, psm 3 works great. I don't remember why I changed the default psm. Thanks

kbrajwani commented 3 years ago

https://user-images.githubusercontent.com/29722986/116717573-a6c66180-a9f6-11eb-85af-1d364de7e3ee.png

Hey @amitdo, please look at this image: lines are missing with psm 3. Possibly the issue is that the lines are connected.

amitdo commented 3 years ago

Possibly the issue is that the lines are connected.

You are right in your assumption. Tesseract's layout analysis can't cope with connected lines.

kbrajwani commented 3 years ago

Thanks for confirming. Can you tell me whether there is any way we can handle this?

amitdo commented 3 years ago

If you can find another tool that will correctly segment the lines, you can then run Tesseract on each line.
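
To illustrate the idea of segmenting first and recognizing line by line, here is a minimal sketch of the simplest possible line splitter, a horizontal projection profile over a binarized image. It is an assumption-laden toy: find_text_rows and crop_rows are made-up names, the image is a plain list of 0/1 rows, and this approach only works when lines are separated by at least one blank row, so it would not solve the connected-lines case discussed here.

```python
def find_text_rows(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    `binary` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    ranges, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            ranges.append((start, y))  # the current text line ends
            start = None
    if start is not None:
        ranges.append((start, len(binary)))
    return ranges


def crop_rows(binary, ranges, pad=1):
    """Cut one sub-image per row range, padded with neighbouring rows."""
    height = len(binary)
    return [binary[max(0, top - pad):min(height, bottom + pad)]
            for top, bottom in ranges]
```

Each crop could then be saved as its own image and recognized with --psm 7, which tells Tesseract to treat the image as a single text line.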

kbrajwani commented 3 years ago

@amitdo Hey, can't we train Tesseract to identify the line bounding boxes? After all, we provide line-level bounding box information when training Tesseract on our own images. Thanks