Open zc813 opened 6 years ago
Supplement: on the first image (neither cropped nor resized), the 2nd line is still ignored even when I mask the 1st or the 3rd line.
གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
I think the page segmentation is not working because the text lines are too close and the diacritics are merging with previous/next line.
There is a config variable which can be tried for this -
# extra space to allow for diacritics above and below the characters
textord_min_linesize 2.5
It works well with your cropped first image - I think the slight white border around the image helps too.
I used the following command:
tesseract Tibetan1.png Tibetan -l bod --psm 6 -c textord_min_linesize=2.5
Here is the output:
༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་ཕོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་
བཅད་པ་བརྒྱད་པ་འཆད་པ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཁྱིམ་པ་རྗེས་སུ་འཛིན་པ་སོ་སོར་ཐར་པའི་སྡོམ་པའི་ཆོ་ག་ཐར་ལམ་རབ་གསལ་འགྱུར་མེད་རྡོ་རྗེའི་གསུང་ལྡེབ།༡ ༢ འཕགས་པ་
གཞི་ཐམས་ཅད་ཡོད་པར་སྨྲ་བའི་དགེ་ཚུལ་གྱི་ཚིག་ལེའུར་བྱས་པ་ལྡེབ།° དགེ་སློང་ཕའི་སོ་སོར་ཐར་པའི་མདོ་རྩ་བ་ལྡེབ།༣༦ རབ་བྱུང་གི་གཞིའི་ཆོ་ག་རིན་ཆེན་ཐེམ་སྐས་དྷརྨ་ཤྲིའི་
གསུང་ལྡེབ།༤༣ ལོ་མིང་རེའུ་མིག་ལྡེབ།༡ སྡོམ་པ་འབུལ་ཆོག་ལྡེབ།༢ བསླབ་པ་ཡོངས་སུ་སྦྱོང་བ་གཞི་གསུམ་གྱི་ཆོ་ག་ཐར་གླིང་དུ་བགྲོད་པའི་གྲུ་ཆེན་དུས་བརྗོད་རེའུ་མིག་བཅས་
If most of your text looks like this, add the variable to the bod.config file; otherwise just pass it with -c on the command line.
Hi, @Shreeshrii Thanks for your kind reply! I tried your solution. Actually, the cropped picture worked even without this configuration.
When using the uncropped picture, setting this config variable worked only when textord_min_linesize is exactly 0.82. Neither 0.81 nor 0.83 works. And this value depends on the picture.
Do you have any idea? Thanks very much!
For this picture, textord_min_linesize has to be set to a number between 0.96 and 0.99. Neither smaller nor greater values work:
1_page_075
Greater values cause incomplete results, while smaller values lead to wrong recognition.
Again, thanks a lot!
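Since the working value seems to differ per image, one way to look for it is to sweep textord_min_linesize over a range and compare the outputs. The following is only a sketch: the function name, the output naming scheme and the 0.80 to 1.00 range are my own assumptions; the tesseract options are the ones used earlier in this thread.

```python
import shlex


def sweep_commands(image, out_prefix, lang, psm, start, stop, step):
    """Build one tesseract command per textord_min_linesize value.

    Returns a list of argument lists; nothing is executed here.
    """
    commands = []
    steps = int(round((stop - start) / step))
    for i in range(steps + 1):
        value = round(start + i * step, 2)
        commands.append([
            "tesseract", image, f"{out_prefix}_{value:.2f}",
            "-l", lang, "--psm", str(psm),
            "-c", f"textord_min_linesize={value:.2f}",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed or piped to a shell.
    for cmd in sweep_commands("Tibetan1.png", "Tibetan", "bod", 6, 0.80, 1.00, 0.01):
        print(shlex.join(cmd))
```

Each command can then be run with subprocess.run, and the number of non-empty lines in every resulting .txt file compared against the number of text lines visible on the page.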
And this value depends on the picture. Do you have any idea?
Sorry. Don't know how the page layout analysis works.
@zdenop Label with
4.0x Accuracy
I don't know if this is the right place, but when processing German text with tessdata_best I get missing words, sometimes several in a row. I can provide the scan in question if necessary; it's in the public domain.
@yurytch Yes, please provide the image so that we can test with the latest version.
Fine, but I can't find where to attach files here, so I've put the image and the text OCR'ed from it in the cloud. Tesseract was built from source from the git checkout of 2018-01-06 and used with tessdata_best. The '***' in the .TXT were added by hand to mark where letters or complete words were dropped without any indication from tesseract. https://yadi.sk/d/G2scDhj53TsU52 https://yadi.sk/i/JTh7Ixnv3TsTxY
Please try with the latest commit.
@yurytch The image is a 6MB+ jp2 file, yet it is not very clear. I converted it to png for testing, since I haven't built leptonica with jp2 support.
@amitdo Tried with latest commit from yesterday. OCRed files attached.
ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_best-deu-1.txt
ZeitschriftFuerHistorischeWaffenkunde5_0122-tessdata_fast-deu-1.txt
@Shreeshrii Yes, thank you very much. I have just completed a test run with today's git checkout (with the JP2-enabled leptonica, as before). Those drop-outs are gone now. 'My' results are different from 'yours' (was that to be expected?), not always to the good. For reference: https://yadi.sk/i/bAykSKIW3TsjBS
'My' results are different from 'yours' (was that to be expected?), not always to the good.
Possible, because I used a different version of image. Though there are too many differences.
I will install jp2 library and try again.
@yurytch I am attaching the png version that I used.
I will install jp2 library and try again.
Not successful in building leptonica with jp2. So trying to install from ppa ...
@AlexanderP Your ppa has both openjpeg2 and leptonlib. I installed liblept and libleptonica-dev from there as well as libopenjp2-7 libopenjp2-7-dev.
But tesseract is not showing jp2 support. What else do I need to do for it?
tesseract 4.0.0-beta.1-59-g2cc4
leptonica-1.75.3
libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0
Found AVX
Found SSE
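A quick way to check for jp2 support is to look at the image library line of `tesseract -v`. The sketch below assumes that line lists libopenjp2 when leptonica was built with JPEG 2000 support, which is what a later version output in this thread shows; the function names are made up for illustration.

```python
def image_libraries(version_text):
    """Parse the ' : '-separated library line of `tesseract -v` into a dict."""
    libs = {}
    for line in version_text.splitlines():
        if " : " not in line:
            continue  # not the image library line
        for entry in line.split(" : "):
            fields = entry.split()
            if len(fields) >= 2:
                libs[fields[0]] = fields[1]  # e.g. "libpng" -> "1.2.50"
    return libs


def has_jp2_support(version_text):
    """True if libopenjp2 appears among the listed image libraries."""
    return "libopenjp2" in image_libraries(version_text)
```

The text can be captured with subprocess.run(["tesseract", "-v"], capture_output=True, text=True); depending on the build, the version information may arrive on stdout or stderr, so it is safest to check both.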
@stweil Please see https://github.com/tesseract-ocr/tesseract/issues/1339#issuecomment-377206627
Is it possible to get different results from same traineddata and image?
I would not say no as I can imagine reasons why the same Tesseract version with same traineddata could give different results for the same image.
If we can confirm such differences, that is clearly something which needs to get fixed. Results must be reproducible.
If you have leptonica with jp2 support please try with the image linked in https://github.com/tesseract-ocr/tesseract/issues/1339#comment-377181900
And compare your result to
https://yadi.sk/i/bAykSKIW3TsjBS
I had converted the image to png, so it is not exactly the same image. Those results with deu from best and fast, as well as the image, are also in this thread.
png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later.
@Shreeshrii As I understand it, openjpeg version 2.3 or higher is needed.
Hey guys, the tesseract versions on MY side WERE different. I was following the initial advice by @Shreeshrii. The 1st text I posted here was generated in 2018-01-06_git tesseract, the 2nd one - in today's_git tesseract. Leptonica 1.74.1, same version in both cases.
Oh, I see. I've posted only the results with today's git and JP2. I'm now posting the results for today's git and @Shreeshrii's PNG. https://yadi.sk/i/Xf0LOw2g3TtJwP
@yurytch Please confirm which tessdata you used. Tessdata_fast?
Also was it with default psm?
Alex,
Please see https://github.com/DanBloomberg/leptonica/blob/master/src/environ.h
/*-------------------------------------------------------------------------*
 *  Leptonica supports OpenJPEG 2.0+.  If you have a version of openjpeg   *
 *  (HAVE_LIBJP2K == 1) that is >= 2.0, set the path to the openjpeg.h     *
 *  header in angle brackets here.                                         *
 *-------------------------------------------------------------------------*/
#define LIBJP2K_HEADER <openjpeg-2.3/openjpeg.h>
Though, setting HAVE_LIBJP2K = 1 in environ.h did not work for me when I tried to build with it.
Sorry, I see in the change log now:
Modified jpeg2000 header to use openjpeg 2.3.
@Shreeshrii yes, right, the default PSM and tessdata_best. I get poor-ish results with tessdata_fast, so don't even keep it on disk. Linux 64 bit, FWIW.
@Shreeshrii I built leptonica with cmake. jpeg2000 is still not supported, although the build log shows it being picked up.
@AlexanderP Thank you for following up.
I went back to autotools because the cmake version was too slow on my pc (I run WSL on windows 10).
I built openjpeg from source and leptonica build was able to find it.
tesseract -v
tesseract 4.0.0-beta.1-64-gd284
leptonica-1.76.0
libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.3.0
Found AVX
Found SSE
I am finding quite a bit of difference between the text recognized on my PC and the text recognized by @yurytch, using the same traineddata, the same images and the same tesseract code. However, the hardware, OS and leptonica version may be different. The locale may also be different.
I am hoping that @stweil will be able to investigate and figure it out.
Would it make sense to add openjpeg-2.3 to the PPA?
Yes, I think it will be helpful to add openjpeg-2.3 to PPA. Thanks!
Opened a new issue for the recent discussion on this thread.
Original issue of complete lines being dropped during recognition still exists.
png is lossless, so it should be the same image and make no difference in the OCR result. I'll try the example myself later.
Sorry that it took some time. Now I have done some tests with those images.
Neither the jp2 nor the png image includes resolution information. That explains why earlier versions of Tesseract (which assumed 70 DPI before 2017-09-08) get different results from newer versions (which estimate a resolution of 179 DPI). Neither 70 DPI nor 179 DPI is correct for the test image, so I expect that the result could be better with the right resolution.
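If the physical size of the scanned page is known, the resolution can be computed and passed explicitly rather than left to the estimate. This is a sketch under assumptions: it relies on the `--dpi` command-line option, which as far as I know is present in Tesseract 4.0.0 and later but may be missing from older beta builds, and the helper names are hypothetical.

```python
def estimate_dpi(width_px, page_width_mm):
    """Resolution implied by the pixel width and the physical page width."""
    inches = page_width_mm / 25.4
    return round(width_px / inches)


def ocr_command(image, out_base, lang, dpi=None):
    """Build a tesseract command, adding --dpi only when a value is given."""
    cmd = ["tesseract", image, out_base, "-l", lang]
    if dpi is not None:
        cmd += ["--dpi", str(int(dpi))]
    return cmd
```

For the 1335 x 1602 test image, estimate_dpi(1335, page_width_mm) would need the real width of the printed page, which is not stated in this thread.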
I get the same results from the original jp2 image and from a png image made from the jp2 by using convert. That confirms my earlier statement that the image format should not make a difference when both formats are lossless.
@Shreeshrii, your png image differs from mine:
$ ls -l *png
-rw-r--r-- 1 stweil stweil 812518 Apr 28 10:29 ZeitschriftFuerHistorischeWaffenkunde5_0122.jp2.png
-rw-r--r-- 1 stweil stweil 1576022 Apr 28 08:55 ZeitschriftFuerHistorischeWaffenkunde5_0122.png
$ file *png
ZeitschriftFuerHistorischeWaffenkunde5_0122.jp2.png: PNG image data, 1335 x 1602, 8-bit grayscale, non-interlaced
ZeitschriftFuerHistorischeWaffenkunde5_0122.png: PNG image data, 1335 x 1602, 8-bit/color RGB, non-interlaced
I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting result is that obviously Tesseract gets different results from grayscale and color images, even when both look exactly the same. This needs more investigation.
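To check whether two visually identical PNG files differ in color type without relying on `file`, the IHDR chunk can be read directly. A minimal sketch, assuming a well-formed PNG whose first chunk is IHDR (as the PNG specification requires); the function names are mine.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOR_TYPES = {
    0: "grayscale",
    2: "RGB",
    3: "palette",
    4: "grayscale+alpha",
    6: "RGB+alpha",
}


def png_info(data):
    """Return (width, height, bit_depth, color_type_name) from PNG bytes."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR payload starts at byte 16: width, height, bit depth, color type.
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return width, height, bit_depth, COLOR_TYPES.get(color_type, "unknown")


def read_png_info(path):
    """Read only the first 26 bytes of a file and decode its IHDR fields."""
    with open(path, "rb") as handle:
        return png_info(handle.read(26))
```

Going by the `file` output above, the two files should report "grayscale" and "RGB" respectively, with the same 1335 x 1602 dimensions.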
@yurytch, I cannot reproduce your results. Could you try the latest Debian packages for tesseract-ocr? In my tests, those Debian packages and latest Git master show the same results.
@stweil Could different results be because of different versions of leptonica library?
https://github.com/tesseract-ocr/tesseract/issues/1339#issuecomment-377267899 @yurytch is using 1.74.1 , I am using more recent versions.
I used 1.75.3-4 from Debian testing, but can repeat the test with an older version.
Answering several days' worth of msgs at once, the results of OCR'ing the initial JP2: https://yadi.sk/d/G2scDhj53TsU52
with fresh versions: Leptonica 1.75.3, Tesseract Open Source OCR Engine v4.0.0-beta.1-203-g45bb
are here: https://yadi.sk/i/7JBcJVzY3Uxw7L
There are no obvious dropouts here, however dropouts (1-3 letters at a time) still happen, with that version, too. Is it possible to make tesseract output some kind of placeholder or tag into OCR'ed text?
Same for the bogus letters introduced into the text, like in the example, 'güuülden' or 'fruünflich'. Could tesseract be made to output anything not recognised reliably enough, as some kind of 'empty glyph' or tag?
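One partial workaround is to use the per-word confidence that Tesseract writes in its TSV output (`tesseract image out tsv`) and tag words below a threshold. This is a sketch under assumptions: the threshold of 60 and the `[?...]` tag are arbitrary choices, and the function name is hypothetical. It can only flag words Tesseract did emit with low confidence; it cannot show where a word was dropped entirely.

```python
def mark_uncertain(tsv_text, min_conf=60.0):
    """Rebuild text lines from Tesseract TSV, tagging low-confidence words."""
    rows = tsv_text.splitlines()
    header = rows[0].split("\t")
    col = {name: index for index, name in enumerate(header)}
    lines = {}
    for row in rows[1:]:
        fields = row.split("\t")
        if len(fields) < len(header) or fields[col["level"]] != "5":
            continue  # level 5 rows are words; other rows describe layout
        word = fields[col["text"]]
        if not word.strip():
            continue
        key = tuple(int(fields[col[name]])
                    for name in ("page_num", "block_num", "par_num", "line_num"))
        if float(fields[col["conf"]]) < min_conf:
            word = f"[?{word}]"
        lines.setdefault(key, []).append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]
```

The hOCR output carries similar per-word confidence values (x_wconf) if that format is preferred.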
Format is irrelevant for the result, and DPI setting seems to be ignored by tesseract. I've put the density field into TIFF file by ImageMagick's 'convert', and it is there, verified by 'identify' tool, but tesseract still goes 'estimating the resolution'.
See report regarding missing line in Persian
This is very annoying... For documents in the same image format, a few lines are sometimes missing, with no obvious pattern. Mine is simple English. Can anyone tell what causes the missing lines?
@bhaveshvyas007: no, we cannot. We do not have crystal balls to determine your image, tesseract version, the command you run, the OS you use and other essential information.
@zdenop @amitdo I can't share the image, but here are the details (since this is an open issue, I didn't think to share the version info, sorry):
Ubuntu version: Ubuntu 18.04.3 LTS
Tesseract version:
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Command I run: tesseract public/uploads/1579701431490/sample.pdf.jpg public/uploads/1579701431490/result -l engfast --psm 6 hocr
Btw, I even tried -l eng, and psm 3 or 4, but the same line is always missing.
Btw, the missing line is just an address line with city, state and zip.
@Shreeshrii @zdenop
Samples : I have added few dummy samples containing the jpeg files and the hocr result here
Command : tesseract ./1582890068747/Barack,_Obama.pdf.jpg ./1582890068747/result -l engfast --psm 6 hocr
Tesseract Version Details : tesseract 4.1.0 leptonica-1.78.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 Found AVX2 Found AVX Found SSE
OS Version : Ubuntu 18.04.3 LTS
Issue: the Patient Address line is missing from the hocr output; sometimes it is the line that says Account #: instead.
Btw these issues happen only with a few samples; it works fine for most of them.
What I tried: I tried using language eng and eng-best but with no success. I can't change the psm mode because --psm 6 works fine for most of the samples, and I don't want to change the parser code.
Question: Can anyone figure out why such long lines are completely missing from the output? Any solution?
Hey guys, have you found any solution? I am facing the same issue. System configuration:
tesseract 5.0.0-alpha-20210401-71-g2be89
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
You can see the image attached below for reference. https://user-images.githubusercontent.com/29722986/116217982-b67c4680-a767-11eb-8005-f7bb0c8f55c3.png
With psm 6, the first line (date prepared) is missing and the policy numbers are not correct. With psm 11, the po box line is missing and spaces appear inside single words, for example:
AMERICAN AGENCY === ER ICAN AGENCY
94-23 JAMAICA === 94-23 JAMA ICA
Can you tell me how I can solve this issue?
@stweil,
I get the same results as you with your png image, but those results differ from the jp2 / grayscale png image results. So one interesting result is that obviously Tesseract gets different results from grayscale and color images, even when both look exactly the same. This needs more investigation.
Most likely this is caused by Tesseract's otsu thresholding.
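For readers unfamiliar with it, Otsu's method picks the threshold that maximizes the between-class variance of the pixel histogram, so any change in how pixel values are derived (for example gray versus per-channel color values) can move the threshold. Below is the textbook algorithm in plain Python, not Tesseract's own implementation; the Rec. 601 weights in `to_gray` are one common conversion and are an assumption about nothing in particular that Tesseract or leptonica does.

```python
def to_gray(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 luma using integer Rec. 601 weights."""
    return [(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in rgb_pixels]


def otsu_threshold(pixels):
    """Return t such that values <= t form the darker class (textbook Otsu)."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t
```

Running otsu_threshold on to_gray(pixels) and on each color channel of the same image separately shows whether the binarization input could differ between the grayscale and RGB files.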
@kbrajwani
Let's say if i am using psm 6
From the command line help:
6 Assume a single uniform block of text.
It does not make sense to use psm 6 on your image, which has multiple blocks. You are misleading Tesseract.
Did you try to use psm 3?
Hey @amitdo psm 3 works great. I don't remember why I changed the default psm. Thanks
https://user-images.githubusercontent.com/29722986/116717573-a6c66180-a9f6-11eb-85af-1d364de7e3ee.png Hey @amitdo please look at this image: some lines are missing with psm 3. The possible cause is that the lines are connected.
The possible cause is that the lines are connected.
You are right in your assumption. Tesseract's layout analysis can't cope with connected lines.
Thanks for confirming. Can you tell me whether there is any way to handle this?
If you can find another tool that will correctly segment the lines, you can then run Tesseract on each line.
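As a rough sketch of that workflow: given line bounding boxes from some external segmenter (not specified here), each line can be cropped and recognized with `--psm 7`, which treats the image as a single text line. The function name, file naming and padding value are assumptions; the crop uses ImageMagick's `convert`, which is already mentioned earlier in this thread.

```python
def line_commands(image, boxes, lang, pad=4):
    """Build crop and OCR commands for each (left, top, width, height) box.

    A few pixels of padding are added because a small border around the
    text seemed to help earlier in this thread.
    """
    commands = []
    for index, (left, top, width, height) in enumerate(boxes, start=1):
        x = max(left - pad, 0)
        y = max(top - pad, 0)
        geometry = f"{width + 2 * pad}x{height + 2 * pad}+{x}+{y}"
        crop = f"line_{index:03d}.png"
        commands.append(["convert", image, "-crop", geometry, "+repage", crop])
        commands.append(["tesseract", crop, f"line_{index:03d}",
                         "-l", lang, "--psm", "7"])
    return commands
```

The commands are returned as argument lists so they can be passed to subprocess.run one by one, and the per-line .txt files concatenated in order afterwards.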
@amitdo Hey, can't we train tesseract to identify the line bounding boxes? After all, we provide line-level bounding box information when training tesseract on our own images. Thanks
Environment
Current Behavior:
Brief description:
Test image:
https://user-images.githubusercontent.com/15245190/36480676-2820ca12-1748-11e8-9964-7c45a86426a5.png
Recognized with tessdata_best/bod.traineddata. First 3 lines:
PSM==6
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ། ༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ། ༢ ༈
02 (2nd line missing)
03 (3rd line missing)
PSM==11 All lines are complete but some are shattered and more inaccurate.
PSM==3
01 ༄༅། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ལམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 (2nd line missing)
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུངྱེ་ཤྲཱིའི་གསུང་ལྡེབ།༡ སློབ་དཔོན་ཆེན་པོ་རྡོ་རྗེ་གདན་པ་ཇོ་བོ་པུའངྱེ་ཤྲིས་མཛད་པའི་དགེ་བསྟེན་སྡོམ་པའི་རྣམ་པར་བཞག་པ་ཚིགས་སུ་
PSM==3, same image but slightly rotated and cropped https://user-images.githubusercontent.com/15245190/36482692-13cdb550-174f-11e8-9378-b8617342594c.png
01 ༄༅།། །ཕམ་གྱི་གསུང་ལྡེབ།༢ མཁན་ཆེན་བསྟོད་པ་འཇམ་དཔལ་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ ཆོས་རྒྱུལ་ཆེན་པོའི་བསྟོད་པ་གངས་ཅན་མ་མི་ཕམ་གྱི་གསུང་ལྡེབ།༢ རྣ
02 གཉིས་པ་རྒྱུ་མཚན་ཉིད་ཀྱི་ཐེག་པར་ལམ་གྱི་གཞི་མ་སོ་ཐར་སྐོར་ལ། སོ་ཐར་སྡོམ་བརྒྱུད་གསོལ་འདེབས་ཀུན་མཁྱེན་ལྔ་པ་ཆེན་པོའི་གསུང་ལྡེབ།༣ དགེ་བསྟེན་གྱི་སྡོམ་པའི་རྣམ་
03 པར་བཞག་པ་ཚིགས་སུ་བཅད་པ་སློབ་དཔོན་པུྱེ་ཤྲིའི་གསུང་ལྡེབ།༡
Another test image with its fourth line missing: https://user-images.githubusercontent.com/15245190/36481051-87d49898-1749-11e8-9fb0-cfa4334d2445.png
Do you have any idea what is going on, or any suggestion for what I should do? Thanks a lot! @Shreeshrii @amitdo