tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Improve textline finding for Arabic and other languages with many diacritics #657

Open theraysmith opened 7 years ago

theraysmith commented 7 years ago

Diacritics often get separated into their own text lines.

Shreeshrii commented 7 years ago

@DanBloomberg

Can you suggest a way for improving text line finding?

ref: http://www.dicklyon.com/phototech/PhotoTech_11_DocImage_Slides.pdf

See https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip for box/tiff pairs and https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing for some sample image files for Arabic.

For Devanagari samples see https://github.com/tesseract-ocr/langdata/issues/40

DanBloomberg commented 7 years ago

Improve with respect to what? What is in leptonica? What is in tesseract?

Have you looked at these prog files: arabic_lines.c and livre_pageseg.c?

Shreeshrii commented 7 years ago

Dan,

Thanks for your prompt response and links to the appropriate leptonica program files:

https://github.com/DanBloomberg/leptonica/blob/master/prog/arabic_lines.c
https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_pageseg.c

I will take a look at those.

FYI, I am not a C programmer. I am interested in good open source OCR for Indian languages and am testing tesseract for that. I am looking for improvements in how tesseract identifies the textlines for complex scripts such as Devanagari, so as to get more accurate OCR at the end. I also recently tested Arabic text with diacritics.

A search yesterday led me to your presentation on the net. Since tesseract already uses leptonica, I thought that you might be able to suggest better ways of textline finding in tesseract, especially for Arabic diacritics, Devanagari script etc. (I have edited the earlier post with links to some sample files.)

I am building the leptonica programs now and will try out the arabic_lines program on your sample image as well as other samples provided by Arabic users of tesseract.

Shreeshrii commented 7 years ago

I tried arabic_lines with both the Arabic diacritics and the Devanagari samples, and it is marking the textlines well. Results attached: result-arabic-diacritics, result-deva, textlines-arabic-diacritics, textlines-deva

Shreeshrii commented 7 years ago

For reference, these are the two input images used with arabic_lines.

arabic0 arabic-deva1

theraysmith commented 7 years ago

Not so good as you might think? Aren't the 3 yellow lines near the top and the 3 orange lines at the bottom supposed to be different colors? I think they have been fused into one line.

On Mon, Jan 16, 2017 at 9:20 PM, Shreeshrii notifications@github.com wrote:

I tried arabic_lines with both arabic diacritics and devanagari sample and it is marking the textlines well. Results attached. [image: result-arabic-diacritics] https://cloud.githubusercontent.com/assets/5095331/22008592/a8101138-dca2-11e6-8d85-a0cbcc078304.png [image: result-deva] https://cloud.githubusercontent.com/assets/5095331/22008595/a814004a-dca2-11e6-99e6-26cedd4bc4e3.png [image: textlines-arabic-diacritics] https://cloud.githubusercontent.com/assets/5095331/22008594/a8129854-dca2-11e6-848e-4378a24beb26.png [image: textlines-deva] https://cloud.githubusercontent.com/assets/5095331/22008593/a811e328-dca2-11e6-9f39-977c3942e622.png

-- Ray.

DanBloomberg commented 7 years ago

Yes, the colors tell you at a glance if you've broken or merged textlines.

Of the 4 bad merges that you show (6 lines into 2), all but one are trivially fixed with small changes in the morphology parameters. I'll do some experimenting.

-- Dan

DanBloomberg commented 7 years ago

I've finished experimenting and will push some modified code to leptonica to make this a bit more robust.

Changes will be in both pixExtractTextlines() and the demonstration code in prog/arabic_lines.

-- Dan

amitdo commented 7 years ago

@theraysmith

Maybe ocropy's line-finding algorithm can help. AFAIK, it was designed to work well with Arabic. https://github.com/tmbdev/ocropy/blob/master/ocropus-gpageseg

See this remark: https://github.com/tmbdev/ocropy/issues/46#issuecomment-112153537

It should be given a block with a uniform font size.

bmwmy commented 7 years ago

I would like to give these hints to the developers. In Arabic there are two kinds of diacritics:

  1. Letter-attached diacritics (dots like ب ت ج ث and أ آ ؤ), which stick to the letter and are mandatory.

  2. Vowel diacritics like ( ْ ّ َ ً ِ ٍ ), which can be combined with any letter and are optional. Kids learn them to read properly, as they help get rid of ambiguity: عَلم and عِلم are two different words, but we use the context to distinguish them when vowel diacritics are absent.

N.B. In لَاْ إِلَهَ إٍلا الله, note that َ and ِ are different vowels that have exactly the same shape but are used differently: e.g. أَ is pronounced "a" while ِأ is pronounced "e". The first is written above the letter, the second below it.

Bottom line: vowel diacritics in Arabic should be recognized on their own (i.e. in a separate box), because they can appear on any letter and form a limited set ( ّ َ ً ُ ٌ ِ ٍ ْ ). (I am still thinking about how to distinguish the two vowels in the case above if they get the same box!) As a special case, ّ can also be combined with other vowel diacritics, as in ًّ ّْ

The set is limited, but each diacritic can be repeated heavily, because every letter can be combined with it.
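
The ambiguity described above can be illustrated with a small shell sketch. It assumes GNU sed and UTF-8 text, and it treats only the code points U+064B to U+0652 (tanween, short vowels, shadda and sukun) as vowel diacritics. The helper name `strip_harakat` is hypothetical, and this is not something Tesseract itself does.

```shell
# Remove Arabic vowel diacritics U+064B..U+0652 (UTF-8 bytes D9 8B .. D9 92).
# Hypothetical helper for illustration; assumes GNU sed and UTF-8 text on stdin.
strip_harakat() {
    LC_ALL=C sed 's/\xd9[\x8b-\x92]//g'
}

# The two words from the comment above, one with fatha and one with kasra,
# written as octal UTF-8 bytes so the script does not depend on the terminal.
word_a=$(printf '\330\271\331\216\331\204\331\205' | strip_harakat)
word_b=$(printf '\330\271\331\220\331\204\331\205' | strip_harakat)

if [ "$word_a" = "$word_b" ]; then
    echo "both words reduce to: $word_a"
fi
```

Once the vowel marks are removed, the two words are byte-for-byte identical, which is why a reader (or an OCR engine) has only the context to go on when the diacritics are absent or lost during line finding.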

hope this could help Tesseract developers

theraysmith commented 7 years ago

Thanks for the information. Please take a look at the attached unicharset, and let me know if you see any deficiencies. I notice that ZWJ and ZWNJ are not there, but U+202C (Pop Directional Formatting) is. It seems to contain all the diacritics that you mentioned.
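
A check like the one described here can be scripted. The sketch below assumes the usual unicharset layout, where the first line is the entry count and every later line starts with the unichar followed by a space. The helper names `has_unichar` and `report_controls` are hypothetical and not part of Tesseract.

```shell
# Succeeds if the unicharset file ($1) has an entry equal to the character ($2).
# Assumption: line 1 is the entry count and each later line starts with the
# unichar followed by a space. Hypothetical helper, not part of Tesseract.
has_unichar() {
    tail -n +2 "$1" | cut -d ' ' -f 1 | LC_ALL=C grep -qxF -- "$2"
}

# Report ZWNJ (U+200C), ZWJ (U+200D) and PDF (U+202C) for the file in $1.
# The characters are given as octal UTF-8 bytes and expanded by printf.
report_controls() {
    for entry in 'ZWNJ:\342\200\214' 'ZWJ:\342\200\215' 'PDF:\342\200\254'; do
        name=${entry%%:*}
        if has_unichar "$1" "$(printf "${entry#*:}")"; then
            echo "$name present"
        else
            echo "$name missing"
        fi
    done
}
```

Calling `report_controls path/to/your.unicharset` (the path is a placeholder) prints one line per control character, which makes it easy to compare unicharsets from different training runs.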

-- Ray.

Shreeshrii commented 7 years ago

@theraysmith

Please take a look at the attached unicharset,

No file was attached.

Please also see https://github.com/tesseract-ocr/tesseract/issues/894#issue-226872462

amitdo commented 6 years ago

@zdenop,

Please label it as 'layout-analysis'.

agorararmard commented 5 years ago

Hi All, Can someone please tell me an automated way to generate the box files for training for the Arabic language if I already have the .gt.txt ground truth files as well as the tiff files?

Thanks in advance

Shreeshrii commented 5 years ago

@agorararmard You can try the wordstrbox option to create line boxes. For single page tiffs, I use sed processing as follows to update the box file generated by tesseract with the ground truth. I have tested for English and Devanagari script.


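# For each single-page TIFF, build a WordStr box file from Tesseract's line
# boxes plus the ground truth, then generate the .lstmf training file.
for my_file in ./eng.test*.tif; do
    base="${my_file%.*}"
    echo -e "\n${base}"
    # 1. Let Tesseract find the text lines and write them in WordStr format.
    OMP_THREAD_LIMIT=1 tesseract "$my_file" "$base" -l eng --psm 6 --tessdata-dir ~/tesseract/tessdata --oem 1 wordstrbox
    mv "${base}.box" "${base}.wordstrbox"
    # 2. Drop the recognized text after '#', keeping only the coordinates.
    sed -i -e "s/ \#.*/ \#/g" "${base}.wordstrbox"
    # 3. Remove blank lines from the ground truth, then add a blank line after
    #    each text line (the script assumes two box lines per text line).
    sed -e '/^$/d' "${base}.gt.txt" > tmp.txt
    sed -e 's/$/\n/g' tmp.txt > "${base}.gt.txt"
    # 4. Append each ground-truth line to its box line (empty delimiter).
    paste --delimiters="\0" "${base}.wordstrbox" "${base}.gt.txt" > "${base}.box"
    rm "${base}.wordstrbox" tmp.txt
    # 5. Create the .lstmf file from the image and the corrected box file.
    OMP_THREAD_LIMIT=1 tesseract "$my_file" "$base" -l eng --psm 6 --tessdata-dir ~/tesseract/tessdata --oem 1 lstm.train
done
ls -1 ./*.lstmf > ./eng.training_files.txt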
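
To see what the sed and paste steps do without running Tesseract, here is a toy version on made-up data. The coordinates, the misrecognized text and the file names are all invented for illustration, and the two-lines-per-text-line layout is an assumption inferred from the script. Like the script, it relies on GNU sed for `\n` in the replacement.

```shell
# Toy illustration of the sed + paste merge; all data below is made up.
work=$(mktemp -d)

# Hypothetical wordstrbox output: a WordStr line plus a tab line per text line.
printf 'WordStr 10 80 300 100 0 #helo wrld\n\t 301 80 305 100 0\nWordStr 10 50 300 70 0 #secnd lin\n\t 301 50 305 70 0\n' > "$work/page.wordstrbox"
# Hypothetical ground truth, one corrected line per text line.
printf 'hello world\nsecond line\n' > "$work/page.gt.txt"

# Keep only the coordinates and the '#' marker.
sed -e 's/ #.*/ #/' "$work/page.wordstrbox" > "$work/coords.txt"
# Drop blank lines, then put one blank line after every ground-truth line.
sed -e '/^$/d' "$work/page.gt.txt" | sed -e 's/$/\n/' > "$work/gt_spaced.txt"
# Join the two files line by line with an empty delimiter.
paste -d '\0' "$work/coords.txt" "$work/gt_spaced.txt" > "$work/page.box"

cat "$work/page.box"
```

Each `WordStr` line ends up with the ground-truth text after the `#`, while the tab lines are paired with the inserted blank lines and stay unchanged. This only works if the ground truth has exactly one line per text line that Tesseract detected, so a merged or split textline shifts every later pairing.
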

agorararmard commented 5 years ago

@Shreeshrii can you please tell me the requirements for running this script? I'm not very familiar with Tesseract or with scripts like this. Thank you very much.

Shreeshrii commented 5 years ago

@agorararmard Please see https://github.com/tesseract-ocr/tesseract/issues/2082 and other issues related to Arabic. I am not sure how much success people have had in training Arabic.

I am attaching a zip file with sample images (taken from earlier issues posted in the repo) and ground truth (for some). You can correct the ground truth for the two TOC images and then run the script. It will create the box files in wordstrbox format which can be used for LSTM training.

arabic.zip

You need to keep your image and ground truth files in a similar format and then run the script.
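
Before running the script, it can help to confirm that every image has its ground truth next to it. The sketch below assumes the naming the script uses (NAME.tif beside NAME.gt.txt); the function name `missing_ground_truth` is hypothetical.

```shell
# List images in a directory that lack a matching ground-truth file.
# Assumes the naming used by the script: NAME.tif next to NAME.gt.txt.
# Hypothetical helper for illustration.
missing_ground_truth() {
    for tif in "$1"/*.tif; do
        [ -e "$tif" ] || continue                  # directory has no .tif files
        [ -f "${tif%.*}.gt.txt" ] || echo "$tif"   # print images without ground truth
    done
}
```

For example, `missing_ground_truth ./arabic` (the directory name is a placeholder) prints nothing when every image is paired, and one path per line otherwise.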

amitdo commented 4 years ago

https://github.com/tesseract-ocr/tesseract/blob/9234b4c62db34b/cube/cube_line_segmenter.cpp

DanBloomberg commented 4 years ago

@amitdo I worked a bit on this about 3 years ago. Made a new function pixExtractRawTextlines(), and claimed pixExtractTextlines() is now more robust. But I don't remember the details.

amitdo commented 4 years ago

Dan,

Yes, I know about your Leptonica code with improved Arabic handling.

I found some old code in Tesseract that claims to handle Arabic, so I linked to it.

It was part of the 'Cube' OCR engine, which was used in the past but was removed after the LSTM+CNN OCR engine was added.

amitdo commented 4 years ago

Dan, it was not my intention to 'call' you and grab your attention. You were alerted by GitHub because you participated in this issue discussion in the past.

DanBloomberg commented 4 years ago

No problem at all. Glad to be reminded that I'd improved the algorithm back in 2017 :-)