Open theraysmith opened 7 years ago
@DanBloomberg
Can you suggest a way for improving text line finding?
ref: http://www.dicklyon.com/phototech/PhotoTech_11_DocImage_Slides.pdf
See https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip for box/tiff pairs and https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing for some sample image files for Arabic.
For Devanagari samples see https://github.com/tesseract-ocr/langdata/issues/40
Improve with respect to what? What is in leptonica? What is in tesseract?
Have you looked at these prog files: arabic_lines.c livre_pageseg.c
Dan,
Thanks for your prompt response and links to the appropriate leptonica program files. https://github.com/DanBloomberg/leptonica/blob/master/prog/arabic_lines.c https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_pageseg.c I will take a look at those.
FYI, I am not a C programmer. I am interested in good open source OCR for Indian languages and am testing tesseract for that. I am looking for improvements in how tesseract identifies the textlines for complex scripts such as Devanagari, so as to get more accurate OCR in the end. I also recently tested Arabic text with diacritics.
A search yesterday led me to your presentation on the net. Since tesseract already uses leptonica, I thought that you might be able to suggest better ways of textline finding in tesseract, especially for Arabic diacritics, Devanagari script, etc. (I have edited the earlier post with links to some sample files.)
I am building the leptonica programs now and will try out the arabic_lines program on your sample image as well as other samples provided by Arabic users of tesseract.
I tried arabic_lines with both the Arabic diacritics and Devanagari samples, and it is marking the textlines well. Results attached.
For reference, these are the two input images used with arabic_lines.
Not so good as you might think? Aren't the 3 yellow lines near the top and the 3 orange lines at the bottom supposed to be different colors? I think they have been fused into one line.
On Mon, Jan 16, 2017 at 9:20 PM, Shreeshrii notifications@github.com wrote:
I tried arabic_lines with both arabic diacritics and devanagari sample and it is marking the texlines well. Results attached. [image: result-arabic-diacritics] https://cloud.githubusercontent.com/assets/5095331/22008592/a8101138-dca2-11e6-8d85-a0cbcc078304.png [image: result-deva] https://cloud.githubusercontent.com/assets/5095331/22008595/a814004a-dca2-11e6-99e6-26cedd4bc4e3.png [image: textlines-arabic-diacritics] https://cloud.githubusercontent.com/assets/5095331/22008594/a8129854-dca2-11e6-848e-4378a24beb26.png [image: textlines-deva] https://cloud.githubusercontent.com/assets/5095331/22008593/a811e328-dca2-11e6-9f39-977c3942e622.png
-- Ray.
Yes, the colors tell you at a glance if you've broken or merged textlines.
Of the 4 bad merges that you show (6 lines into 2), all but one are trivially fixed with small changes in the morphology parameters. I'll do some experimenting.
-- Dan
I've finished experimenting and will push some modified code to leptonica to make this a bit more robust.
Changes will be in both pixExtractTextlines() and the demonstration code in prog/arabic_lines.
-- Dan
@theraysmith
Maybe ocropy's line-finding algorithm can help. AFAIK, it was designed to work well with Arabic. https://github.com/tmbdev/ocropy/blob/master/ocropus-gpageseg
See this remark: https://github.com/tmbdev/ocropy/issues/46#issuecomment-112153537
It should be given a block with a uniform font size.
I would like to give these hints to the developers. In Arabic there are two kinds of diacritics:
1- Letter-attached diacritics (dots as in ب ت ج ث, and أ آ ؤ), which stick to the letter and are mandatory.
2- Vowel diacritics like ( ْ ّ َ ً ِ ٍ ), which can be combined with any letter and are optional. Kids learn them in order to read properly, as they help get rid of ambiguity: عَلم and عِلم are two different words, but we use the context to distinguish them when vowel diacritics are absent.
N.B. لَاْ إِلَهَ إٍلا الله note that these َ ِ are different vowels with exactly the same shape but used differently, e.g. أَ is pronounced "a" while ِأ is pronounced "e". The first is written above the letter, the latter below it.
Bottom line: vowel diacritics in Arabic should be recognized alone (e.g. in a separate box), because they can appear on any letter and form a limited set ( ّ َ ً ُ ٌ ِ ٍ ْ ). (But I am still thinking about how to distinguish the case above if it is the same box!) Special case: this ّ can also be combined with other vowel diacritics, e.g. ًّ ّْ
The set is limited, but its members are repeated heavily, because every letter can be combined with them.
Hope this helps the Tesseract developers.
Thanks for the information. Please take a look at the attached unicharset, and let me know if you see any deficiencies. I notice that ZWJ and ZWNJ are not there, but 202c (Pop directional formatting) is. It seems to contain all the diacritics that you mentioned.
-- Ray.
@theraysmith
Please take a look at the attached unicharset,
No file was attached.
Please also see https://github.com/tesseract-ocr/tesseract/issues/894#issue-226872462
@zdenop,
Please label it as 'layout-analysis'.
Hi All, Can someone please tell me an automated way to generate the box files for training for the Arabic language if I already have the .gt.txt ground truth files as well as the tiff files?
Thanks in advance
@agorararmard You can try the wordstrbox option to create line boxes. For single-page TIFFs, I use the following sed processing to replace the text in the box file generated by tesseract with the ground truth. I have tested it for English and Devanagari script.
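# For each single-page TIFF, build a WordStr box file from the ground truth,
# then generate the .lstmf file used for LSTM training.
for my_file in ./eng.test*.tif; do
    base="${my_file%.*}"
    echo -e "\n${base}"
    # Let tesseract find the text lines and write ${base}.box in WordStr format.
    OMP_THREAD_LIMIT=1 tesseract "$my_file" "$base" -l eng --psm 6 --tessdata-dir ~/tesseract/tessdata --oem 1 wordstrbox
    mv "${base}.box" "${base}.wordstrbox"
    # Drop the recognized text after '#', keeping only the line coordinates.
    sed -i -e "s/ \#.*/ \#/g" "${base}.wordstrbox"
    # Remove blank lines from the ground truth, then add one blank line after
    # each text line. Note that this rewrites the .gt.txt file in place.
    sed -e '/^$/d' "${base}.gt.txt" > tmp.txt
    sed -e 's/$/\n/g' tmp.txt > "${base}.gt.txt"
    # Join the files line by line with an empty delimiter, so each
    # ground-truth line lands after the '#' of its box line.
    paste --delimiters="\0" "${base}.wordstrbox" "${base}.gt.txt" > "${base}.box"
    rm "${base}.wordstrbox"
    OMP_THREAD_LIMIT=1 tesseract "$my_file" "$base" -l eng --psm 6 --tessdata-dir ~/tesseract/tessdata --oem 1 lstm.train
done
# List the generated training files.
ls -1 ./*.lstmf > ./eng.training_files.txt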
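To see what the sed and paste steps in the script above do without running tesseract, here is a minimal sketch on toy data. The box coordinates, the misrecognized text and the file names are all made up for illustration, and the two-lines-per-text-line layout of the box file is an assumption about what the wordstrbox config writes. GNU sed and GNU paste are assumed, as in the script above.

```shell
#!/usr/bin/env bash
# Toy demonstration of the sed/paste merge; tesseract is not needed.
# All coordinates, texts and file names below are hypothetical.
set -eu
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Assumed box layout: one "WordStr" line per text line, each followed
# by an end-of-line entry that starts with a tab.
printf 'WordStr 10 200 300 230 0 #helo wrld\n\t 301 200 305 230 0\n' > demo.raw.box
printf 'WordStr 10 150 300 180 0 #secnd line\n\t 301 150 305 180 0\n' >> demo.raw.box

# Ground truth containing a stray blank line.
printf 'hello world\n\nsecond line\n' > demo.gt.txt

# 1. Drop the recognized text after '#', keeping the coordinates.
sed -e 's/ #.*/ #/' demo.raw.box > demo.wordstrbox
# 2. Drop blank lines from the ground truth.
sed -e '/^$/d' demo.gt.txt > demo.clean.txt
# 3. Add one blank line after each text line (GNU sed expands \n).
sed -e 's/$/\n/' demo.clean.txt > demo.padded.txt
# 4. Join line by line with an empty delimiter.
paste --delimiters='\0' demo.wordstrbox demo.padded.txt > demo.box

cat demo.box
```

Unlike the script above, this sketch writes the padded ground truth to a separate file instead of overwriting the .gt.txt, which makes it safe to rerun.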
@Shreeshrii can you please tell me the requirements for running this script because I'm not very familiar with Tesseract and scripts alike. Thank you very much.
@agorararmard Please see https://github.com/tesseract-ocr/tesseract/issues/2082 and other issues related to Arabic. I am not sure how much success people have had in training Arabic.
I am attaching a zip file with sample images (taken from earlier issues posted in the repo) and ground truth (for some). You can correct the ground truth for the two TOC images and then run the script. It will create the box files in wordstrbox format which can be used for LSTM training.
You need to keep your image and ground truth files in a similar format and then run the script.
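Before running the script, it may help to check that every image has a matching ground truth file. A minimal sketch, assuming the naming used in the script above (name.tif next to name.gt.txt); missing_gt is a hypothetical helper name, not something that ships with tesseract.

```shell
#!/usr/bin/env bash
# Print every .tif in a directory that has no matching .gt.txt file.
# missing_gt is a hypothetical helper, not a tesseract tool.
missing_gt() {
    local dir="$1" tif
    for tif in "$dir"/*.tif; do
        # If the directory has no .tif files the glob stays literal; skip it.
        [ -e "$tif" ] || continue
        # Report images whose ground truth file is absent.
        [ -f "${tif%.*}.gt.txt" ] || echo "$tif"
    done
}
```

Calling `missing_gt .` in the directory with the images prints nothing when every image has its ground truth.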
@amitdo I worked a bit on this about 3 years ago. Made a new function pixExtractRawTextlines(), and claimed pixExtractTextlines() is now more robust. But I don't remember the details.
Dan,
Yes, I know about your Leptonica code with improved Arabic handling.
I found some old code in Tesseract that claims to handle Arabic, so I linked to that code.
It was part of the 'Cube' OCR engine, which was used in the past but was removed after the LSTM+CNN OCR engine was added.
Dan, it was not my intention to 'call' you and grab your attention. You were alerted by GitHub because you participated in this issue discussion in the past.
No problem at all. Glad to be reminded that I'd improved the algorithm back in 2017 :-)
Diacritics often get separated into their own text lines.