tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

LSTM: Training - Arabic - Add Top layer - Aborted (core dumped) #642

Closed Shreeshrii closed 7 years ago

Shreeshrii commented 7 years ago

While "Add Top layer" LSTM training worked for languages with a Latin-based unicharset (eng, nor), it fails for Arabic.

The logs for creating the lstmf files and for the training run are copied below.

Shreeshrii commented 7 years ago
$ training/tesstrain.sh --fonts_dir /home/shree/.fonts --lang ara --linedata_only --noextract_font_properties \
>   --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ~/tesstutorial/aralayer

=== Starting training for language 'ara'
[Sat Jan 7 10:09:33 DST 2017] /usr/local/bin/text2image --fonts_dir=/home/shree/.fonts --font=Arial Unicode MS --outputbase=/tmp/font_tmp.0Tqbe3jIFz/sample_text.txt --text=/tmp/font_tmp.0Tqbe3jIFz/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz
Rendered page 0 to file /tmp/font_tmp.0Tqbe3jIFz/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial Unicode MS
Rendering using Amiri
Rendering using Arial
Rendering using Scheherazade
Rendering using Calibri
Rendering using Tahoma
Rendering using FreeSerif
Rendering using Microsoft Sans Serif
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0 --font=Arial Unicode MS --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0 --font=Amiri --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0 --font=Arial --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0 --font=Scheherazade --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0 --font=Calibri --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0 --font=Tahoma --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0 --font=FreeSerif --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0 --font=Microsoft Sans Serif --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 2 unrenderable words
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 13 unrenderable words
Stripped 15 unrenderable words
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendering using Times New Roman,
Rendering using Courier New
Rendering using Traditional Arabic
[Sat Jan 7 10:10:02 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0 --font=Times New Roman, --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:10:03 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0 --font=Courier New --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:10:03 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0 --font=Traditional Arabic --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 7 10:10:13 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.Ey23alPX8e/ara/ /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box
Wrote unicharset file /tmp/tmp.Ey23alPX8e/ara//unicharset.
[Sat Jan 7 10:10:14 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.Ey23alPX8e/ara/ara.unicharset -O /tmp/tmp.Ey23alPX8e/ara/ara.unicharset -X /tmp/tmp.Ey23alPX8e/ara/ara.xheights --script_dir=../langdata
Loaded unicharset of size 381 from file /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Writing unicharset to file /tmp/tmp.Ey23alPX8e/ara/ara.unicharset

=== Phase D: Generating Dawg files ===
Generating word Dawg
[Sat Jan 7 10:10:14 DST 2017] /usr/local/bin/wordlist2dawg -r 1 ../langdata/ara/ara.wordlist /tmp/tmp.Ey23alPX8e/ara/ara.word-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.wordlist'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.word-dawg'
Generating frequent-word Dawg
[Sat Jan 7 10:10:20 DST 2017] /usr/local/bin/wordlist2dawg -r 1 /tmp/tmp.Ey23alPX8e/ara/ara.wordlist.clean.freq /tmp/tmp.Ey23alPX8e/ara/ara.freq-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '/tmp/tmp.Ey23alPX8e/ara/ara.wordlist.clean.freq'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.freq-dawg'
[Sat Jan 7 10:10:20 DST 2017] /usr/local/bin/wordlist2dawg -r 2 ../langdata/ara/ara.punc /tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_FORCE_REVERSE
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.punc'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg'
[Sat Jan 7 10:10:21 DST 2017] /usr/local/bin/wordlist2dawg -r 0 ../langdata/ara/ara.numbers /tmp/tmp.Ey23alPX8e/ara/ara.number-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_DO_NO_REVERSE
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.numbers'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.number-dawg'
[Sat Jan 7 10:10:21 DST 2017] /usr/local/bin/wordlist2dawg -r 1 ../langdata/ara/ara.word.bigrams /tmp/tmp.Ey23alPX8e/ara/ara.bigram-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.word.bigrams'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.bigram-dawg'

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Sat Jan 7 10:10:31 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0 lstm.train ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Detected 1300 diacritics
Detected 675 diacritics
Detected 923 diacritics
Page 2
Page 2
No block overlapping textline: اونُمَآ نَيذِلَّا اوقُلَ اذَإِوَ نَومُلَعْيَ الَ نْكِلَوَ ءُاهَفَسُّلا مُهُ مْهُنَّإِ الَأَ ءُاهَفَسُّلا نَمَآ اكَمَ
No block overlapping textline: امَّلَفَ ارًانَ دَقَوْتَسْا يذِلَّا لِثَمَكَ مْهُلُثَمَ نَيدِتَهْمُ اونُاكَ امَوَ مْهُتُرَاجَتِ تْحَبِرَ امَفَ ىدَهُلْابِ
Page 2
Page 2
Page 2
Page 2
Page 2
Page 2
Loaded 39/39 pages (1-39) of document /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf
Page 3
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.lstmf
Loaded 53/53 pages (1-53) of document /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 50/50 pages (1-50) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 36/36 pages (1-36) of document /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf
Loaded 59/59 pages (1-59) of document /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.lstmf
Page 3
Page 3
Page 3
Loaded 83/83 pages (1-83) of document /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.lstmf
Loaded 109/109 pages (1-109) of document /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf
Loaded 100/100 pages (1-100) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 79/79 pages (1-79) of document /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0 lstm.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0 lstm.train ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Page 2
Page 2
Page 2
Loaded 43/43 pages (1-43) of document /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf
Page 3
Loaded 56/56 pages (1-56) of document /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.lstmf
Loaded 53/53 pages (1-53) of document /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf
Page 3
Loaded 90/90 pages (1-90) of document /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf
Loaded 109/109 pages (1-109) of document /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf

=== Constructing LSTM training data ===
Creating new directory /home/shree/tesstutorial/aralayer
Copying ../langdata/ara/ara.config to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.unicharset to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.number-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-number-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-punc-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.word-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-word-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf to /home/shree/tesstutorial/aralayer

Completed training for language 'ara'
Shreeshrii commented 7 years ago
$ mkdir -p ~/tesstutorial/aralayer_from_ara
$ combine_tessdata -e ../tessdata/ara.traineddata \
>   ~/tesstutorial/aralayer_from_ara/ara.lstm
Extracting tessdata components from ../tessdata/ara.traineddata
Wrote /home/shree/tesstutorial/aralayer_from_ara/ara.lstm
$
$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>   --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --eval_listfile ~/tesstutorial/ara/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/ara.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/shree/tesstutorial/aralayer_from_ara/ara.lstm
Mirror { of } is not in unicharset
Appending a new network to an old one!!Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Setting properties for script Arabic
Warning: given outputs 105 not equal to unicharset of 106.
Num outputs,weights in serial:
  Lfx256:256, 394240
  Fc106:106, 27242
Total weights = 421482
Built network:[1,0,0,1[C5,5Ft16]Mp3,3Lfys64Lfx128Lrx128Lfx256Fc106] from request [Lfx256 O1c105]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.0001, momentum=0.9
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
At iteration 100/100/100, Mean rms=6.949%, delta=69.759%, char train=127.235%, word train=100%, skip ratio=0%,  New worst char error = 127.235 wrote checkpoint.

At iteration 200/200/200, Mean rms=6.558%, delta=62.072%, char train=116.738%, word train=100%, skip ratio=0%,  New worst char error = 116.738 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffff
d9 ffffff8e 20 20 ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffffd9 ffffff8e 20 ffffffd9
ffffff91 20 ffffffd8 ffffffa8 ffffffd9 ffffff91 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff8e 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 fff
fff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff90 20 ffffffd8 ffffffaf ffffffd9 ffffff8f ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffad ffffffd9 ffffff8e
ffffffd9 ffffff84 ffffffd9 ffffff92 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd9 ffffff8a ffffffd8 ffffffad ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffff
d9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd8 ffffffad ffffffd9 fff
fff92 ffffffd8 ffffffb1 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: / بَـَتـكَ ةحتف تاكرحلا  ْ ٍ ِ ٌُ ً َ  ْ ٍ ِ ٌُ ً َ ّ بِّرَ هِلَّلِ دُمْحَلْا مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
At iteration 300/300/301, Mean rms=6.463%, delta=59.695%, char train=111.691%, word train=100%, skip ratio=0.333%,  New worst char error = 111.691 wrote checkpoint.

At iteration 400/400/401, Mean rms=6.363%, delta=57.356%, char train=106.695%, word train=100%, skip ratio=0.25%,  New worst char error = 106.695 wrote checkpoint.

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
Aborted (core dumped)
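
The "Failure bytes" dump in the log above appears to print each byte sign-extended to 32 bits (for example `ffffffd9` for the byte `0xd9`), which makes it hard to read. Below is a minimal Python sketch for turning such a dump back into text. It assumes the bytes are UTF-8 and that only the low 8 bits of each token matter. `decode_failure_bytes` is a hypothetical helper written for this illustration, not part of Tesseract.

```python
def decode_failure_bytes(dump: str) -> str:
    """Decode a whitespace-separated hex dump such as 'ffffffd9 ffffff85 20'."""
    # Keep only the low 8 bits of each token, which drops the 'ffffff' prefix.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # Assumption: the original string was UTF-8 encoded.
    return raw.decode("utf-8")


if __name__ == "__main__":
    # The last twelve tokens of the dump above.
    tail = (
        "ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 "
        "ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90"
    )
    # Print code points rather than glyphs so the result does not depend
    # on the terminal's Arabic or bidi support.
    for ch in decode_failure_bytes(tail):
        print(f"U+{ord(ch):04X}")
```

When pasting a dump that the console has wrapped in the middle of a token, the wrapped lines need to be rejoined first, otherwise the split token decodes to the wrong bytes.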
Shreeshrii commented 7 years ago

This seems to happen only when --eval_listfile is given; training continues when it is omitted. See below:

shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-ocr$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>    --eval_listfile ~/tesstutorial/ara/ara.training_files.txt  \
>   --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 31/113 pages (83-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 31/113 pages (83-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
At iteration 16533/33300/33327, Mean rms=0.79%, delta=0.326%, char train=2.38%, word train=11.082%, skip ratio=0.1%,  New worst char error = 2.38 wrote checkpoint.

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
2 Percent improvement time=7141, best error was 4.338 @ 9418
Aborted (core dumped)

Without --eval_listfile, the process continues:

 shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-ocr$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>    --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 22/113 pages (92-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
Loaded 22/113 pages (92-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
2 Percent improvement time=7141, best error was 4.338 @ 9418
At iteration 16559/33400/33427, Mean rms=0.776%, delta=0.33%, char train=2.313%, word train=10.483%, skip ratio=0.1%,  New best char error = 2.313 wrote best model:/home/shree/tesstutorial/aralayer_from_ara/aralayer2.313_16559.lstm wrote checkpoint.

2 Percent improvement time=7177, best error was 4.338 @ 9418
At iteration 16595/33500/33527, Mean rms=0.778%, delta=0.334%, char train=2.312%, word train=10.634%, skip ratio=0.1%,  New best char error = 2.312 wrote checkpoint.

Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
At iteration 16627/33600/33627, Mean rms=0.788%, delta=0.344%, char train=2.473%, word train=11.073%, skip ratio=0%,  New worst char error = 2.473 wrote checkpoint.

At iteration 16664/33700/33727, Mean rms=0.79%, delta=0.356%, char train=2.519%, word train=11.23%, skip ratio=0%,  New worst char error = 2.519 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffff
d9 ffffff8e 20 20 ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffffd9 ffffff8e 20 ffffffd9
ffffff91 20 ffffffd8 ffffffa8 ffffffd9 ffffff91 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff8e 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 fff
fff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff90 20 ffffffd8 ffffffaf ffffffd9 ffffff8f ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffad ffffffd9 ffffff8e
ffffffd9 ffffff84 ffffffd9 ffffff92 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd9 ffffff8a ffffffd8 ffffffad ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffff
d9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd8 ffffffad ffffffd9 fff
fff92 ffffffd8 ffffffb1 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: / بَـَتـكَ ةحتف تاكرحلا  ْ ٍ ِ ٌُ ً َ  ْ ٍ ِ ٌُ ً َ ّ بِّرَ هِلَّلِ دُمْحَلْا مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
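
The "Failure bytes" dump above can be turned back into readable text. The following is a minimal sketch, assuming each token such as `ffffffd9` is a single byte printed as a sign-extended integer and that the bytes are UTF-8; `decode_failure_bytes` is a hypothetical helper, not part of Tesseract. It does not rejoin tokens that the console wrapped across two lines.

```python
def decode_failure_bytes(dump: str) -> str:
    """Decode a whitespace-separated hex dump of sign-extended bytes as UTF-8."""
    # Assumption: "ffffffd9" is the byte 0xd9 printed as a signed 32-bit value,
    # so masking with 0xFF recovers the original byte.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


# First six tokens of the dump in the log above.
sample = "ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20"
text = decode_failure_bytes(sample)

# Print the code points so combining marks are easy to identify.
print([f"U+{ord(ch):04X}" for ch in text])
```

Listing the code points makes it easier to compare the characters that failed to encode against the entries in the unicharset.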
ghost commented 7 years ago

@Shreeshrii I have noticed that the Arabic text in your log is reversed.

Your log shows: مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
It should be: بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

An English illustration of the mistake:
Correct: Peace Be Upon You
Wrong: uoY nopU eB ecaeP

Arabic is read and written from right to left (RTL).
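
The English illustration corresponds to reversing a string character by character. A minimal Python sketch of that effect follows; `reverse_code_points` is a hypothetical name. Whether the training data itself is reversed or only the console display is not established at this point in the thread.

```python
def reverse_code_points(text: str) -> str:
    """Return the string with its code points in reverse order."""
    # Slicing with a step of -1 walks the string from the last code point
    # to the first, which is what the reversed log line looks like.
    return text[::-1]


print(reverse_code_points("Peace Be Upon You"))
```

For Arabic text with diacritics, a reversal like this also places each combining mark before its base letter instead of after it.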

Shreeshrii commented 7 years ago

Thanks for pointing it out.

I neither know Arabic nor am familiar with bidi.

Is it just one line that is reversed or all?

I am using the training text from langdata, prefixed with a sample with diacritics provided by @bmwmy, along with a few words copied from Wikipedia.

I copied the error message from the console. I could try saving the log to a file to see whether that is correct, since it is possible that my locale under bash on Windows 10 does not support Arabic.

bmwmy commented 7 years ago

@Shreeshrii could you post some of the generated image files (tif) so we can check whether the Arabic text is rendered correctly?

Shreeshrii commented 7 years ago

Please see the attached zip file; it has the training text, a box/tiff pair and the unicharset.

ShreeDevi

ghost commented 7 years ago

@Shreeshrii

Shreeshrii commented 7 years ago

I had attached the file via email. Maybe GitHub does not allow that. I will upload it on the forum.

On 10-Jan-2017 5:30 PM, "christophered" notifications@github.com wrote:

@Shreeshrii https://github.com/Shreeshrii

All the Arabic language lines are reversed.

I have checked the samples from #552 https://github.com/tesseract-ocr/tesseract/issues/552. The "Original_Text.txt" was encoded in UTF-8-BOM and everything seems okay.

So please attach the tif/box files that you are using; I am not seeing any zip files here.

Shreeshrii commented 7 years ago

ara.TRAINING.zip

Uploaded zip file with training data for a group of fonts which have coverage for Arabic on Windows.

It is possible that the tesstrain.sh process is dropping diacritics as noise. I am trying to change config variables to see if I can get some improvement.

Shreeshrii commented 7 years ago

Attached is a log file which shows verbose output for every iteration of training, taken from the middle of the current training session.

traininglog-mid.txt

ghost commented 7 years ago

@Shreeshrii What font size are you using for the "Traditional Arabic"?

Initial Observation:

When I used them (the letter extenders) in my training process, I merged the letter extender with the Arabic letter into one single box and used that Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize each Arabic letter in its multiple positions; as you know, Arabic letters have multiple forms depending on their position in the word (beginning, middle, ending, isolated). Example: ( كـ ) is not ( ك + ـ ) in the box file; it should be ( ك ). Likewise ( ـكـ ) and ( ـك ) are the single character ( ك ) in different positions. This is important in the box file.

Which also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ )
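
The merging described above can be sketched in Python as a cleanup of the label text. This is an illustration only, assuming the "letter extender" is the Unicode character U+0640 (ARABIC TATWEEL); `strip_tatweel` is a hypothetical helper and not something Tesseract provides.

```python
TATWEEL = "\u0640"  # assumed code point of the "letter extender"


def strip_tatweel(label: str) -> str:
    """Drop extenders so a letter keeps one label in every position."""
    # The base letter and any diacritic attached to it are left untouched;
    # only the extender character is removed.
    return label.replace(TATWEEL, "")


# Kaf followed by an extender, and kaf + fatha followed by an extender.
for label in ("\u0643\u0640", "\u0640\u0643\u0640", "\u0643\u064e\u0640"):
    cleaned = strip_tatweel(label)
    print([f"U+{ord(ch):04X}" for ch in cleaned])
```

With this cleanup, the initial, medial and final forms of kaf all map to the single label U+0643, and kaf with fatha maps to U+0643 U+064E.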

ghost commented 7 years ago

@theraysmith @amitdo @Shreeshrii

Tesseract 4.0 LSTM puts the spaces between the words into boxes, as you know. A problem arises from the ordering of the box file: the boxes are mistakenly laid out LTR (left to right) for Arabic, which is wrong and causes jumps from the end of the first line to the end of the last letter of the line after it. See the attached image (box disorder).

ghost commented 7 years ago

(Control Panel / Clock, Language, and Region / Region / Administrative / Change system locale / Arabic "Saudi Arabia")

Also, when using a txt file, the words are not in their correct order. In Google Chrome the words are correct, but once I copy them and paste them into a text file, the order changes. What a weird issue.

ghost commented 7 years ago

@theraysmith @amitdo @Shreeshrii

Shreeshrii commented 7 years ago

@christophered

  1. I had experimented with 32 ptsize for Traditional Arabic in one run. I am using the default, which is 12 pt, I think.

  2. Don't/ Never set at all the letter extenders (Shift+j or Shift+ت) as a sole letter,

It is possible that I copied some text from wikipedia which is incorrect. Please look at the training_text file and let me know which lines should be deleted.

  1. i was merging the letter extender with the Arabic letter into one single box, and putting that Arabic letters as the character of the box, basically, i was trying to train the engine to recognize that Arabic letter in it's multiple positions, as you know the Arabic letters have multiple forms based which is based on it's position in the word ( beginning, middle, ending, isolated )

Please share your training text and I can give it a try.

Shreeshrii commented 7 years ago

Original problem (core dumped): this seems to happen when an --eval_listfile is given. Related issues:
https://github.com/tesseract-ocr/tesseract/issues/644 (eval not run)
https://github.com/tesseract-ocr/tesseract/issues/561 (core dumped)

Arabic related issues: See new issue filed by @christophered
https://github.com/tesseract-ocr/tesseract/issues/648 (arabic reversal)

Closing this issue.

amitdo commented 7 years ago

Wrong encoding & Arabic language support by the text editor: the Arabic language txt should be encoded in UTF-8 or any other encoding that supports it.

The langdata text files for all languages are saved using UTF-8 encoding.
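
A quick way to check a training text against this is sketched below in Python. It is a minimal illustration, not a Tesseract tool: `describe_encoding` is a hypothetical helper that reports whether a byte string is valid UTF-8 and whether it starts with a byte order mark (the "UTF-8-BOM" case mentioned earlier in the thread).

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file contents as UTF-8 (with or without BOM) or not UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    return "UTF-8 without BOM"


arabic = "\u0628\u0650".encode("utf-8")  # beh + kasra
print(describe_encoding(arabic))
print(describe_encoding(codecs.BOM_UTF8 + arabic))
print(describe_encoding(b"\xff\xfe"))
```

To check a file, pass it the result of reading the file in binary mode, for example `open(path, "rb").read()`, where `path` points at the training text.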

imohammadhossein commented 5 years ago

I am trying to train or fine-tune Tesseract on my own dataset for the Farsi language. Can anyone please help me through this?