Open Shreeshrii opened 5 years ago
Which image format would you prefer? PNG? TIFF?
I think that the box file information is implicitly there, because box files for LSTM only need line texts and the bounding box for each line (so you won't get character boxes which might have existed in the initial box files).
lstmf files can be made from multi-page tifs also, so I would say tif as the output format.
Related request, I can open another issue, if you prefer...
Enhance combine_tessdata to also output info from the lstm files - it will be useful to know the network spec and whether the lstm model is compressed or not (integer vs float). This could then be interrogated for any start models given for training so that an appropriate error can be reported.
Ray had sent info via email on tessdata_fast models - see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification-for-tessdata_fast and https://github.com/tesseract-ocr/tesseract/issues/1404#issuecomment-374680492
The network configuration is stored in the lstm data in the traineddata. With a small change to combine_tessdata, I produced the attached.
I would love to work on creating a new utility or expanding combine_tessdata if no one has taken this up yet.
@stweil Any progress on this? Even a single-line png version without box information would be useful. I am especially interested in checking how the RTL text is stored in it.
It's less a technical problem than a question of where to add it in a user friendly way. Technically the unpack feature would fit well into the tesseract executable because the relevant code parts are already there. But how can we extend the command line syntax for that program (which is already a mess today)? One possibility would be a syntax similar to the one used by git and others:
tesseract [<command>] ...
So tesseract can be followed by a command (for example recognize, lstmf info or lstmf unpack). That command is optional for backward compatibility.
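A minimal sketch of that idea in Python, only to illustrate the optional leading command. The command set and the function name `split_command` are hypothetical and are not part of the tesseract executable:

```python
# Hypothetical command names; the thread only proposes them as examples.
COMMANDS = {"recognize", "unpack", "info"}
DEFAULT_COMMAND = "recognize"


def split_command(argv):
    """Split an argument list into (command, remaining arguments).

    If the first argument is a known command, it is consumed. Otherwise the
    whole list is passed on unchanged to the default command, so existing
    invocations keep working.
    """
    if argv and argv[0] in COMMANDS:
        return argv[0], list(argv[1:])
    return DEFAULT_COMMAND, list(argv)


if __name__ == "__main__":
    print(split_command(["unpack", "a.lstmf", "b.lstmf"]))
    print(split_command(["image.png", "-", "-l", "ara"]))
```

One consequence of this design is that an input file literally named like a command (for example a file called unpack) becomes ambiguous, so a real implementation would need a rule for that case.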
@AyushP123, have you already started working on a utility?
I came across https://github.com/OpenArabic/OCR_GS_Data (used for Kraken Arabic models) and wanted to test training with it. But it looks like the wordstrbox is not in the right format, so I wanted to check. (The files created using the python script and using tesseract have the text in different order.)
cat /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt
الحسن: مصر عمر سبعة أمصار: المدينة، والبحرين، والبصرة، والكوفة،
python script - gt text copied as is
cat /home/ubuntu/OCR_GS_Data/ara/ground-truth/book_IbnFaqihHamadhani.Buldan_7_final_a_000004.box
WordStr 0 0 2977 170 0 #الحسن: مصر عمر سبعة أمصار: المدينة، والبحرين، والبصرة، والكوفة،
0 0 2977 170 0
tesseract - order of text is reversed in wordstrbox
OMP_THREAD_LIMIT=1 tesseract /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.png - -l ara --tessdata-dir ~/tessdata_best wordstrbox
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 889
WordStr 1 39 2976 169 0 #.ةفوكلاو »ةرصبلاو »نيرحبلاو »ةنيدملا :راصمأ ةعبس رمع رِّصم :نسحلا
2977 39 2981 169 0
tesseract - order of recognized text same as gt text
OMP_THREAD_LIMIT=1 tesseract /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.png - -l ara --tessdata-dir ~/tessdata_best
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 889
الحسن: مصِّر عمر سبعة أمصار: المدينة» والبحرين» والبصرة» والكوفة.
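For reference, a small sketch of how a script could write the two-line WordStr box shown above for a single line image. It assumes the box covers the whole image, that the second entry starts with a tab (the end-of-line marker), and that it repeats the same coordinates as in the script-generated example. The function name `wordstr_box` is made up for this sketch:

```python
def wordstr_box(text, width, height, page=0):
    """Return WordStr box file content for one line image.

    Assumptions (taken from the script-generated example in this thread):
    the box spans the full image, and the second entry is a tab followed
    by the same coordinates.
    """
    first = f"WordStr 0 0 {width} {height} {page} #{text}"
    second = f"\t 0 0 {width} {height} {page}"
    return first + "\n" + second + "\n"


if __name__ == "__main__":
    # The text is copied as-is from the .gt.txt file (logical order).
    print(wordstr_box("ground truth text", 2977, 170), end="")
```

Note that this copies the ground truth text unchanged, which is exactly why the result differs from the reversed text that tesseract itself writes with the wordstrbox config.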
That looks interesting. https://github.com/OpenITI/ seems to include OCR_GS_Data and could be used to improve the Arabic model (or create a new one) for Tesseract.
A comparison of Tesseract with the published results from Kraken would also be interesting.
> A comparison of Tesseract with the published results from Kraken would also be interesting.
My first trial shows Tesseract with worse results compared to Kraken. However, the accuracy is getting pulled down by a subset of lines that are being recognized as multiple lines.
Example:
gt = البصرة والكوفة، وقد تفعل العرب هذا فتسمي الاثنين باسم الجميع، وقال
tess = البصرة والكوفة لكوفة، وقذ ره تفعل العرب هذا فتسم فتسمي الأاثنين با مه ه ١ سم لجميع؛ وقال
Works with --psm 7 or --psm 13. I will rerun the reports.
$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 6 -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة
لكوفة,؛ وقد ن
تفعل العربة هذا فتسمٌ
فتسمّي الأثنين با
م ٠ '
سم لجميع؛ وقال
$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 7 -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة» وقد تفعل العربة هذا فتسمّي الأثنين باسم الجميعم؛ وقال
$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 13 -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة» وقد تفعل العربة هذا فتسمّي الأثنين باسم الجميعم؛ وقال
Buldan | Character class | Kraken Count | Kraken Missed | Kraken %Right | Tesseract Count | Tesseract Missed | Tesseract %Right |
---|---|---|---|---|---|---|---|
7_final_a | ASCII Spacing Characters | 8188 | 45 | 99.45 | 8188 | 16 | 99.80 |
7_final_a | ASCII Special Symbols | 1033 | 190 | 81.61 | 1033 | 54 | 94.77 |
7_final_a | ASCII Digits | 203 | 26 | 87.19 | 203 | 203 | 0.00 |
7_final_a | Latin1 Spacing Characters | 40 | 40 | 0.00 | 40 | 40 | 0.00 |
7_final_a | Latin1 Special Symbols | 7 | 7 | 0.00 | 7 | 0 | 100.00 |
7_final_a | Basic Arabic | 39179 | 124 | 99.68 | 39179 | 168 | 99.57 |
7_final_a | Arabic Extended | 1675 | 92 | 94.51 | - | - | - |
7_final_a | Total | 50325 | 524 | 98.96 | 48650 | 481 | 99.01 |
7_final_a_200 | ASCII Spacing Characters | 6865 | 44 | 99.36 | 6865 | 13 | 99.81 |
7_final_a_200 | ASCII Special Symbols | 841 | 159 | 81.09 | 841 | 41 | 95.12 |
7_final_a_200 | ASCII Digits | 158 | 42 | 73.42 | 158 | 158 | 0.00 |
7_final_a_200 | Latin1 Spacing Characters | 33 | 33 | 0.00 | 33 | 33 | 0.00 |
7_final_a_200 | Latin1 Special Symbols | 6 | 6 | 0.00 | 6 | 0 | 100.00 |
7_final_a_200 | Basic Arabic | 33142 | 104 | 99.69 | 33142 | 148 | 99.55 |
7_final_a_200 | Arabic Extended | 1385 | 94 | 93.21 | - | - | - |
7_final_a_200 | Total | 42430 | 482 | 98.86 | 41045 | 393 | 99.04 |
7_final_b | ASCII Spacing Characters | 8409 | 44 | 99.48 | 8409 | 11 | 99.87 |
7_final_b | ASCII Special Symbols | 702 | 132 | 81.20 | 702 | 33 | 95.30 |
7_final_b | ASCII Digits | 103 | 13 | 87.38 | 103 | 103 | 0.00 |
7_final_b | Latin1 Spacing Characters | 13 | 13 | 0.00 | 13 | 13 | 0.00 |
7_final_b | Latin1 Special Symbols | 8 | 8 | 0.00 | 8 | 0 | 100.00 |
7_final_b | Basic Arabic | 39509 | 121 | 99.69 | 39509 | 1471 | 96.28 |
7_final_b | Arabic Extended | 1623 | 84 | 94.82 | 1330 | 1330 | 0.00 |
7_final_b | Total | 50367 | 415 | 99.18 | 50074 | 2961 | 94.09 |
7_final_b_200 | ASCII Spacing Characters | 8409 | 43 | 99.49 | 8409 | 11 | 99.87 |
7_final_b_200 | ASCII Special Symbols | 702 | 142 | 79.77 | 702 | 33 | 95.30 |
7_final_b_200 | ASCII Digits | 103 | 16 | 84.47 | 103 | 103 | 0.00 |
7_final_b_200 | Latin1 Spacing Characters | 13 | 13 | 0.00 | 13 | 13 | 0.00 |
7_final_b_200 | Latin1 Special Symbols | 8 | 8 | 0.00 | 8 | 0 | 100.00 |
7_final_b_200 | Basic Arabic | 39509 | 131 | 99.67 | 39509 | 1471 | 96.28 |
7_final_b_200 | Arabic Extended | 1623 | 110 | 93.22 | 1330 | 1330 | 0.00 |
7_final_b_200 | Total | 50367 | 463 | 99.08 | 50074 | 2961 | 94.09 |
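As a sanity check when reading these reports, the %Right column appears to be (Count - Missed) / Count expressed as a percentage; that is an assumption, but it reproduces the figures above. A short sketch (the function name `percent_right` is only illustrative):

```python
def percent_right(count, missed):
    """Accuracy in percent, assuming %Right = (count - missed) / count * 100."""
    if count == 0:
        raise ValueError("count must be positive")
    return round((count - missed) / count * 100, 2)


if __name__ == "__main__":
    # Rows taken from the 7_final_a reports above: (count, missed).
    for count, missed in [(8188, 45), (1033, 190), (50325, 524), (48650, 481)]:
        print(count, missed, f"{percent_right(count, missed):.2f}")
```

Because the Total row is weighted by character count, a class that is missed completely (such as the digits) can still leave the total above 99% when that class is small.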
So Tesseract missed all digits and Arabic extended?
In their ground truth files they have used the digits 0-9 for the Arabic script digits. I substituted those so that the image and text would match. Example:
حدثنا بشر بن محمد بن أبان عن داود بن المخير عن الصلت [89 أ] بن دينار عن
I haven't looked at Arabic extended to see what characters are there.
These results are based on using the finetuned traineddata on the training set (similar to what they have done in their research). So, it could be overfitted. I haven't tried it with other 'unseen' books for testing yet.
> 203 203 0.00 ASCII Digits
It is possible that my substitution also changed some 0-9 digits that were 0-9 in the images.
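The kind of substitution described here can be sketched as follows. This assumes the target digits are the Arabic-Indic digits U+0660 to U+0669 (the ones that appear in the recognized output in this thread); the function name `to_arabic_indic_digits` is made up for the sketch:

```python
# Map ASCII '0'..'9' to Arabic-Indic digits U+0660..U+0669.
ASCII_TO_ARABIC_INDIC = {ord("0") + d: chr(0x0660 + d) for d in range(10)}


def to_arabic_indic_digits(text):
    """Replace every ASCII digit in text with its Arabic-Indic counterpart."""
    return text.translate(ASCII_TO_ARABIC_INDIC)


if __name__ == "__main__":
    print(to_arabic_indic_digits("[89]"))
```

A blanket replacement like this cannot tell which digits are really printed as 0-9 in the image, which is the risk mentioned above; only a per-line check against the image could resolve that.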
> So Tesseract missed all digits and Arabic extended?
A grep through the per-image accuracy reports shows that Arabic Extended refers to digits in Arabic script. They are not detected correctly with the default psm, but they are recognized with --psm 13.
Count Missed %Right
3 0 100.00 Arabic Extended
3 0 100.00 Total
Errors Marked Correct-Generated
1 0 {}-{<\n>}
Count Missed %Right
1 0 100.00 {٠}
1 0 100.00 {٤}
1 0 100.00 {٦}
$ tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
Empty page!!
Estimating resolution as 878
Empty page!!
$ tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 3
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
Empty page!!
Estimating resolution as 878
Empty page!!
$ tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 4
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
٤٠٦
$ tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦
$ tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 13
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦
$ tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 7
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦
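To repeat this kind of comparison over many line images, the runs above could be scripted roughly like this. It is a sketch that assumes the tesseract executable is on the PATH; the helper names `build_command` and `recognize_with_modes` are invented here, while the options (-l, --tessdata-dir, --psm) are the ones used in the commands above:

```python
import subprocess


def build_command(image, lang, tessdata_dir, psm=None):
    """Build the tesseract command line; psm=None keeps the default mode."""
    cmd = ["tesseract", image, "-", "-l", lang, "--tessdata-dir", tessdata_dir]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def recognize_with_modes(image, lang, tessdata_dir, modes=(None, 3, 4, 6, 7, 13)):
    """Run tesseract once per page segmentation mode and collect the text."""
    results = {}
    for psm in modes:
        proc = subprocess.run(
            build_command(image, lang, tessdata_dir, psm),
            capture_output=True,
            encoding="utf-8",
            check=False,  # an empty page is reported on stderr, not as an exception
        )
        results[psm] = proc.stdout.strip()
    return results
```

Comparing the returned strings per image makes it easy to spot the lines where only some modes give a result.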
A first implementation is available in my unpack branch. It also introduces a new command line syntax. Usage:
tesseract unpack [LSTMF_FILE ...]
This writes two files (unpacked.gt.txt and unpacked.png, overwritten for each lstmf file, so currently not very useful) and shows the transcription and the first box information for each lstmf file.
Only simple lstmf files are currently handled. Do you have a more complex example (multi page tiff)?
Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff. Do you also need the corresponding images and transcription?
@stweil The build is failing. https://travis-ci.org/stweil/tesseract/jobs/631055250#L552
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I../.. -O2 -DNDEBUG -I../../include -I./include -I../../src/arch -I../../src/ccmain -I../../src/ccstruct -I../../src/ccutil -I../../src/classify -I../../src/cutil -I../../src/dict -I../../src/lstm -I../../src/opencl -I../../src/textord -I../../src/training -I../../src/viewer -I../../src/wordrec -I/usr/include/leptonica -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/libtesseract_la-hocrrenderer.lo -MD -MP -MF src/api/.deps/libtesseract_la-hocrrenderer.Tpo -c ../../src/api/hocrrenderer.cpp -o src/api/libtesseract_la-hocrrenderer.o
../../src/api/tesseractmain.cpp:342:10: fatal error: filesystem: No such file or directory
#include <filesystem>
^~~~~~~~~~~~
compilation terminated.
Makefile:4011: recipe for target 'src/api/tesseract-tesseractmain.o' failed
make[2]: *** [src/api/tesseract-tesseractmain.o] Error 1
make[2]: *** Waiting for unfinished jobs....
mv -f src/api/.deps/libtesseract_la-lstmboxrenderer.Tpo src/api/.deps/libtesseract_la-lstmboxrenderer.Plo
mv -f src/api/.deps/libtesseract_la-hocrrenderer.Tpo src/api/.deps/libtesseract_la-hocrrenderer.Plo
mv -f src/api/.deps/libtesseract_la-baseapi.Tpo src/api/.deps/libtesseract_la-baseapi.Plo
make[2]: Leaving directory '/home/ubuntu/unpack/bin/master'
Makefile:4119: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/ubuntu/unpack/bin/master'
Makefile:1354: recipe for target 'all' failed
make: *** [all] Error 2
It requires C++17. Try to build without that include statement.
> Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff.
The latest test code should fix the build problem with C++ standards before C++17 and add support for lstmf files with more than one text line. It handles all files in the sample, but I'm not sure that the result is as expected. The results are now written to individual files.
Still not able to build with -std=c++17
Making all in .
g++ -DHAVE_CONFIG_H -I. -I../.. -I../../src/arch -I../../src/ccstruct -I../../src/ccutil -I../../src/dict -I../../src/viewer -O2 -DNDEBUG -I../../include -I./include -I/usr/include/leptonica -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/tesseract-tesseractmain.o -MD -MP -MF src/api/.deps/tesseract-tesseractmain.Tpo -c -o src/api/tesseract-tesseractmain.o `test -f 'src/api/tesseractmain.cpp' || echo '../../'`src/api/tesseractmain.cpp
make[2]: Entering directory '/home/ubuntu/tesseract/bin/power8'
g++ -DHAVE_CONFIG_H -I. -I../.. -I../../src/arch -I../../src/ccstruct -I../../src/ccutil -I../../src/dict -I../../src/viewer -O2 -DNDEBUG -I../../include -I./include -I/usr/include/leptonica -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/tesseract-tesseractmain.o -MD -MP -MF src/api/.deps/tesseract-tesseractmain.Tpo -c -o src/api/tesseract-tesseractmain.o `test -f 'src/api/tesseractmain.cpp' || echo '../../'`src/api/tesseractmain.cpp
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp:408:10: error: ‘access’ was not declared in this scope
return access(filename, 0) == 0;
^~~~~~
../../src/api/tesseractmain.cpp:408:10: note: suggested alternative: ‘acosl’
return access(filename, 0) == 0;
^~~~~~
acosl
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp:408:10: error: ‘access’ was not declared in this scope
return access(filename, 0) == 0;
^~~~~~
../../src/api/tesseractmain.cpp:408:10: note: suggested alternative: ‘acosl’
return access(filename, 0) == 0;
^~~~~~
acosl
Makefile:4013: recipe for target 'src/api/tesseract-tesseractmain.o' failed
make[1]: *** [src/api/tesseract-tesseractmain.o] Error 1
make[1]: Leaving directory '/home/ubuntu/tesseract/bin/power8'
Makefile:4121: recipe for target 'check-recursive' failed
make: *** [check-recursive] Error 1
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/7/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=powerpc64le-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-objc-gc=auto --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
That's fixed now. An include statement was missing.
@stweil Thank you for adding this functionality. It will be a very useful feature.
I tested with a multi-page tiff generated by text2image for langdata/eng/eng.training_text. tesseract unpack creates single-line images and their ground truth transcriptions.
While all lines in .tif are converted to .png line images, the order of images does not match the order of text lines in original training_text, tif and box. For example, line 50 in the text file was image number 71.
The line image numbering is continuous for the lstmf file, there is no indication of page numbering of original tif.
line level box files are not generated.
I tested for Devanagari script, using both text2image generated box files and also using the wordstrbox files with a single text line as input. In both cases the output .png and .gt.txt were created correctly.
For RTL, I tested with Arabic text. While the generated png and original tif match, the training text and generated gt.txt files do not match. I have not tested with Hebrew text yet.
training text
אחרי אחת אבל מידע כמה במסגרת נולד לו של למרות ב' ב־4
generated gt.txt
4־ב 'ב תורמל לש ול דלונ תרגסמב המכ עדימ לבא תחא ירחא
The reversal for RTL languages is probably to be expected since order of text is reversed in the box files. I will test further regarding the reversal in case of digits, punctuation etc.
> the order of images does not match the order of text lines in original training_text
That's correct. Tesseract shuffles the original data, see https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/linerec.cpp#L69. As it uses a pseudo random sequence which is derived from the document name, it might be possible to reverse that shuffling.
> line level box files are not generated.
The internal data has character level boxes. Writing that data to box files must still be implemented.
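To illustrate why the shuffling mentioned above should be reversible in principle: any shuffle driven by a pseudo random sequence that depends only on a name can be replayed to recover the original order. The sketch below uses Python's random module seeded from a hash of the name. This is not Tesseract's actual generator or shuffle algorithm, and all names in it are made up:

```python
import hashlib
import random


def _order_for(name, count):
    """Deterministic permutation of range(count) derived only from name."""
    seed = int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest()[:8], "big")
    order = list(range(count))
    random.Random(seed).shuffle(order)
    return order


def shuffle_lines(lines, name):
    """Shuffle lines with a permutation that depends only on name."""
    return [lines[i] for i in _order_for(name, len(lines))]


def unshuffle_lines(shuffled, name):
    """Undo shuffle_lines by replaying the same permutation."""
    order = _order_for(name, len(shuffled))
    original = [None] * len(shuffled)
    for position, source_index in enumerate(order):
        original[source_index] = shuffled[position]
    return original
```

Doing the same for real lstmf files would require replaying exactly the sequence that Tesseract generates from the document name, which this sketch does not attempt.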
> The reversal for RTL languages is probably to be expected since order of text is reversed in the box files.
Well, thinking further about it, reversed text is ok if generating box files via tesseract unpack. But if generating ground truth, then the text should be in the same order as the original text / recognized text.
original training_text
غير الموقع أن مركز برامج حتى الرمزية من يكون 24 - يوم
wordstrbox
WordStr 0 0 905 82 0 #موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ
	 0 0 905 82 0
text generated by tesseract unpack (matches BOX not GT)
موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ
Also tested with jpn and chi_sim - one line of training_text. Works as expected.
@stweil This is a useful feature. I would suggest applying it to tesseract repo, with a comment about RTL as a TODO.
I am still looking for a way to output RTL text, either with the ICU library or with existing Tesseract functions.
pybidi (part of the Python package python-bidi) can be used to fix the generated GT text.
Pybidi is what is used in the PR in tesstrain.
An earlier poster regarding RTL training had suggested fribidi.
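Until one of those libraries is wired in, the simple cases shown in this thread (RTL letters plus a few numbers) can be handled with a naive stand-in. It reverses the whole line and then restores the digit runs, which read left to right. This is not the Unicode bidirectional algorithm, it will get embedded Latin words, brackets and combining marks wrong, and the function name `visual_to_logical` is invented for this sketch:

```python
import re

# Digit runs (optionally with '.' or ',' separators) keep their LTR order.
_NUMBER_RUN = re.compile(r"\d+(?:[.,]\d+)*")


def visual_to_logical(line):
    """Naively convert a reversed (visual order) RTL line to logical order.

    Step 1: reverse all characters.
    Step 2: reverse each number again so that e.g. 24 does not become 42.
    """
    reversed_line = line[::-1]
    return _NUMBER_RUN.sub(lambda match: match.group(0)[::-1], reversed_line)


if __name__ == "__main__":
    # Two Hebrew words around a number, stored in visual order.
    print(visual_to_logical("\u05d3\u05d2 24 \u05d1\u05d0"))
```

For anything beyond such simple lines, a real bidi implementation such as python-bidi or fribidi remains the better choice.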
@stweil I am using this feature to convert lstmf files from text2image generated box and multi page tiffs to single line png images and gt.txt files that can be used with tesstrain.
Please consider committing this to the master branch.
@stweil Any plans to merge your changes? I am not able to use the commits from your repo because of merge conflicts.
I rebased https://github.com/stweil/tesseract/tree/unpack now, so it should be possible to use it again.
The branch not only adds the unpack command but also the info command. Before that gets merged, I'd like to improve the code further.
Thanks, @stweil.
I will add another unpack feature request since you are looking to improve this further.
While creating the starter traineddata (proto model) tesseract outputs a readable version of the recoder (created by the combine_lang_model command, I think). Files are named as [MODEL_NAME].charset_size=NNN.txt.
The current info produced from the traineddata file using combine_tessdata does not create this readable format. It will be good to have an option to do so.
EDIT: In some cases I have seen two entries for NULL in these files and I wonder if that is correct.
0
1 <nul>
2 <nul>
3 /
4 (
5 व
6 ि
7 श
8 ल
9 य
10 े
@stweil Any plans to include this for 5.0.0?
No, it won't be included in 5.0.0, but maybe in some later version (5.1.0?).
lstmf files contain the image information, ground truth text (and box file information?). https://github.com/tesseract-ocr/tesseract/blob/c40159aa740fcc50e25caf5fd2434a7e900b5ccd/src/ccstruct/imagedata.h#L196-L204
It will be useful to have a utility (similar to combine_tessdata used for traineddata files) which can be used to extract all the components of lstmf files.
This will be useful for verifying the correctness of the files as well as remove the necessity to save original input files (tif/groundtruth/box).