tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Feature Request: Utility to unpack lstmf files #2669

Open Shreeshrii opened 5 years ago

Shreeshrii commented 5 years ago

lstmf files contain the image information, ground truth text (and box file information?). https://github.com/tesseract-ocr/tesseract/blob/c40159aa740fcc50e25caf5fd2434a7e900b5ccd/src/ccstruct/imagedata.h#L196-L204

It will be useful to have a utility (similar to combine_tessdata used for traineddata files) which can be used to extract all the components of lstmf files.

This will be useful for verifying the correctness of the files, and it would also remove the need to keep the original input files (tif/groundtruth/box).

stweil commented 4 years ago

Which image format would you prefer? PNG? TIFF?

stweil commented 4 years ago

I think that the box file information is implicitly there, because box files for LSTM only need line texts and the bounding box for each line (so you won't get character boxes which might have existed in the initial box files).

Shreeshrii commented 4 years ago

lstmf files can be made from multi-page tifs also, so I would say tif as the output format.

Shreeshrii commented 4 years ago

Related request, I can open another issue, if you prefer...

Enhance combine_tessdata to also output info from the lstm files - it will be useful to know the network spec and whether the lstm model is compressed or not (integer vs float). This could then be interrogated for any start models given for training so that an appropriate error can be reported.

Ray had sent info via email on tessdata_fast models - see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification-for-tessdata_fast and https://github.com/tesseract-ocr/tesseract/issues/1404#issuecomment-374680492

The network configuration is stored in the lstm data in the traineddata. With a small change to combine_tessdata, I produced the attached.

AyushP123 commented 4 years ago

I would love to work on creating a new utility or expanding combine_tessdata if no one has taken this up yet.

Shreeshrii commented 4 years ago

@stweil Any progress on this? Even a single-line png version without box information would be useful. I am especially interested in checking how the RTL text is being stored in it.

stweil commented 4 years ago

It's less a technical problem but a question where to add it in a user friendly way. Technically the unpack feature would fit well into the tesseract executable because the relevant code parts are already there. But how can we extend the command line syntax for that program (which is already a mess today)? One possibility would be a syntax similar to the one used by git and others:

tesseract [<command>] ...

So tesseract can be followed by a command (for example recognize, lstmf info or lstmf unpack). That command is optional for backward compatibility.
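A minimal sketch of how such an optional leading command could be separated from the remaining arguments while keeping old invocations working. This is Python for illustration only, not code from the Tesseract sources; the command names in `KNOWN_COMMANDS` and the function `split_command` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical command names, for illustration only.
KNOWN_COMMANDS = {"recognize", "info", "unpack"}


def split_command(argv: List[str], default: str = "recognize") -> Tuple[str, List[str]]:
    """Return (command, remaining_args).

    If the first argument is not a known command, the default command is
    assumed, so a legacy call like `tesseract image.png out` still works.
    """
    if argv and argv[0] in KNOWN_COMMANDS:
        return argv[0], argv[1:]
    return default, list(argv)


print(split_command(["unpack", "a.lstmf", "b.lstmf"]))
print(split_command(["image.png", "-", "-l", "ara"]))
```

One trade-off of this scheme: an input file whose name is exactly one of the command words becomes ambiguous and would have to be given with a path prefix such as `./unpack`.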

@AyushP123, have you already started working on a utility?

Shreeshrii commented 4 years ago

I came across https://github.com/OpenArabic/OCR_GS_Data (used for Kraken Arabic models) and wanted to test training with it. But it looks like the wordstrbox is not in the right format, so I wanted to check. (The text in the file created by the python script and in the one created by tesseract is in a different order.)

cat /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt
الحسن: مصر عمر سبعة أمصار: المدينة، والبحرين، والبصرة، والكوفة،

python script - gt text copied as is

cat /home/ubuntu/OCR_GS_Data/ara/ground-truth/book_IbnFaqihHamadhani.Buldan_7_final_a_000004.box
WordStr 0 0 2977 170 0 #الحسن: مصر عمر سبعة أمصار: المدينة، والبحرين، والبصرة، والكوفة،
         0 0 2977 170 0

tesseract - order of text is reversed in wordstrbox

OMP_THREAD_LIMIT=1 tesseract /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.png - -l ara --tessdata-dir ~/tessdata_best wordstrbox
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 889
WordStr 1 39 2976 169 0 #.ةفوكلاو »ةرصبلاو »نيرحبلاو »ةنيدملا :راصمأ ةعبس رمع رِّصم :نسحلا
         2977 39 2981 169 0

tesseract - order of recognized text same as gt text

OMP_THREAD_LIMIT=1 tesseract /home/ubuntu/OCR_GS_Data/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.png - -l ara --tessdata-dir ~/tessdata_best
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 889
الحسن: مصِّر عمر سبعة أمصار: المدينة» والبحرين» والبصرة» والكوفة.
stweil commented 4 years ago

That looks interesting. https://github.com/OpenITI/ seems to include OCR_GS_Data and could be used to improve the Arabic model (or create a new one) for Tesseract.

A comparison of Tesseract with the published results from Kraken would also be interesting.

Shreeshrii commented 4 years ago

> A comparison of Tesseract with the published results from Kraken would also be interesting.

My first trial shows Tesseract with worse results compared to Kraken. However, the accuracy is getting pulled down by a subset of lines that are being recognized as multiple lines.

example.

a_000003

gt = البصرة والكوفة، وقد تفعل العرب هذا فتسمي الاثنين باسم الجميع، وقال

tess = البصرة والكوفة لكوفة، وقذ ره تفعل العرب هذا فتسم فتسمي الأاثنين با مه ه ‎١‏ ‏سم لجميع؛ وقال

Shreeshrii commented 4 years ago

Works with --psm 7 or --psm 13. I will rerun the reports.

$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 6  -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة
لكوفة,؛ وقد ن
تفعل العربة هذا فتسمٌ
فتسمّي الأثنين با
م ‎٠‏ '
سم لجميع؛ وقال

$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 7  -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة» وقد تفعل العربة هذا فتسمّي الأثنين باسم الجميعم؛ وقال

$ tesseract ara-err.png - -l ara --tessdata-dir ~/tessdata_best --psm 13  -c page_separator=''
Warning: Invalid resolution 0 dpi. Using 70 instead.
البصرة والكوفة» وقد تفعل العربة هذا فتسمّي الأثنين باسم الجميعم؛ وقال
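To rerun the same image under several page segmentation modes, the command lines above can be generated programmatically. The sketch below only builds the argument lists and uses only the options that appear in the commands above (`-l`, `--tessdata-dir`, `--psm`, `-c page_separator=`); the helper name `tesseract_argv` is hypothetical.

```python
from typing import List, Optional


def tesseract_argv(image: str, lang: str, psm: int,
                   tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract command line that writes the text to stdout ('-')."""
    argv = ["tesseract", image, "-", "-l", lang]
    if tessdata_dir is not None:
        argv += ["--tessdata-dir", tessdata_dir]
    # An empty page_separator, as in the shell commands above.
    argv += ["--psm", str(psm), "-c", "page_separator="]
    return argv


for psm in (6, 7, 13):
    print(" ".join(tesseract_argv("ara-err.png", "ara", psm, "~/tessdata_best")))
```

Each list could be passed to `subprocess.run(argv, capture_output=True, text=True)` to collect the output per mode; note that `~` is not expanded outside a shell, so a real path would be needed there.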
Shreeshrii commented 4 years ago

Buldan: Kraken vs. Tesseract (for each data set, the first report is Kraken, the second is Tesseract)

7_final_a, Kraken

   Count   Missed   %Right
    8188       45    99.45   ASCII Spacing Characters
    1033      190    81.61   ASCII Special Symbols
     203       26    87.19   ASCII Digits
      40       40     0.00   Latin1 Spacing Characters
       7        7     0.00   Latin1 Special Symbols
   39179      124    99.68   Basic Arabic
    1675       92    94.51   Arabic Extended
   50325      524    98.96   Total

7_final_a, Tesseract

   Count   Missed   %Right
    8188       16    99.80   ASCII Spacing Characters
    1033       54    94.77   ASCII Special Symbols
     203      203     0.00   ASCII Digits
      40       40     0.00   Latin1 Spacing Characters
       7        0   100.00   Latin1 Special Symbols
   39179      168    99.57   Basic Arabic
   48650      481    99.01   Total

7_final_a_200, Kraken

   Count   Missed   %Right
    6865       44    99.36   ASCII Spacing Characters
     841      159    81.09   ASCII Special Symbols
     158       42    73.42   ASCII Digits
      33       33     0.00   Latin1 Spacing Characters
       6        6     0.00   Latin1 Special Symbols
   33142      104    99.69   Basic Arabic
    1385       94    93.21   Arabic Extended
   42430      482    98.86   Total

7_final_a_200, Tesseract

   Count   Missed   %Right
    6865       13    99.81   ASCII Spacing Characters
     841       41    95.12   ASCII Special Symbols
     158      158     0.00   ASCII Digits
      33       33     0.00   Latin1 Spacing Characters
       6        0   100.00   Latin1 Special Symbols
   33142      148    99.55   Basic Arabic
   41045      393    99.04   Total

7_final_b, Kraken

   Count   Missed   %Right
    8409       44    99.48   ASCII Spacing Characters
     702      132    81.20   ASCII Special Symbols
     103       13    87.38   ASCII Digits
      13       13     0.00   Latin1 Spacing Characters
       8        8     0.00   Latin1 Special Symbols
   39509      121    99.69   Basic Arabic
    1623       84    94.82   Arabic Extended
   50367      415    99.18   Total

7_final_b, Tesseract

   Count   Missed   %Right
    8409       11    99.87   ASCII Spacing Characters
     702       33    95.30   ASCII Special Symbols
     103      103     0.00   ASCII Digits
      13       13     0.00   Latin1 Spacing Characters
       8        0   100.00   Latin1 Special Symbols
   39509     1471    96.28   Basic Arabic
    1330     1330     0.00   Arabic Extended
   50074     2961    94.09   Total

7_final_b_200, Kraken

   Count   Missed   %Right
    8409       43    99.49   ASCII Spacing Characters
     702      142    79.77   ASCII Special Symbols
     103       16    84.47   ASCII Digits
      13       13     0.00   Latin1 Spacing Characters
       8        8     0.00   Latin1 Special Symbols
   39509      131    99.67   Basic Arabic
    1623      110    93.22   Arabic Extended
   50367      463    99.08   Total

7_final_b_200, Tesseract

   Count   Missed   %Right
    8409       11    99.87   ASCII Spacing Characters
     702       33    95.30   ASCII Special Symbols
     103      103     0.00   ASCII Digits
      13       13     0.00   Latin1 Spacing Characters
       8        0   100.00   Latin1 Special Symbols
   39509     1471    96.28   Basic Arabic
    1330     1330     0.00   Arabic Extended
   50074     2961    94.09   Total

stweil commented 4 years ago

So Tesseract missed all digits and Arabic extended?
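For reading the reports above: the %Right column is consistent with (Count - Missed) / Count expressed as a percentage and rounded to two decimals. This is inferred from the numbers, not taken from the accuracy tool's source; the function name `percent_right` is hypothetical.

```python
def percent_right(count: int, missed: int) -> float:
    """Share of correctly recognized characters, in percent, rounded to 2 decimals."""
    if count <= 0:
        raise ValueError("count must be positive")
    return round(100.0 * (count - missed) / count, 2)


# Rows taken from the 7_final_a reports above.
print(percent_right(8188, 45))    # ASCII Spacing Characters, first report
print(percent_right(203, 203))    # ASCII Digits, second report
print(percent_right(50325, 524))  # Total, first report
```

A class with Count equal to Missed therefore shows 0.00 even if the class is small, which is why the "ASCII Digits" and "Arabic Extended" rows stand out.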

Shreeshrii commented 4 years ago

In their groundtruth files they have used the digits 0-9 for the Arabic script digits. I substituted those so that the image and text would match. Examples:

000120.png حدثنا بشر بن محمد بن أبان عن داود بن المخير عن الصلت [89 أ] بن دينار عن

https://github.com/OpenArabic/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_a/a_000120.gt.txt

I haven't looked at Arabic extended to see what characters are there.

These results are based on using the finetuned traineddata on the training set (similar to what they have done in their research). So, it could be overfitted. I haven't tried it with other 'unseen' books for testing yet.

Shreeshrii commented 4 years ago

> 203 203 0.00 ASCII Digits

It is possible that my substitution also changed some 0-9 digits that were 0-9 in the images.
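A blanket substitution of this kind can be sketched with a translation table. This is an assumption about what such a substitution might look like, not the script that was actually used; it maps ASCII 0-9 to the Arabic-Indic digits U+0660 to U+0669 and, as noted above, it cannot tell which digits were really printed as 0-9 in the image.

```python
# ASCII digits 0-9 mapped to Arabic-Indic digits U+0660..U+0669.
ASCII_TO_ARABIC_INDIC = str.maketrans(
    "0123456789",
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
)


def to_arabic_indic_digits(text: str) -> str:
    """Replace every ASCII digit in `text`; all other characters are kept."""
    return text.translate(ASCII_TO_ARABIC_INDIC)


print(to_arabic_indic_digits("[89 a]"))
```

To avoid changing genuine ASCII digits, the substitution would have to be restricted per line, for example to lines whose images were checked by hand.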

Shreeshrii commented 4 years ago

> So Tesseract missed all digits and Arabic extended?

Grepping the per-image accuracy reports shows that "Arabic Extended" refers to digits in Arabic script. While they are not detected correctly with the default psm, they are recognized with --psm 13.

book_IbnFaqihHamadhani Buldan_7_final_a_200-a_000184


   Count   Missed   %Right
       3        0   100.00   Arabic Extended
       3        0   100.00   Total

  Errors   Marked   Correct-Generated
       1        0   {}-{<\n>}

   Count   Missed   %Right
       1        0   100.00   {٠}
       1        0   100.00   {٤}
       1        0   100.00   {٦}

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
Empty page!!
Estimating resolution as 878
Empty page!!

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 3
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
Empty page!!
Estimating resolution as 878
Empty page!!

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 4
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 878
٤٠٦

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 13
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦

tesseract book_IbnFaqihHamadhani.Buldan_7_final_a_200-a_000184.png - -l araKraken --tessdata-dir ./ --psm 7
Warning: Invalid resolution 0 dpi. Using 70 instead.
٤٠٦

stweil commented 4 years ago

A first implementation is available in my unpack branch. It also introduces a new command line syntax. Usage:

tesseract unpack [LSTMF_FILE ...]

This writes two files (unpacked.gt.txt and unpacked.png, overwritten for each lstmf file, so currently not very useful) and shows the transcription and the first box information for each lstmf file.

Only simple lstmf files are currently handled. Do you have a more complex example (multi page tiff)?

Shreeshrii commented 4 years ago

Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff. Do you also need the corresponding images and transcription?

lstmf-samples.zip

Shreeshrii commented 4 years ago

@stweil The build is failing. https://travis-ci.org/stweil/tesseract/jobs/631055250#L552

libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../.. -O2 -DNDEBUG -I../../include -I./include -I../../src/arch -I../../src/ccmain -I../../src/ccstruct -I../../src/ccutil -I../../src/classify -I../../src/cutil -I../../src/dict -I../../src/lstm -I../../src/opencl -I../../src/textord -I../../src/training -I../../src/viewer -I../../src/wordrec -I/usr/include/leptonica -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/libtesseract_la-hocrrenderer.lo -MD -MP -MF src/api/.deps/libtesseract_la-hocrrenderer.Tpo -c ../../src/api/hocrrenderer.cpp -o src/api/libtesseract_la-hocrrenderer.o
../../src/api/tesseractmain.cpp:342:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
Makefile:4011: recipe for target 'src/api/tesseract-tesseractmain.o' failed
make[2]: *** [src/api/tesseract-tesseractmain.o] Error 1
make[2]: *** Waiting for unfinished jobs....
mv -f src/api/.deps/libtesseract_la-lstmboxrenderer.Tpo src/api/.deps/libtesseract_la-lstmboxrenderer.Plo
mv -f src/api/.deps/libtesseract_la-hocrrenderer.Tpo src/api/.deps/libtesseract_la-hocrrenderer.Plo
mv -f src/api/.deps/libtesseract_la-baseapi.Tpo src/api/.deps/libtesseract_la-baseapi.Plo
make[2]: Leaving directory '/home/ubuntu/unpack/bin/master'
Makefile:4119: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/ubuntu/unpack/bin/master'
Makefile:1354: recipe for target 'all' failed
make: *** [all] Error 2

stweil commented 4 years ago

It requires C++17. Try to build without that include statement.

stweil commented 4 years ago

> Attached is a zip file with a sample of lstmf files in different languages, including multi-page tiff.

The latest test code should fix the build problem with C++ before C++17 and adds support for lstmf files with more than one text line. It handles all files in the sample, but I'm not sure that the result is as expected. The results are now written to individual files.

Shreeshrii commented 4 years ago

Still not able to build with -std=c++17

Making all in .
g++ -DHAVE_CONFIG_H -I. -I../..  -I../../src/arch -I../../src/ccstruct -I../../src/ccutil -I../../src/dict -I../../src/viewer -O2 -DNDEBUG -I../../include -I./include    -I/usr/include/leptonica  -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/tesseract-tesseractmain.o -MD -MP -MF src/api/.deps/tesseract-tesseractmain.Tpo -c -o src/api/tesseract-tesseractmain.o `test -f 'src/api/tesseractmain.cpp' || echo '../../'`src/api/tesseractmain.cpp
make[2]: Entering directory '/home/ubuntu/tesseract/bin/power8'
g++ -DHAVE_CONFIG_H -I. -I../..  -I../../src/arch -I../../src/ccstruct -I../../src/ccutil -I../../src/dict -I../../src/viewer -O2 -DNDEBUG -I../../include -I./include    -I/usr/include/leptonica  -O3 -maltivec -mabi=altivec -mcpu=power8 -mtune=power8 -fopenmp -std=c++17 -MT src/api/tesseract-tesseractmain.o -MD -MP -MF src/api/.deps/tesseract-tesseractmain.Tpo -c -o src/api/tesseract-tesseractmain.o `test -f 'src/api/tesseractmain.cpp' || echo '../../'`src/api/tesseractmain.cpp
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:

../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp:408:10: error: ‘access’ was not declared in this scope
   return access(filename, 0) == 0;
          ^~~~~~
../../src/api/tesseractmain.cpp:408:10: note: suggested alternative: ‘acosl’
   return access(filename, 0) == 0;
          ^~~~~~
          acosl
../../src/api/tesseractmain.cpp: In function ‘bool std::filesystem::exists(const char*)’:
../../src/api/tesseractmain.cpp:408:10: error: ‘access’ was not declared in this scope
   return access(filename, 0) == 0;
          ^~~~~~
../../src/api/tesseractmain.cpp:408:10: note: suggested alternative: ‘acosl’
   return access(filename, 0) == 0;
          ^~~~~~
          acosl
Makefile:4013: recipe for target 'src/api/tesseract-tesseractmain.o' failed
make[1]: *** [src/api/tesseract-tesseractmain.o] Error 1
make[1]: Leaving directory '/home/ubuntu/tesseract/bin/power8'
Makefile:4121: recipe for target 'check-recursive' failed
make: *** [check-recursive] Error 1
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc64le-linux-gnu/7/lto-wrapper
Target: powerpc64le-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=powerpc64le-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-objc-gc=auto --enable-secureplt --with-cpu=power8 --enable-targets=powerpcle-linux --disable-multilib --enable-multiarch --disable-werror --with-long-double-128 --enable-checking=release --build=powerpc64le-linux-gnu --host=powerpc64le-linux-gnu --target=powerpc64le-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

stweil commented 4 years ago

That's fixed now. An include statement was missing.

Shreeshrii commented 4 years ago

@stweil Thank you for adding this functionality. It will be a very useful feature.

I tested with a multi-page tiff generated by text2image for langdata/eng/eng.training_text. tesseract unpack creates single-line images and their groundtruth transcriptions.

While all lines in .tif are converted to .png line images, the order of images does not match the order of text lines in original training_text, tif and box, e.g. line 50 in the text file was image number 71.

The line image numbering is continuous for the lstmf file; there is no indication of the page numbering of the original tif.

line level box files are not generated.

engeval.zip - Multipage tif

I tested for Devanagari script, using both text2image generated box files and also using the wordstrbox files with a single text line as input. In both cases the output .png and .gt.txt were created correctly.

hineval.zip - text2image

saneval.zip - wordstrbox

For RTL, I tested with Arabic text. While the generated png and original tif match, the training text and generated gt.txt files do not match. I have not tested with Hebrew text yet.

araeval.zip - RTL

Shreeshrii commented 4 years ago

hebeval.zip

heb Arial_Unicode_MS exp0_0

training text

אחרי אחת אבל מידע כמה במסגרת נולד לו של למרות ב' ב־4

generated gt.txt

4־ב 'ב תורמל לש ול דלונ תרגסמב המכ עדימ לבא תחא ירחא
Shreeshrii commented 4 years ago

The reversal for RTL languages is probably to be expected since order of text is reversed in the box files. I will test further regarding the reversal in case of digits, punctuation etc.

stweil commented 4 years ago

> the order of images does not match the order of text lines in original training_text

That's correct. Tesseract shuffles the original data, see https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/linerec.cpp#L69. As it uses a pseudo random sequence which is derived from the document name, it might be possible to reverse that shuffling.
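The idea of undoing a shuffle that is derived only from the document name can be sketched as follows. This is an illustration of the principle, not Tesseract's actual random generator or hash: the seed derivation via SHA-256 and all function names are assumptions. The point is that the same name reproduces the same permutation, which can then be inverted.

```python
import hashlib
import random
from typing import List, Sequence


def permutation_for(name: str, n: int) -> List[int]:
    """Deterministic permutation of range(n), seeded from the document name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def shuffle(items: Sequence[str], name: str) -> List[str]:
    """Position `pos` of the result holds items[order[pos]]."""
    return [items[i] for i in permutation_for(name, len(items))]


def unshuffle(shuffled: Sequence[str], name: str) -> List[str]:
    """Invert shuffle() by regenerating the same permutation."""
    order = permutation_for(name, len(shuffled))
    original = [""] * len(shuffled)
    for pos, src in enumerate(order):
        original[src] = shuffled[pos]
    return original


lines = ["line %d" % i for i in range(5)]
mixed = shuffle(lines, "eng.Arial.exp0")
print(unshuffle(mixed, "eng.Arial.exp0") == lines)
```

Reversing the real shuffle would require reproducing Tesseract's own generator and seed exactly, and would only work if the document name stored in the lstmf file is the one that was used when the file was written.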

stweil commented 4 years ago

> line level box files are not generated.

The internal data has character level boxes. Writing that data to box files must still be implemented.

Shreeshrii commented 4 years ago

> The reversal for RTL languages is probably to be expected since order of text is reversed in the box files.

Well, thinking further about it, reversed text is ok if generating box files via tesseract unpack. But if generating ground truth, then the text should be in the same order as the original/recognized text.

Shreeshrii commented 4 years ago

kur_araeval.zip

kur_ara Amiri exp0_0

original training_text

غير الموقع أن مركز برامج حتى الرمزية من يكون 24 - يوم

wordstrbox

WordStr 0 0 905 82 0 #موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ
         0 0 905 82 0

text generated by tesseract unpack (matches BOX not GT)

موي - 24 نوكي نم ةيزمرلا ىتح جمارب زكرم نأ عقوملا ريغ

Shreeshrii commented 4 years ago

Also tested with jpn and chi_sim - one line of training_text. Works as expected.

chi_simeval.zip jpneval.zip

Shreeshrii commented 4 years ago

@stweil This is a useful feature. I would suggest applying it to tesseract repo, with a comment about RTL as a TODO.

stweil commented 4 years ago

I am still looking for a way to output RTL text, either with the ICU library or with existing Tesseract functions.

stweil commented 4 years ago

pybidi (part of the Python package python-bidi) can be used to fix the generated GT text.
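For simple lines, the effect can be approximated without any library. The sketch below is not the Unicode Bidirectional Algorithm and not python-bidi; it is a simplified stand-in that reverses the whole line and then restores runs of ASCII letters and digits to reading order, which is the pattern visible in the examples above (the letters are reversed, "24" is not). The name `flip_rtl_line` is hypothetical.

```python
import re

# A run of ASCII letters/digits, optionally joined by single separators (e.g. 12.5).
_LTR_RUN = re.compile(r"[0-9A-Za-z]+(?:[.,:/-][0-9A-Za-z]+)*")


def flip_rtl_line(text: str) -> str:
    """Convert between stored (reversed) order and reading order for one RTL line.

    Simplification: mixed-direction text, brackets and combining marks are
    not handled the way a real bidi implementation would handle them.
    """
    reversed_text = text[::-1]
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_text)


# Two Hebrew letters, the number 24, one more Hebrew letter.
stored = "\u05d0\u05d1 24 \u05d2"
print(flip_rtl_line(stored))
```

Applying the function twice returns the input, so the same helper converts in both directions. For real training data, a full bidi implementation such as python-bidi or fribidi remains the safer choice.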

Shreeshrii commented 4 years ago

Pybidi is what is used in the PR in tesstrain.

An earlier poster on RTL training had suggested fribidi.

Shreeshrii commented 4 years ago

@stweil I am using this feature to convert lstmf files from text2image generated box and multi page tiffs to single line png images and gt.txt files that can be used with tesstrain.

Please consider committing this to the master branch.
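When unpacked output is used this way, it helps to confirm that every line image has a transcription next to it. The sketch below assumes the naming used in this thread, with `NAME.png` paired with `NAME.gt.txt` in the same directory; the function name is hypothetical.

```python
from pathlib import Path
from typing import List


def unpaired_line_images(directory: str) -> List[str]:
    """Return the names of .png files that have no matching .gt.txt file."""
    root = Path(directory)
    missing = []
    for image in sorted(root.glob("*.png")):
        # "a_000004.png" is expected to come with "a_000004.gt.txt".
        transcript = image.with_name(image.stem + ".gt.txt")
        if not transcript.is_file():
            missing.append(image.name)
    return missing
```

Calling `unpaired_line_images("path/to/ground-truth")` on the output directory returns an empty list when all pairs are complete.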

Shreeshrii commented 3 years ago

@stweil Any plans to merge your changes? I am not able to use the commits from your repo because of merge conflicts.

stweil commented 3 years ago

I rebased https://github.com/stweil/tesseract/tree/unpack now, so it should be possible to use it again.

The branch adds not only the unpack command but also the info command. Before that gets merged, I'd like to improve the code further.

Shreeshrii commented 3 years ago

Thanks, @stweil.

I will add another unpack feature request since you are looking to improve this further.

While creating the starter traineddata (proto model) tesseract outputs a readable version of the recoder (created by combine_lang_model command, I think). Files are named as [MODEL_NAME].charset_size=NNN.txt.

The current info produced from the traineddata file using combine_tessdata does not create this readable format. It will be good to have an option to do so.

EDIT: In some cases I have seen two entries for NULL in these files and I wonder if that is correct.

0    
1   <nul>
2   <nul>
3   /
4   (
5   व
6   ि
7   श
8   ल
9   य
10  े
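Repeated entries in such a listing can be found mechanically. The sketch below assumes the layout shown in the excerpt above, an index followed by whitespace and the symbol; it reports which symbols occur more than once and does not decide whether a repeated `<nul>` is correct. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_symbols(listing: str) -> Dict[str, List[int]]:
    """Map each symbol that occurs more than once to the indices where it occurs."""
    seen: Dict[str, List[int]] = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split(None, 1)
        if not parts or not parts[0].isdigit():
            continue  # skip empty or malformed lines
        symbol = parts[1].strip() if len(parts) > 1 else ""
        seen[symbol].append(int(parts[0]))
    return {sym: idx for sym, idx in seen.items() if len(idx) > 1}


sample = "0\t\n1\t<nul>\n2\t<nul>\n3\t/\n4\t(\n"
print(duplicate_symbols(sample))
```

For the first lines of the listing above this reports `<nul>` at indices 1 and 2.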
Shreeshrii commented 2 years ago

@stweil Any plans to include this in 5.0.0?

stweil commented 2 years ago

No, it won't be included in 5.0.0, but maybe in some later version (5.1.0?).