tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: Remove the legacy OCR Engine #707

Closed amitdo closed 5 years ago

amitdo commented 7 years ago

Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

From #518:

@stweil commented:

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

@theraysmith commented:

Please provide examples of where you get better results with the old engine. Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

amitdo commented 6 years ago

I really don't know.

Shreeshrii commented 6 years ago

Jeff, There are also issues with psm 1 and 3 with some of the traineddata from _fast.

Shreeshrii commented 6 years ago

Do you think removing --oem help text completely from the tesseract --help would do the job?

You would also need to remove it from the single option

--help-oem

Also user patterns and user words are not supported in 4.0.

Maybe a better approach will be to catch all these at initialization and give error message saying these are not supported, and also suggesting the alternate which works.
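A minimal sketch of that idea, assuming hypothetical names (`EngineMode`, `UnsupportedModeMessage`) that are not part of Tesseract's API. The mode numbering follows the `--help-oem` text. An empty result means the mode is accepted; otherwise the message names the alternative that works.

```cpp
#include <string>

// Hypothetical stand-in for the engine mode enum; the values mirror the
// numbers shown by --help-oem.
enum class EngineMode {
  kLegacyOnly = 0,
  kLstmOnly = 1,
  kLegacyPlusLstm = 2,
  kDefault = 3
};

// Hypothetical helper, meant to be called once at initialization.
// Returns an empty string if the mode is supported, otherwise an error
// message that also suggests the working alternative.
std::string UnsupportedModeMessage(EngineMode mode) {
  if (mode == EngineMode::kLegacyOnly || mode == EngineMode::kLegacyPlusLstm) {
    return "--oem " + std::to_string(static_cast<int>(mode)) +
           " is not supported; use --oem 1 (LSTM only) instead.";
  }
  return std::string();
}
```

The same pattern could cover user patterns and user words: check once at startup, and fail with a message instead of silently ignoring the option.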

zdenop commented 6 years ago

And what about a workaround? Something like this:

diff --git a/api/tesseractmain.cpp b/api/tesseractmain.cpp
index 783e2627..4a30ae21 100644
--- a/api/tesseractmain.cpp
+++ b/api/tesseractmain.cpp
@@ -139,9 +139,9 @@ void PrintHelpForPSM() {
 void PrintHelpForOEM() {
   const char* msg =
       "OCR Engine modes:\n"
-      "  0    Original Tesseract only.\n"
+      "  0    Original Tesseract only (unsuppored).\n"
       "  1    Neural nets LSTM only.\n"
-      "  2    Tesseract + LSTM.\n"
+      "  2    Tesseract + LSTM (unsuppored).\n"
       "  3    Default, based on what is available.\n";

   printf("%s", msg);
@@ -308,8 +308,14 @@ void ParseArgs(const int argc, char** argv, const char** lang,
       *pagesegmode = static_cast<tesseract::PageSegMode>(atoi(argv[i + 1]));
       ++i;
     } else if (strcmp(argv[i], "--oem") == 0 && i + 1 < argc) {
-      checkArgValues(atoi(argv[i+1]), "OEM", tesseract::OEM_COUNT);
-      *enginemode = static_cast<tesseract::OcrEngineMode>(atoi(argv[i + 1]));
+      int oem = atoi(argv[i + 1]);
+      checkArgValues(oem, "OEM", tesseract::OEM_COUNT);
+      if (oem == tesseract::OEM_TESSERACT_ONLY ||
+          oem == tesseract::OEM_TESSERACT_LSTM_COMBINED) {
+        printf("Legacy OCR Engine is not supported anymore.\n");
+        exit(2);
+      }
+      *enginemode = static_cast<tesseract::OcrEngineMode>(oem);
       ++i;
     } else if (strcmp(argv[i], "--print-parameters") == 0) {
       noocr = true;
diff --git a/ccstruct/publictypes.h b/ccstruct/publictypes.h
index c23cd269..9416ce95 100644
--- a/ccstruct/publictypes.h
+++ b/ccstruct/publictypes.h
@@ -266,10 +266,11 @@ enum ParagraphJustification {
  * mention the connection to OcrEngineMode in the comments.
 */
 enum OcrEngineMode {
-  OEM_TESSERACT_ONLY,           // Run Tesseract only - fastest
+  OEM_TESSERACT_ONLY,           // Run Tesseract only - fastest; depreciated
   OEM_LSTM_ONLY,                // Run just the LSTM line recognizer.
   OEM_TESSERACT_LSTM_COMBINED,  // Run the LSTM recognizer, but allow fallback
                                 // to Tesseract when things get difficult.
+                                // depreciated
   OEM_DEFAULT,                  // Specify this mode when calling init_*(),
                                 // to indicate that any of the above modes
                                 // should be automatically inferred from the

It should not break anything (e.g. the legacy engine remains available via the API, and ordinary users can no longer select it from the command line)...

zdenop commented 6 years ago

Should I commit this change to master? Will it help?

jimregan commented 6 years ago

s/unsuppored/unsupported/ s/depreciated/deprecated/

amitdo commented 6 years ago

@zdenop, that was my original request/suggestion.

Until Ray actually removes the code, I suggest we just change the command line program to use lstm mode only.

amitdo commented 6 years ago

@jimregan,

but Zdenko's patch actually removes the non-lstm support from the command line (but not the API).

jimregan commented 6 years ago

@amitdo huh? I was just pointing out typos in the patch

amitdo commented 6 years ago

Sorry, stupid mistake.

zdenop commented 6 years ago

Committed as 173ad2bd0044c82973fe17b6f077e7202602cb99

stweil commented 6 years ago

I think that the legacy engine was removed a little bit too early. @theraysmith said he would keep it until LSTM is a really full replacement. Currently, even "best" LSTM does not achieve the same recognition rate as the legacy engine for selected texts. And LSTM does not recognize any of the text attributes which the legacy engine does.

I would not mind adding a configure option, so people can build a Tesseract without legacy support. From Debian, I expect a tesseract-ocr package which supports both engines.

May I send a PR which restores support for the legacy engine and modifies the help text to avoid confusion for less advanced users?

zdenop commented 6 years ago

@stweil: Yes, of course you can make a better PR...

Shreeshrii commented 6 years ago

From Debian, I expect a tesseract-ocr package which supports both engines.

@stweil Please see https://launchpad.net/ubuntu/+source/tesseract

The code committed in Debian is - 4.00~git2188-cdc35338-5

Will these new patches be propagated there?

stweil commented 6 years ago

Yes, it still supports both engines. My comment was mainly for the Debian package maintainers: Please do not use the current git master as long as it disables the legacy engine.

stweil commented 6 years ago

@stweil: Yes, of course you can make a better PR...

I am trying to address it in pull request #1325.

Shreeshrii commented 6 years ago

@stweil

May I ask you to add information on the different types of output available using config files for

txt pdf tsv hocr invisible-pdf (please check for correct syntax)

Something on lines of:

Output Options (using configfile):

txt                         output UTF-8 text (default)
pdf                         output searchable pdf with original image and invisible text layer
hocr                        output hOCR HTML files
tsv                         output TSV file

etc

Shreeshrii commented 6 years ago

@stweil

Additional request:

When using --list-langs, display the path from which the traineddata files are loaded.

stweil commented 6 years ago

What about radically simplifying the tessdata lookup first? Remove support for environment variable TESSDATA_PREFIX (support default and --tessdata-dir). Remove parameter m_data_sub_dir. Don't try directory with and without tessdata/ appended. I think all these "features" contribute to the confusion of the tesseract users.
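The simplified lookup could be sketched as below. This is an illustration under assumptions, not the code in Tesseract: `ResolveTessdataDir` and `kDefaultTessdataDir` are hypothetical names, and the default path is a placeholder for whatever the build configures.

```cpp
#include <string>

// Hypothetical compiled-in default; a real build would set this at
// configure time.
const char kDefaultTessdataDir[] = "/usr/local/share/tessdata";

// Returns the single directory searched for traineddata files: the value
// of --tessdata-dir if one was given, otherwise the compiled-in default.
// It deliberately ignores TESSDATA_PREFIX and never retries with
// "tessdata/" appended.
std::string ResolveTessdataDir(const std::string& tessdata_dir_arg) {
  std::string dir = tessdata_dir_arg.empty()
                        ? std::string(kDefaultTessdataDir)
                        : tessdata_dir_arg;
  // Normalize to exactly one trailing slash so that file names can be
  // appended directly.
  while (dir.size() > 1 && dir.back() == '/') dir.pop_back();
  if (dir.back() != '/') dir += '/';
  return dir;
}
```

With only one candidate directory, an error message can state exactly where a missing traineddata file was expected.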

Shreeshrii commented 6 years ago

@stweil Excellent suggestion. It would really make it easier for users.

stweil commented 6 years ago

The first step is done in #1328.

theraysmith commented 6 years ago

Jeff asked me to comment on this thread. Here's where my research stands: The new LSTM osd model is already there. Its training accuracy was great. I have some experimental code that used it. I stopped work on the project partly because I ran out of time, but also because I wanted feedback on the success of the Script models. I'm not sure how complete it was. It requires work to the layout analysis code, and some top-level code to select the script.

If the script models obtain satisfactory accuracy (to those who want an automatic OSD mode), then we can go ahead with the plan to use the LSTM-based OSD detector, and use that to pick the script, then run in that script.

I assume that 4.00 is going out without this change, but after it is released, it would be the obvious next thing for me to do. Then we can finally cut the cord and jettison the old code.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375027879

theraysmith commented on Mar 21

I did have an idea for a better multi-language implementation that would cleanly use models from multiple languages at once, but that depends on getting rid of the old code, and moving the multi-language functionality into the beam search. Until the old code is gone, that would be very messy.

amitdo commented 6 years ago

I would not mind adding a configure option, so people can build a Tesseract without legacy support.

This can be done with the help of a code coverage tool. https://clang.llvm.org/docs/SourceBasedCodeCoverage.html

I was able to compile tesseract on a Debian 9 system with the right flags and to produce a coverage report for the whole program in HTML format, one page per source file.

amitdo commented 6 years ago

I was able to remove ~36k lines of code so far (It's not yet in a public repo).

amitdo commented 6 years ago

You can add to this number 4350 more lines that can be removed by getting rid of the useless OpenCL code.

amitdo commented 6 years ago

I succeeded in dropping the legacy engine by removing ~37K LOC.

I am now working on converting it to an '#ifdef' version. I expect it to be ready for a PR within a few days.

Shreeshrii commented 6 years ago

Have you isolated code regarding use of correct unicharset within dawg processing?

amitdo commented 6 years ago

What do you mean, Shree?

Shreeshrii commented 6 years ago

The legacy engine should use lang.unicharset while lstm engine should use lang.lstm-unicharset.

Within the dawg processing there is a bug, and the wrong unicharset gets used.

amitdo commented 6 years ago

With the new option --disable-legacy, it will work the way oem 1 (lstm only) currently does. Other OEMs will be disabled.

Any psm with osd will be disabled with this compilation option, because the osd code depends on the legacy engine. Ray said he has a solution to osd for lstm.
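A rough sketch of how such a compile-time switch could gate requests. The macro `SKETCH_DISABLE_LEGACY` and the function `IsRequestSupported` are hypothetical names for illustration, not the ones used in the PR. Treating oem 3 (default) as resolving to LSTM is also an assumption.

```cpp
// Hypothetical macro: defining it mimics a build configured with
// --disable-legacy.
#ifdef SKETCH_DISABLE_LEGACY
constexpr bool kLegacyAvailable = false;
#else
constexpr bool kLegacyAvailable = true;
#endif

// Returns true if the requested combination can run in this build.
//   oem: 0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default.
//   psm_uses_osd: whether the page segmentation mode needs OSD.
// The third parameter defaults to the build setting and exists only so
// that both configurations can be exercised from one binary.
bool IsRequestSupported(int oem, bool psm_uses_osd,
                        bool legacy_available = kLegacyAvailable) {
  if (oem < 0 || oem > 3) return false;
  if (legacy_available) return true;
  // Without the legacy code only LSTM can run, and OSD (which depends on
  // the legacy classifier) is unavailable.
  const bool lstm_only = (oem == 1 || oem == 3);
  return lstm_only && !psm_uses_osd;
}
```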

Shreeshrii commented 6 years ago

Ray had initially posted the osd traineddata for 4.00.00alpha with lstm. Then because of some errors it was changed to the 3.04 version.

You can check if that osd works with your non-legacy version.

On Sun 1 Jul, 2018, 12:18 PM Amit D., notifications@github.com wrote:

With the new option --disable-legacy, it will work the way oem 1 (lstm only) currently does.

Any psm with osd will be disabled with this compilation option, because the osd code depends on the legacy engine. Ray said he has a solution to osd for lstm.


amitdo commented 6 years ago

It won't work because the code that handles it is not in the public GitHub repo.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/pull/1740 (merged)

amitdo commented 6 years ago

@theraysmith, will the LSTM-OSD be implemented using the method described in this paper?

https://www.researchgate.net/publication/280777013_A_Sequence_Learning_Approach_for_Multiple_Script_Identification

makmanalp commented 6 years ago

I know it's unlikely to stay, but I'd like to voice my support for style detection (and thus, perhaps, the old engine). In my research group we're working on OCRing very old government documents and parsing structured data out of them, and the reason we care is that often style (font size, bold, italic) is the only reasonable way to tag specific bits of information.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-327814244

amitdo commented 6 years ago

Font size estimation is already supported for the lstm engine.

makmanalp commented 6 years ago

@amitdo er, should have excluded size from there, I mean style.

amitdo commented 6 years ago

Did you notice my other comment above (a link to Ray's comment)?

amitdo commented 5 years ago

So it seems that the legacy engine will stay in final 4.0.0.

I still have a question:

Has anyone done serious testing on hundreds of pages to compare the results of:

1) legacy vs. lstm
2) lstm+legacy vs. lstm alone

- Just the text renderer with default psm (auto, no osd).

I'm interested in char and word error (CER & WER) statistics.
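For reference, CER and WER are commonly defined as the Levenshtein edit distance between OCR output and ground truth, divided by the length of the ground truth in characters or words. The sketch below illustrates that definition. It is not the tool used for any results in this thread, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Levenshtein distance between two sequences (insertion, deletion and
// substitution each cost 1), computed with a single rolling row.
template <typename Seq>
std::size_t EditDistance(const Seq& ref, const Seq& hyp) {
  std::vector<std::size_t> row(hyp.size() + 1);
  for (std::size_t j = 0; j <= hyp.size(); ++j) row[j] = j;
  for (std::size_t i = 1; i <= ref.size(); ++i) {
    std::size_t diag = row[0];  // cell (i-1, j-1)
    row[0] = i;
    for (std::size_t j = 1; j <= hyp.size(); ++j) {
      const std::size_t up = row[j];  // cell (i-1, j)
      const std::size_t subst = diag + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      row[j] = std::min({subst, up + 1, row[j - 1] + 1});
      diag = up;
    }
  }
  return row[hyp.size()];
}

// Character error rate. Takes UTF-32 strings so that one element is one
// code point rather than one byte.
double CharErrorRate(const std::u32string& ref, const std::u32string& hyp) {
  if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;
  return static_cast<double>(EditDistance(ref, hyp)) / ref.size();
}

// Word error rate: the same measure over whitespace-separated tokens.
double WordErrorRate(const std::string& ref, const std::string& hyp) {
  auto split = [](const std::string& s) -> std::vector<std::string> {
    std::istringstream in(s);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
  };
  const std::vector<std::string> r = split(ref);
  const std::vector<std::string> h = split(hyp);
  if (r.empty()) return h.empty() ? 0.0 : 1.0;
  return static_cast<double>(EditDistance(r, h)) / r.size();
}
```

Both rates can exceed 1.0 when the OCR output contains many insertions, which is one reason medians are often reported alongside means.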

stweil commented 5 years ago

I am currently running accuracy tests on 189 pages from our historic books. So far I have results from ABBYY Fine Reader and Tesseract with fast Fraktur and PSM 1, 3 and 6. I also tested ScanTailor + Tesseract fast Fraktur PSM 1. Best Fraktur is still running. First results for CER median:

ABBYY:              10.5 %
PSM 1:              13.0 %
PSM 3:              15.0 %
PSM 6:              20.2 %
ScanTailor + PSM 1: 12.8 %

Detailed results will be published as soon as the tests are finished.

stweil commented 5 years ago

lstm+legacy is currently not usable for mass production, because chances are high that Tesseract will fail because of the well known assertion.

amitdo commented 5 years ago

Thanks for sharing!

What preprocessing options are used with ScanTailor?

I think ABBYY also has 'fast' and 'accurate' modes.

stweil commented 5 years ago

ScanTailor was used with scantailor-cli --color-mode=mixed --dewarping=auto.

ABBYY used the default mode with different language settings. I still have to look for effects caused by different handling of diacritics (for example, Latin ground truth and ABBYY result without accents, but original text uses accents => Tesseract Fraktur detects accents).

The raw data is at https://digi.bib.uni-mannheim.de/~stweil/anciendroit/new/.

amitdo commented 5 years ago

I decided to close this issue.

There is now an option to compile Tesseract 4.0.0 without the legacy engine code.

amitdo commented 3 years ago

I succeeded in dropping the legacy engine by removing ~37K LOC.

It's more than 64k LOC now.

amitdo commented 3 years ago

@stweil

Can you run the accuracy tests again on the same dataset with master and/or latest tagged version, to make sure there is no regression?

stweil commented 3 years ago

Here are the results with the latest Tesseract for the line images posted by @uvius. They are still not perfect, but much better than the old ones, especially with a new model which I recently trained (based on ground truth published by @uvius and others).

--oem 1 --psm 7 -l tessdata/lat

17 :
V.
SECVNDAE
B 3
LIBER
AD
LxxxvIIL
20 PROGYMNASMATA
IN GENEROSVM ADOLESCEN-
caftris millia pafluum circiter feptem. R ex cum hoc itinere Cafíaré uenire

--oem 1 --psm 7 -l ubma/frak2021_0.905_1587027_9141630

17
V.
SECVNDAE
B 3
LIBER
AD
LXXXVIII.
20 PROGYVMNASMATA
IN GENEROSVM ADOLES CEN-
caſtris millia paſſuum circitèr ſeptem. Rex cum hoc itinere Cæſarẽ uenire

stweil commented 3 years ago

Can you run the accuracy tests again on the same dataset with master and/or latest tagged version, to make sure there is no regression?

tesseract-5.0.0-alpha-592-gb221f gives different results for only a few of the 189 files of my test set, but some of those are significantly better (-=old, +=new):

-452114306_0024  85.67%  Accuracy
+452114306_0024  85.57%  Accuracy
-452117542_0022  20.94%  Accuracy
+452117542_0022  87.72%  Accuracy
-461732149_0012  89.80%  Accuracy
+461732149_0012  89.44%  Accuracy
-461732149_0158  15.09%  Accuracy
+461732149_0158  81.48%  Accuracy
-470857285_0979  93.70%  Accuracy
+470857285_0979  93.77%  Accuracy
-470875348_0608  89.93%  Accuracy
+470875348_0608  89.97%  Accuracy
-470901101_0034  86.81%  Accuracy
+470901101_0034  86.86%  Accuracy

Git master produces different results, some of them slightly worse, some are better. The most significant change with latest Tesseract is the time required to process the 189 pages. It dropped from 1638 s to 926 s. I think that both effects are caused by commits cfb1fb2540ca2fb899d6ecf68ed9d328bed9e91e and eaf72ace3115aca7421c8c9ed34108e15360cf12.