tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: Remove the legacy OCR Engine #707

Closed amitdo closed 5 years ago

amitdo commented 7 years ago

Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

From #518:

@stweil commented:

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

@theraysmith commented:

Please provide examples of where you get better results with the old engine. Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

Shreeshrii commented 7 years ago

Is the intention that the legacy OCR engine will be available in 3.0x branch and LSTM engine in the 4.0 version?

harinath141 commented 7 years ago

I support @theraysmith removing the legacy OCR engine, as we are getting better results with the LSTM-based one. However, we have to improve multi-language support, and many fixes are still needed for the 4.0 final.

amitdo commented 7 years ago

My personal opinion is that we should drop the old engine. It will be much easier to maintain and support Tesseract in this form. I also support dropping the OpenCL code.

amitdo commented 7 years ago

I also think we should release a last 3.0x version in the upcoming 2-6 weeks.

egorpugin commented 7 years ago

+1 for dropping (provided the LSTM engine gives better results)

atuyosi commented 7 years ago

I cannot agree with removing the old OCR engine until the new LSTM engine supports vertical text.

Of course I know that the new LSTM engine is very good (especially for Japanese text that includes English words). In the meantime, maintaining the old engine provides the option of using the old OCR engine only for vertical text.

c.f. #627 , #641

theraysmith commented 7 years ago

It will support vertical text. I have an experimental implementation that treats it as an additional language, but it would be possible to make it depend on the layout analysis instead.

zdenop commented 7 years ago

If 3.05 is to be the last version with the legacy OCR engine (old engine), then there should be a way to read the OCR result from memory.

Also, it would be great if versions 3.05 and 4.0 could be installed at the same time (AFAIK there is a conflict with the tessdata filenames: they are the same, but the files are not compatible).

amitdo commented 7 years ago

:+1: for a side-by-side 3.05 and 4.00.

A possible way to achieve this goal: For 3.05 you can append 3 to libtesseract and all the installed programs. The traineddata will live in .../share/tessdata3.

zdenop commented 7 years ago

I would prefer to be as consistent as possible: e.g. if 3.02 and 3.04 use tessdata, 3.05 should too. So the change should start with 4.0...

egorpugin commented 7 years ago

Yes, if later we'll have 5.0 with different data files, they'll use tesseract5 and this won't break anything. If we have tesseract for 4.0, then it will be renamed to tesseract4 again, and tesseract for 5.0 - that's not good.

Shreeshrii commented 7 years ago

I agree with zdenop

tessdata should be used for the 3.0x series, so as to not break any existing use

New naming can be used for LSTM 4.0

stweil commented 7 years ago

A simple solution could be using tessdata/4, tessdata/5 and so on for new major versions, so we continue using a tessdata directory at the same location as before, but automatically add the major version as the name of a subdirectory. If Tesseract uses semantic versioning in the future, I see no need to add a second number (although that would be possible, resulting in tessdata/4.0).

For the program names, we can look for existing examples. I just checked my /usr/bin/*[0-9] files and found names like clang-3.8, gcc-6, php5, php-7.0, ruby2.1. So there is no clear convention whether to separate name and version by a dash or not and whether to use major version only or both major and minor version. With semantic versioning the major version should be sufficient again.
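
A minimal sketch of the naming scheme discussed above, written in Python purely for illustration. It assumes that 3.x keeps the plain tessdata directory and that newer major versions get a subdirectory. The helper names and the /usr/share/tessdata base path are hypothetical; nothing here is taken from the Tesseract code.

```python
from pathlib import PurePosixPath


def major_version(version: str) -> int:
    """Return the major component of a version string such as '4.00'."""
    return int(version.split(".")[0])


def versioned_tessdata_dir(base: str, version: str) -> PurePosixPath:
    """Hypothetical helper: 3.x keeps 'tessdata' unchanged, while 4.0 and
    later use a major-version subdirectory (tessdata/4, tessdata/5, ...)."""
    root = PurePosixPath(base)
    major = major_version(version)
    return root if major < 4 else root / str(major)


def versioned_program_name(version: str, sep: str = "") -> str:
    """Hypothetical helper: 'tesseract4' or, with sep='-', 'tesseract-4'.
    The thread notes there is no clear convention for the separator."""
    return f"tesseract{sep}{major_version(version)}"


if __name__ == "__main__":
    for v in ("3.05", "4.00", "5.0"):
        print(v, versioned_tessdata_dir("/usr/share/tessdata", v))
```

With semantic versioning only the major number matters, which is why the sketch ignores everything after the first dot.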

theraysmith commented 7 years ago

I'm thinking of using the same traineddata file format for 4.0, but adding some new subfiles, including a version string, as has been requested. The LSTM-only engine would then store the unicharset, recoder and dawgs as separate traineddata components, also satisfying the need to get at the unicharset. With an additional subfile to store the trainer-specific data, it should be possible use the traineddata file format as a checkpoint format during training, which gets rid of a layer of complexity. I had thought of going with a different filename extension, but the versioned subdir seems like a good idea too.

In any case, we should roll back the existing traineddata files for 3.05.

jbreiden commented 7 years ago

https://wiki.ubuntu.com/ZestyZapus/ReleaseSchedule

Feb 16 is the final deadline for changes to Ubuntu 17.04. I am not comfortable shipping anything from 4.x to these users, but we can consider taking a snapshot of the 3.05 branch. It does have some bug and compatibility fixes that are good for users. Regarding training data, I would not ship an update to that at all. This would purely be a code update.

I know the long-standing issue has been restoring an API call (last seen in version 3.02) to send results to memory instead of to a file. I respect that idea, but we don't have it, and it's not that easy to add. I think it is fair to say that it would be impossible before the deadline. So the question is: do we ship an update to users this cycle or not? And if so, should I take a snapshot? And if so, what would it be called?

A few more thoughts that are somewhat related

Shreeshrii commented 7 years ago

@jbreiden Good idea to do a code update for 3.05 for Ubuntu 17.04. There are a number of bug fixes and changes and it would be good to get them out to the users. Thanks!

uvius commented 7 years ago

@theraysmith: Here I try to provide examples of where you get better results with the old engine.

I did a lot of LSTM training with OCRopus on real images of historical printings and noticed that LSTM recognition was inferior to classic Tesseract in these cases:

  1. glyphs rarely seen in training (capital letters, numbers, certain punctuations)
  2. unusual patterns (letter-spacing, e.g. R U N N I N G H E A D)
  3. very short lines (catchword at page end, page numbers)

My explanation is that single letters get decoded using the combined evidence of the whole line. If this evidence is either rare and unusual (1, 2) or mostly absent (3), decoding is uncertain, no matter how clearly single glyphs are printed and preserved (and therefore easily recognized by methods based on feature detection).

So I tried both the old (OEM = 0) and new (OEM = 1) recognizer on these 10 lines (the last line is a regular text line for comparison from a 1543 printing, where a trained model yields 99.5% accuracy for the book):

1 bin 2 bin 3 bin 4 bin 5 bin 6 bin 7 bin 8 bin 9 bin 10 bin

Old method: tesseract -l lat --oem 0 --psm 7:

17: V. SECVNDAE B 3 LIBER AD Lxxxvxn. zo PROGYMNASMATA IN GENEROSVM ADOLESCEN- cafiris millia paITuum circitér fcptem.Rc_x cum hoc itincrc szaré ucnirc

New method: tesseract -l lat --oem 1 --psm 7:

177: V,. SECV NDAHE B- 5 LI B E D. A D Lx x XV II IL. 209 P R o cy M N ^ s M A T 4 IN GE NE R O SVM A D O L E S CE N- caüris millia paiTuum circiter fcptcm.Rc-x cum hoc itinere Cæfarö uenit:

Admittedly, although this is all Latin text, the recognition looks much better without any language model (tesseract --oem 1 --psm 7):

17; V. SECV NDA E B ; LIB E R A D Lx X xv III. 40 PR 0 GY MN A S M A T a IN GENEROSVM ADOLESCEN. caftris millia pafluum circiter feptem. Rex cum hocitinere Cafaré uenire

But it still is less consistent than the old method in treating spacings. The last line shows the potential that may be reached when training on real images becomes available (long ſ, proper inter-word spacing model, historical glyphs).

So I vote for keeping the old code just for these edge cases which are otherwise hard to recognize at the same level of consistency.
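
One way to put numbers on a comparison like this is a character accuracy based on edit distance. The sketch below is an illustration only: the thread does not say how the 99.5% figure was computed, there is no ground truth in the thread, and the reference string used in the example ("LIBER") is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """1 - distance / len(truth), clamped to the range [0, 1]."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


if __name__ == "__main__":
    truth = "LIBER"  # assumed reference, not provided in the thread
    for label, ocr in (("oem 0", "LIBER"), ("oem 1", "LI B E D")):
        print(label, round(char_accuracy(truth, ocr), 3))
```

Spurious spaces count as errors here, so a measure like this penalizes the inconsistent spacing described above as well as wrong glyphs.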

amitdo commented 7 years ago

Admittedly, although this is all Latin text, the recognition looks much better without any language model

Without explicit -l LANG, Tesseract will use the eng traineddata, so

tesseract --oem 1 --psm 7

is equivalent to:

tesseract -l eng --oem 1 --psm 7
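
The same point as a small Python sketch that only builds the argument list and does not run anything. The helper names and the file name line.png are hypothetical; the only facts used are the -l, --oem and --psm options and the eng default described above.

```python
from typing import List, Optional

DEFAULT_LANG = "eng"  # used by Tesseract when -l is not given, per the comment above


def effective_lang(lang: Optional[str]) -> str:
    """Language that will actually be used for a given -l value."""
    return lang if lang else DEFAULT_LANG


def build_command(image: str, outputbase: str, lang: Optional[str] = None,
                  oem: Optional[int] = None, psm: Optional[int] = None) -> List[str]:
    """Hypothetical helper that assembles a tesseract command line."""
    cmd = ["tesseract", image, outputbase]
    if lang:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


if __name__ == "__main__":
    implicit = build_command("line.png", "stdout", oem=1, psm=7)
    explicit = build_command("line.png", "stdout", lang="eng", oem=1, psm=7)
    print(" ".join(implicit))
    print(" ".join(explicit))
    # Both invocations end up using the same traineddata.
    print(effective_lang(None) == effective_lang("eng"))
```

So the run "without any language model" above was in fact a run with the English model.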

amitdo commented 7 years ago

@theraysmith commented in commit b453f74e01

There is always going to be a significant speed penalty for multi-lang mode. The multi-lang mode could still do with more work to run it at a lower level, (inside RecognizeLine) but the legacy engine could do to go before that, or multi-lang could get really unnecessarily complex.

solomennikm commented 7 years ago

I think there is reason to keep the old OCR engine while the LSTM engine is not ideal. This would allow using the two engines simultaneously. For example, ABBYY uses several OCR methods in its engine: a Bayesian classifier with about 100 features, a raster classifier, a contour classifier, a structure classifier, and then differentiating classifiers.

amitdo commented 7 years ago

The problem is that the code for the old engine is too large and complex. As Ray indicated, keeping it will make improving the new LSTM engine much harder.

zdenop commented 7 years ago

So there should be changes in the 4.0 code so that Tesseract 4.x and 3.05.x can be installed at the same time.

egorpugin commented 7 years ago

Yes, anyone who wants the old engine can use the 3 series once LSTM is available in 4.

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/tesseract/issues/733

single letters recognized better with legacy

amitdo commented 7 years ago

From #744 theraysmith commented:

... yes I would still like to remove the old classifier and take out a lot of code with it. I'm going to review the replies to my request for "old better than new", and thanks to those that provided them, with a view to making new better than old on those problems.

amitdo commented 7 years ago

From #518 theraysmith commented:

Please provide examples of where you get better results with the old engine.

@stweil commented 29 days ago:

I'll do that in the discussion of the new issue #707.

Stefan, we are still waiting for it ... :-)

stweil commented 7 years ago

I can confirm all problems reported above by @uvius. In addition, some training files currently exist only for 3.x (notably deu_frak) or are of poor quality (deu), so 4.0 does not improve the results for those languages.

I also had an example where a larger part of a page was missing in the output from LSTM while the old recognizer got most of that part correctly, but I am still searching to find that example again.

amitdo commented 7 years ago

As you know, unlike almost all the other files in the tessdata repo, the '_frak' traineddata files are not based on Google manpower (& machine-power) efforts. https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#fraktur-data-files

Maybe you and your friends from @UB-Mannheim can prepare a new deu_frak traineddata for Tesseract 4.00 and share it under an open source license (preferably Apache 2 or another permissive software license)?

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tesseract/issues/681#issuecomment-275801448 Just a reminder for Ray.

amitdo commented 7 years ago

From tesseract-ocr/langdata issue 59

theraysmith commented

I'm also going to fix the single char/single word issue that was raised as an objection to deleting the legacy engine.

stweil commented 7 years ago

I recently tried to improve the training model for a language (frk). That's rather easy and does not need much compute time (~ minutes) or other resources for the old engine. Especially adding more characters which should be recognized is a simple task as soon as the general infrastructure (Tesseract binaries, small number of fonts) is available.

For the new LSTM engine, this is totally different. As far as I know it is not possible to add a missing character to an existing trained LSTM, so new training from scratch is needed. This requires a lot of resources (much more training text, a huge number of fonts, compute time ~days / ~weeks) and cannot be done by most users. Maybe @theraysmith or users who have successfully trained LSTM can provide more detailed numbers.

My conclusion is that most users of the new LSTM will be restricted to the available trained data either from @theraysmith or from third parties. If the old engine is removed, it will no longer be possible to optimize OCR for documents with unusual or rare characters. Calling Tesseract with more than one language can only partially solve such situations.

stweil commented 7 years ago

LSTM currently does not work with all languages (see issue #682). That's related to my previous comment: adding (good) LSTM support for a language is much more difficult than for the old engine. Of course the existing languages will be fixed one day, but there still remain more exotic languages which are not covered today, and people won't be able to add them to Tesseract. We could tell users to use Tesseract 3.x for those cases, but would that really save development resources when there is the need to maintain both versions 3 and 4? It seems clear that having two major versions of Tesseract requires more work for Linux distributions. Neither is it a good solution for users who have to install and use both versions and don't know what to do when texts require both versions.

theraysmith commented 7 years ago

I believe I am on a path to make the LSTM engine work with many languages, and possibly unseen languages in the same script. I agree that training from scratch is much more difficult than for the old engine, but I think the obtainable accuracy makes it worth leaving the old engine behind. I think that the fine tuning and/or replacing just one layer training may be adequate for adding new fonts or new characters, with a bit more work on my part. A big part of my desire to drop the old engine is that it would enable a much better solution to multi-lang/multi-script and plug-replaceable language models. While the legacy engine remains in place, there is a lot of work-around to do to integrate the LSTM engine that does not fit with my ideas for fixing this problem properly.

Shreeshrii commented 7 years ago

There are community training projects for MICR and SSD but these are not included in upcoming training by Ray. Just documenting differences between 3.0x and 4.0 ... listed as issues in langdata:

https://github.com/tesseract-ocr/langdata/issues/65

https://github.com/tesseract-ocr/langdata/issues/64

Shreeshrii commented 7 years ago

Based on my testing, I agree with @stweil - adding (good) LSTM support for a language is much more difficult than for the old engine.

Most recently I tried creating traineddata for Armenian based on a request in the forum - see https://github.com/tesseract-ocr/langdata/issues/67

With my limited fonts and training text, I was able to get a legacy version of the traineddata with reasonable accuracy within a few hours. With the same inputs and 3-4 days of processing, the LSTM version of the traineddata did not improve on that accuracy and took more time to OCR the same file. Of course, my test sample is very limited.

On the other hand, the accuracy and speed of complex scripts such as Devanagari has improved with the LSTM traineddata (though I haven't been able to add a top layer or fine tune those because of unicharset limitations).

I hope the new codebase and traineddata will address these issues. Thanks!

stweil commented 6 years ago

As discussed in issue #1074, currently only the old recognizer is able to detect text attributes like font size, bold, italic.

amitdo commented 6 years ago

@theraysmith, Can you please update us on your plans?

When will 4.00 final version be released? Will it include the legacy engine?

Thank you.

amitdo commented 6 years ago

I strongly suggest removing the legacy engine. It just invites more issues.

Until Ray actually removes the code, I suggest we just change the command line program to use lstm mode only. @jbreiden, please do this for Ubuntu 18.04.

stweil commented 6 years ago

As the new Tesseract packages only provide language models from tessdata_fast, normal users will already get LSTM only.

amitdo commented 6 years ago

I want to reduce users' confusion, and to no longer get reports like #1205 and #1308.

amitdo commented 6 years ago

The Ubuntu 18.04 users will see oem 0 and 2 with --help. Some of them will try these modes and will get a crash.

stweil commented 6 years ago

I think that an assertion is always something which should be fixed in the code, so the right solution would be fixing the code. That's what I did in the past, and I'll also do it here as soon as I can reproduce a problem.

amitdo commented 6 years ago

Try to run tesseract with oem 0 or 2 and fast/best data.
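
A wrapper-side guard against this failure could look roughly like the Python sketch below. The mode numbers follow the --oem help text; everything else is an assumption for illustration: the helper name, the two boolean flags (which a real wrapper would have to derive from the installed traineddata), and the choice to prefer LSTM for mode 3. It is not a description of what Tesseract does internally.

```python
from enum import IntEnum


class Oem(IntEnum):
    TESSERACT_ONLY = 0      # original (legacy) engine only
    LSTM_ONLY = 1           # neural nets LSTM only
    TESSERACT_AND_LSTM = 2  # both engines
    DEFAULT = 3             # based on what is available


def resolve_oem(requested: int, has_legacy: bool, has_lstm: bool) -> Oem:
    """Hypothetical check: fail with a clear error instead of crashing later
    when the requested mode needs a component the data does not contain."""
    mode = Oem(requested)  # raises ValueError for numbers outside 0..3
    if mode is Oem.DEFAULT:
        # Assumption of this sketch: prefer LSTM when it is present.
        if has_lstm:
            return Oem.LSTM_ONLY
        if has_legacy:
            return Oem.TESSERACT_ONLY
        raise ValueError("traineddata contains no usable engine component")
    needs_legacy = mode in (Oem.TESSERACT_ONLY, Oem.TESSERACT_AND_LSTM)
    needs_lstm = mode in (Oem.LSTM_ONLY, Oem.TESSERACT_AND_LSTM)
    if needs_legacy and not has_legacy:
        raise ValueError(f"--oem {int(mode)} needs legacy data, which is missing")
    if needs_lstm and not has_lstm:
        raise ValueError(f"--oem {int(mode)} needs LSTM data, which is missing")
    return mode


if __name__ == "__main__":
    # LSTM-only data, as with the fast/best models mentioned above.
    print(resolve_oem(3, has_legacy=False, has_lstm=True).name)
    try:
        resolve_oem(0, has_legacy=False, has_lstm=True)
    except ValueError as err:
        print("rejected:", err)
```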

jbreiden commented 6 years ago

We can't remove the legacy recognizer because the orientation script detector (OSD) requires it. That's too bad, because few things would make me happier than escaping the maintenance burden of 100K lines of code.

amitdo commented 6 years ago

The OSD module uses a small (?) part of the legacy code.

jbreiden commented 6 years ago

Until Ray actually removes the code, I suggest we just change the command line program to use lstm mode only. @jbreiden, please do this for Ubuntu 18.04.

@amitdo Are you sure about this? Another choice is adjusting the documentation.

OCR Engine modes: (see https://github.com/tesseract-ocr/tesseract/wiki#linux)
  0    Original Tesseract only.
  1    Neural nets LSTM only.
  2    Tesseract + LSTM.
  3    Default, based on what is available.

amitdo commented 6 years ago

Well, I just expressed my strong preference as someone who responds to a large percent of issues here.

But it's OK if you and others think differently than me and have another solution to the issue.

jbreiden commented 6 years ago

I'm only suggesting the documentation approach because it might both prevent bug reports, and be helpful to the relatively few advanced users who want to try alternative training data. Do you think it will work to prevent bug reports?

amitdo commented 6 years ago

Most people do not bother to read the documentation (true for any project, not just this one). I don't know if that link from the command line to the wiki will help.

jbreiden commented 6 years ago

Do you think removing --oem help text completely from the tesseract --help would do the job?

$ tesseract --help
Usage:
  tesseract --help | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
            bypassing hacks that are Tesseract-specific.

Single options:
  -h, --help            Show this help message.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.