tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tag a new version for LSTM 4.0 #995

Closed Shreeshrii closed 6 years ago

Shreeshrii commented 7 years ago

Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

@zdenop Please add a new tag, e.g. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

WilliamTambellini commented 6 years ago

Hi everybody. It looks like the Dec 15 release is/was a good milestone and at least a good test on Ubuntu 18. What about creating a "4.0.0beta" tag now?

jbreiden commented 6 years ago

The package version number is 4.00~git2188-cdc35338-2 so that's commit cdc35338. Maybe give it a little time to settle? We had a critical bug the other day, but that turned out to be in Leptonica.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=885704
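
For readers who need to map a package version back to a commit, here is a minimal sketch that splits version strings of the shape quoted in this thread. The pattern is inferred only from the examples here (4.00~git2188-cdc35338-2 and similar). It is an assumption, not an official Debian or Tesseract format, and the meaning of the number after "git" is not stated in the thread, so the field name `snapshot` is hypothetical.

```python
import re

# Pattern inferred from the version strings quoted in this thread; it is an
# assumption, not an official Debian or Tesseract format definition.
_SNAPSHOT_RE = re.compile(
    r"^(?P<upstream>[^~]+)~git(?P<snapshot>\d+)-(?P<commit>[0-9a-f]+)-(?P<revision>.+)$"
)


def parse_snapshot_version(version):
    """Split '<upstream>~git<N>-<hash>-<revision>' into its parts."""
    match = _SNAPSHOT_RE.match(version)
    if match is None:
        raise ValueError(f"not a git snapshot version: {version!r}")
    parts = match.groupdict()
    parts["snapshot"] = int(parts["snapshot"])  # hypothetical field name
    return parts


print(parse_snapshot_version("4.00~git2188-cdc35338-2")["commit"])
```

The abbreviated hash returned in `commit` can then be looked up in the repository history.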

amitdo commented 6 years ago

https://wiki.ubuntu.com/BionicBeaver/ReleaseSchedule

January 11th: Alpha 1
February 1st: Alpha 2
March 1st: FeatureFreeze, Debian Import Freeze
March 8th: Beta 1 Freeze
April 5th: Final Beta Freeze, Final Beta
April 19th: FinalFreeze, ReleaseCandidate
April 26th: FinalRelease, Ubuntu 18.04

stweil commented 6 years ago

The suggested tag would be 4.0.0-beta.20180105 (for today), see the discussion above.
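
A date-stamped tag of this form could be generated rather than typed by hand. The following is only a sketch: the function name `prerelease_tag` and the set of accepted stage names are assumptions made for illustration, not something the project defines.

```python
from datetime import date


def prerelease_tag(base, stage, day):
    """Build a tag such as '<base>-<stage>.<YYYYMMDD>'."""
    # The allowed stage names are an assumption for this sketch.
    if stage not in ("alpha", "beta", "rc"):
        raise ValueError(f"unknown pre-release stage: {stage!r}")
    return f"{base}-{stage}.{day:%Y%m%d}"


print(prerelease_tag("4.0.0", "beta", date(2018, 1, 5)))
```

Keeping the date as a single numeric identifier means later snapshots of the same stage sort after earlier ones.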

WilliamTambellini commented 6 years ago

Well, I don't really care about the name of the tag as long as one is created soon. Reminder: a tag is 'just' a tag (it's not a branch); it is just super convenient for comparing different milestones.

Shreeshrii commented 6 years ago

@jbreiden

Jeff, would it be possible for you to update the langdata repository to match the 4.00alpha tessdata files, on behalf of @theraysmith? It would help those who are trying to fine-tune traineddata for their specific languages. Thanks!

Edit: It would also address Debian's requirement regarding language source files. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609

stweil commented 6 years ago

Updating the langdata files would also help to identify and fix systematic bugs for future trainings.

jbreiden commented 6 years ago

@Shreeshrii @stweil Okay, I'll investigate. Will probably take some time.

amitdo commented 6 years ago

@jbreiden

How are you going to support SIMD with the debian/ubuntu binary package? https://lists.debian.org/debian-mentors/2017/03/msg00163.html

jbreiden commented 6 years ago

I didn't do anything special with packaging for SIMD. I basically shipped the code as packaged by Alexander and am waiting for the bug reports to start rolling in. Can someone remind me which processors are going to have trouble, and what that will look like from a user's perspective at runtime? (I forgot that my Pentium G4560 chip was useful for testing such things, so I gave it to a young child to play with.)

stweil commented 6 years ago

Normally all kinds of processors should work, because Tesseract tests at runtime whether the CPU supports AVX2 or SSE and chooses the right code automatically.

Problems occurred in the past with virtual machines which claimed to support AVX2 but did not do so. That case needs an improved runtime test (which is still missing) to work.
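
To make the idea of runtime selection concrete, here is a rough sketch in Python. It is not Tesseract's code (which is C++): the function names and the "kernel" labels are hypothetical, and the flag spellings are the ones Linux prints in /proc/cpuinfo. Like the behaviour described above, it trusts whatever the CPU (or virtual machine) reports.

```python
# Preference order, fastest first. The names follow the flag spelling used in
# Linux /proc/cpuinfo; the kernel labels are hypothetical, not Tesseract's.
PREFERRED = ("avx2", "avx", "sse4_1")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def choose_kernel(flags):
    """Pick the best SIMD variant the CPU reports, else a portable fallback."""
    for name in PREFERRED:
        if name in flags:
            return name
    return "generic"


# Made-up sample input; on Linux one could read /proc/cpuinfo instead.
sample = "model name : Example CPU\nflags : fpu sse sse2 sse4_1 avx\n"
print(choose_kernel(parse_cpu_flags(sample)))
```

A machine that advertises a flag it cannot actually execute would still be handed the faster variant here, which is exactly the virtual machine problem mentioned above.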

amitdo commented 6 years ago

There is also a build time detection which adds -msse4.2 / -mavx / -mavx2 flags.

jbreiden commented 6 years ago

The build time detection is just about the compiler, right? I built the X86_64 package on an Intel Xeon E5-1650 which does not have AVX2. But that's fine and doesn't hurt anyone. Right? Right? It's got to be right.

checking whether C++ compiler accepts -mavx... yes
checking whether C++ compiler accepts -mavx2... yes
checking whether C++ compiler accepts -msse4.1... yes

stweil commented 6 years ago

Yes, that's perfect. You can then run tesseract -v to see which SIMD instructions were detected for your CPU.

amitdo commented 6 years ago

Here is the actual script it uses: https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html

You can then run tesseract -v to see which SIMD instructions were detected for your CPU.

This is done by the runtime detection.

amitdo commented 6 years ago

I think that when you use a flag like -msse4.2, the compiler can automatically use SSE4.2 instructions anywhere in the code. That code will cause SIGILL on machines that lack SSE4.2 instructions.

amitdo commented 6 years ago

... but Tesseract does not use these flags globally. It uses them only in arch/Makefile.am.

I hope that this approach is enough to save you from the above issue.

amitdo commented 6 years ago

@jbreiden

https://packages.debian.org/sid/tesseract-ocr

Tesseract command line OCR tool

The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google and is probably one of the most accurate open source OCR engines available. It can read a wide variety of image formats and convert them to text in over 40 languages. This package includes the command line tool.

I suggest changing it to something like this:

The Tesseract OCR engine was originally developed by HP between 1985 and 1998. Since 2006 it has been developed as an open source project by Google. It can read a wide variety of image formats and convert them to text. It supports over 120 languages.

This package includes the command line tool.

jbreiden commented 6 years ago

How about I copy the description in the Wiki or README file? (And should they be synchronized?)

Tesseract is an open source Optical Character Recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract typed, handwritten or printed text from images. It supports a wide variety of languages.

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.

amitdo commented 6 years ago

No problem. I mainly dislike the '23 years ago it was one of the best'. WHO CARES? (Sorry Ray).

That's why I removed it from the README.

jbreiden commented 6 years ago

Why is the README talking about handwriting? Tesseract is terrible at handwriting.

amitdo commented 6 years ago

You mean the wiki Home, not the README.

I fixed it.

stweil commented 6 years ago

WHO CARES?

Does anybody care that the Wiki (on GitHub, but also the English Wikipedia) still says Optical Character Recognition although the old Tesseract detects glyphs and the new Tesseract detects lines of text? Would the following text be better?

Tesseract is an open source text recognition ("Optical Character Recognition" = OCR) engine [...]

amitdo commented 6 years ago

The term for what Tesseract is doing is OCR. Even if not accurate, that's the term people recognize. It's not our job to invent new terms.

stweil commented 6 years ago

Sure. OCR is also part of the GitHub repository name. I do not want to change that. That's why it is still part of the new text which I suggested.

amitdo commented 6 years ago

https://www.google.co.il/search?q=%22text+recognition%22

It seems that the term 'text recognition' is commonly used as replacement for 'OCR' :)

stweil commented 6 years ago

Yes. It is quite common that abbreviations live much longer than their original meaning, so that original meaning remains only relevant for encyclopedias. Example: search for 'machines' on the IBM website. You won't find that word, although the 'M' is still part of the name.

amitdo commented 6 years ago

I still sometimes use the term 'machine' to refer to a computer. Maybe I'm too old (or just a geek?). :-)

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/wiki/Home/_compare/f2546f61f52071...de992eeb5b252d

jbreiden commented 6 years ago

Here's what will ship with Ubuntu 18.04. Tag (or don't tag) as you see fit.

 Tesseract is an open source Optical Character Recognition (OCR)
 Engine. It can be used directly, or (for programmers) using an API to
 extract printed text from images. It supports a wide variety of
 languages. This package includes the command line tool.

$ dpkg -l tesseract-ocr tesseract-ocr-eng
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                     Version                   Architecture              Description
+++-========================================-=========================-=========================-===============================================================
ii  tesseract-ocr                            4.00~git2219-40f43111-1.2 amd64                     Tesseract command line OCR tool
ii  tesseract-ocr-eng                        4.00~git24-0e00fe6-1.2    all                       tesseract-ocr language files for English
$ tesseract --version
tesseract 4.00.00alpha
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.30 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.6.0 : libopenjp2 2.1.2

 Found AVX
 Found SSE
$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

Shreeshrii commented 6 years ago

Tag (or don't tag) as you see fit.

@zdenop I think you should tag a release so that other distros can also be updated.

I also suggest updating the version string to match 4.00~git2219-40f43111-1.2 or a similar format. There is a lot of confusion with tesseract 4.00.00alpha, which applies to hundreds of commits.

jbreiden commented 6 years ago

Very Sorry! I misread the dashboards. Looks like the slightly older code 4.00~git2207-766b7bd6-3.1 will ship, which is missing some of the last minute improvements. I believe it is no longer possible to change the version string (or anything else about Tesseract) for Ubuntu 18.04.

zdenop commented 6 years ago

  1. Tagging the repo will create a release on GitHub, and AFAIR that caused problems for some people.
  2. Other distributions will take either:
    • the latest GitHub master (to include all additional fixes)
    • the latest stable release

Nobody would care what another distribution did... I would prefer Ray give a clear statement about the next step for the 4.0 release.

stweil commented 6 years ago

Tagging repo will cause release in github [...]

That's desired. GitHub also allows marking such releases as pre-release – just edit the release information of the new release. That should minimize problems for other people.

The release of today would be 4.0.0-alpha.20180302.

zdenop commented 6 years ago

OK, but do we expect more code/fixes to come for the 4.0 release?

stweil commented 6 years ago

Yes, why not? I don't plan to stop sending code / fixes :-), and other people will continue sending fixes, too. So either we'll have a 4.0.0-alpha.20180401, or a 4.0.0 without alpha, or a 4.0.1, or Ray sends a bunch of code which justifies a 4.1.0, ...

Shreeshrii commented 6 years ago

I would prefer Ray give a clear statement about the next step for the 4.0 release.

@jbreiden Please check with Ray. Thanks!

jbreiden commented 6 years ago

I would prefer Ray to speak for himself, too! However, I don't think there will be large Tesseract changes from him in either short or medium term.

amitdo commented 6 years ago

Zdenko, I also think we should finally release 4.0.0. It's time to get rid of the alpha status.

amitdo commented 6 years ago

If you decide to release it soon, don't forget to first update ccutil/version.h

jbreiden commented 6 years ago

Ha! Looks like they took 40f43111 after all, one day after deadline.

https://launchpad.net/ubuntu/+source/tesseract

zdenop commented 6 years ago

Jeff,

is there any info from Ray about 4.00 release? Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

Zdenko

amitdo commented 6 years ago

We should make the decision ourselves.

What about this proposal:

https://semver.org/ https://packages.ubuntu.com/bionic/tesseract-ocr

Mark any non final 4.0.0 as 'pre-release'.

https://help.github.com/articles/creating-releases/

  1. If the release is unstable, select This is a pre-release to notify users that it's not ready for production.

For each (pre-)release, update ccutil/version.h. https://github.com/tesseract-ocr/tesseract/blob/master/ccutil/version.h
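
One practical benefit of following semver.org is that tags get a well-defined order. The sketch below implements the SemVer 2.0.0 precedence rules as a sort key. The function names are made up for illustration, and the tags are the ones proposed in this thread.

```python
def semver_key(tag):
    """Sort key following SemVer 2.0.0 precedence (build metadata ignored)."""
    core, _, prerelease = tag.partition("+")[0].partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not prerelease:
        # A normal version outranks all of its pre-releases.
        return (major, minor, patch, 1, ())
    # Numeric identifiers compare as numbers and rank below alphanumeric ones.
    identifiers = tuple(
        (0, int(ident), "") if ident.isdigit() else (1, 0, ident)
        for ident in prerelease.split(".")
    )
    return (major, minor, patch, 0, identifiers)


def is_prerelease(tag):
    """True for tags that would be marked 'pre-release' on GitHub."""
    return semver_key(tag)[3] == 0


tags = ["4.0.0", "4.0.0-rc.1", "4.0.0-alpha.20180302", "4.0.0-beta.1"]
print(sorted(tags, key=semver_key))
```

Note that the existing version string 4.00.00alpha is not in this form, so the sketch rejects it with a ValueError.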

jbreiden commented 6 years ago

is there any info from Ray about 4.00 release?

No info.

Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

Millions of people will use commit 40f4311 because of Ubuntu, and I think the main benefit of a tag is to help understand bug reports coming from these users. There have been many good tag proposals in this thread from @amitdo and @stweil and @zdenop and @WilliamTambellini. I don't have a strong opinion about which one is best. If I were forced to choose, I'd probably tag commit 40f4311 with 4.0.0-beta.1. If that feels like too much commitment, then use a very specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense to apply the same tag to the fast training data at commit 0e00fe6.

Shreeshrii commented 6 years ago

Jeff, the traineddata files have a version string of 4.00.00alpha with a date (062917 if I remember correctly). tesseract also reports version of 4.00.00alpha. Will it be possible to change these in the Ubuntu 18.04 packages now?

jbreiden commented 6 years ago

No more changes possible. Everything will look exactly as described here: https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-369704920

zdenop commented 6 years ago

done.

Zdenko

Shreeshrii commented 6 years ago

Great!!!!

amitdo commented 6 years ago

I suggest that we release 4.0.0 (final) by the end of April. About two weeks before this release, we should release 4.0.0-rc.1.