Closed Shreeshrii closed 6 years ago
Hi everybody. Looks like the Dec 15 release is/was a good milestone and at least a good test on Ubuntu 18. What about creating a "4.0.0beta" tag now?
The package version number is 4.00~git2188-cdc35338-2, so that's commit cdc35338. Maybe give it a little time to settle? We had a critical bug the other day, but that turned out to be in Leptonica.
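For anyone mapping a package version back to a commit, here is a minimal Python sketch. It assumes the layout `<upstream>~git<serial>-<short hash>-<Debian revision>` seen in the version strings in this thread; the function name and the field names are hypothetical, and the meaning of the serial number (presumably a commit count) is a guess.

```python
import re

# Assumed layout, inferred only from the version strings quoted in this thread:
#   <upstream>~git<serial>-<short commit hash>-<Debian revision>
PACKAGE_VERSION = re.compile(
    r"^(?P<upstream>[^~]+)~git(?P<serial>\d+)-(?P<commit>[0-9a-f]+)-(?P<revision>.+)$"
)


def parse_package_version(version):
    """Split a package version string into its assumed parts (hypothetical helper)."""
    match = PACKAGE_VERSION.match(version)
    if match is None:
        raise ValueError("unexpected version layout: " + version)
    return match.groupdict()


info = parse_package_version("4.00~git2188-cdc35338-2")
print(info["commit"])  # the short hash to look up in the repository
```

The hash can then be passed to `git show` or pasted into the GitHub commit URL.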
https://wiki.ubuntu.com/BionicBeaver/ReleaseSchedule
January 11th | Alpha 1
February 1st | Alpha 2
March 1st | FeatureFreeze, Debian Import Freeze
March 8th | Beta 1 Freeze
April 5th | Final Beta Freeze, Final Beta
April 19th | FinalFreeze, ReleaseCandidate
April 26th | FinalRelease, Ubuntu 18.04
The suggested tag would be 4.0.0-beta.20180105 (for today), see the discussion above.
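A date-stamped tag of this shape is easy to generate. The sketch below is only an illustration; `dated_tag` is a hypothetical helper, not part of any Tesseract tooling.

```python
from datetime import date


def dated_tag(core, stage, day):
    """Build '<core>-<stage>.<YYYYMMDD>' for a snapshot pre-release (hypothetical helper)."""
    return "{}-{}.{}".format(core, stage, day.strftime("%Y%m%d"))


print(dated_tag("4.0.0", "beta", date(2018, 1, 5)))
```

Because the date is written as YYYYMMDD, later snapshots of the same stage compare as larger numbers.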
Well, I don't really care about the name of the tag as long as one is created soon. Reminder: a tag is 'just' a tag (it's not a branch), just super convenient for comparing different milestones.
@jbreiden
Jeff, Would it be possible for you to update the langdata repository to match the 4.00alpha tessdata files, on behalf of @theraysmith ? It would help out those who are trying to finetune traineddata for their specific languages. Thanks!
Edit: It will also address Debian's requirement regarding language source files. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609
Updating the langdata files would also help to identify and fix systematic bugs for future trainings.
@Shreeshrii @stweil Okay, I'll investigate. Will probably take some time.
@jbreiden
How are you going to support SIMD with the debian/ubuntu binary package? https://lists.debian.org/debian-mentors/2017/03/msg00163.html
I didn't do anything special with packaging for SIMD. I basically shipped the code as packaged by Alexander and am waiting for the bug reports to start rolling in. Can someone remind me which processors are going to have trouble, and what that will look like from a user's perspective at runtime? (I forgot that my Pentium G4560 chip was useful for testing such things, so I gave it to a young child to play with.)
Normally all kinds of processors should work, because Tesseract tests at runtime whether the CPU supports AVX2 or SSE and chooses the right code automatically.
Problems occurred in the past with virtual machines which claimed to support AVX2 but did not do so. That case needs an improved runtime test (which is still missing) to work.
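To see what a given machine (or virtual machine) advertises, one rough check on x86 Linux is to read the `flags` line of `/proc/cpuinfo`. The Python sketch below is not Tesseract's own detection code, which lives in its C++ sources; the function names are hypothetical. As noted above, an advertised flag is not proof that the instructions actually work inside a VM.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not an x86 Linux system)


def simd_summary(flags):
    """Report which of a few SIMD flags are advertised (names as spelled in /proc/cpuinfo)."""
    return {name: name in flags for name in ("sse4_1", "avx", "avx2")}
```

On a Linux machine this could be used as `simd_summary(cpu_flags(open("/proc/cpuinfo").read()))`.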
There is also a build-time detection which adds the -msse4.2 / -mavx / -mavx2 flags.
The build-time detection is just about the compiler, right? I built the x86_64 package on an Intel Xeon E5-1650, which does not have AVX2. But that's fine and doesn't hurt anyone. Right? Right? It's got to be right.
checking whether C++ compiler accepts -mavx... yes
checking whether C++ compiler accepts -mavx2... yes
checking whether C++ compiler accepts -msse4.1... yes
Yes, that's perfect. You can then run tesseract -v to see which SIMD instructions were detected for your CPU.
Here is the actual script it uses: https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html
You can then run tesseract -v to see which SIMD instructions were detected for your CPU.
This is done by the runtime detection.
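When collecting bug reports it can help to pull the detected SIMD features out of that version output automatically. The Python sketch below assumes the output contains lines of the form `Found AVX`, as in the 4.00.00alpha output quoted in this thread; other builds may print something different. Both function names are hypothetical.

```python
import subprocess


def found_simd(version_text):
    """Collect the feature names from lines such as 'Found AVX'."""
    features = []
    for line in version_text.splitlines():
        words = line.split()
        if len(words) == 2 and words[0] == "Found":
            features.append(words[1])
    return features


def tesseract_version_text():
    """Run 'tesseract --version' (must be on PATH) and return everything it prints."""
    # stderr is merged into stdout because builds may differ in which stream they use.
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


# Sample text in the shape quoted in this thread.
SAMPLE = "tesseract 4.00.00alpha\n leptonica-1.75.3\n Found AVX\n Found SSE\n"
print(found_simd(SAMPLE))
```

On a machine with Tesseract installed, `found_simd(tesseract_version_text())` would report what the runtime detection saw.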
I think that when you use a flag like -msse4.2, the compiler can automatically use SSE4.2 instructions anywhere in the code. The SSE4.2 code will cause SIGILL on machines that lack SSE4.2 instructions.
... but Tesseract does not use these flags globally. It uses them only in arch/Makefile.am.
I hope that this approach is enough to save you from the above issue.
@jbreiden
https://packages.debian.org/sid/tesseract-ocr
Tesseract command line OCR tool
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google and is probably one of the most accurate open source OCR engines available. It can read a wide variety of image formats and convert them to text in over 40 languages. This package includes the command line tool.
I suggest changing it to something like this:
The Tesseract OCR engine was originally developed by HP between 1985 and 1998. Since 2006 it has been developed as an open source project by Google. It can read a wide variety of image formats and convert them to text. It supports over 120 languages.
This package includes the command line tool.
How about I copy the description in the Wiki or README file? (And should they be synchronized?)
Tesseract is an open source Optical Character Recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract typed, handwritten or printed text from images. It supports a wide variety of languages.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.
No problem. I mainly dislike the '23 years ago it was one of the best'. WHO CARES? (Sorry Ray).
That's why I removed it from the README.
Why is the README talking about handwriting? Tesseract is terrible at handwriting.
You mean the wiki Home, not the README.
I fixed it.
WHO CARES?
Does anybody care that the Wiki (on GitHub, but also the English Wikipedia) still says Optical Character Recognition although the old Tesseract detects glyphs and the new Tesseract detects lines of text? Would the following text be better?
Tesseract is an open source text recognition ("Optical Character Recognition" = OCR) engine [...]
The term for what Tesseract is doing is OCR. Even if not accurate, that's the term people recognize. It's not our job to invent new terms.
Sure. OCR is also part of the GitHub repository name. I do not want to change that. That's why it is still part of the new text which I suggested.
https://www.google.co.il/search?q=%22text+recognition%22
It seems that the term 'text recognition' is commonly used as replacement for 'OCR' :)
Yes. It is quite common that abbreviations live much longer than their original meaning, so that original meaning remains only relevant for encyclopedias. Example: search for 'machines' on the IBM website. You won't find that word, although the 'M' is still part of the name.
I still sometimes use the term 'machine' to refer to a computer. Maybe I'm too old (or just a geek?). :-)
Here's what will ship with Ubuntu 18.04. Tag (or don't tag) as you see fit.
Tesseract is an open source Optical Character Recognition (OCR)
Engine. It can be used directly, or (for programmers) using an API to
extract printed text from images. It supports a wide variety of
languages. This package includes the command line tool.
$ dpkg -l tesseract-ocr tesseract-ocr-eng
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-========================================-=========================-=========================-===============================================================
ii tesseract-ocr 4.00~git2219-40f43111-1.2 amd64 Tesseract command line OCR tool
ii tesseract-ocr-eng 4.00~git24-0e00fe6-1.2 all tesseract-ocr language files for English
$ tesseract --version
tesseract 4.00.00alpha
leptonica-1.75.3
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.30 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.6.0 : libopenjp2 2.1.2
Found AVX
Found SSE
$ tesseract
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]
OCR options:
-l LANG[+LANG] Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.
Single options:
--help Show this help message.
--help-extra Show extra help for advanced users.
--version Show version information.
--list-langs List available languages for tesseract engine.
@zdenop I think you should tag a release so that other distros can also be updated.
I also suggest updating the version string to match 4.00~git2219-40f43111-1.2 or a similar format. There is a lot of confusion with tesseract 4.00.00alpha, which applies to hundreds of commits.
Very sorry! I misread the dashboards. Looks like the slightly older code 4.00~git2207-766b7bd6-3.1 will ship, which is missing some of the last-minute improvements. I believe it is no longer possible to change the version string (or anything else about Tesseract) for Ubuntu 18.04.
Tagging repo will cause release in github [...]
That's desired. GitHub also allows marking such releases as pre-release – just edit the release information of the new release. That should minimize problems for other people.
The release of today would be 4.0.0-alpha.20180302.
OK, but do we expect more code/fixes to come for the 4.0 release?
Yes, why not? I don't plan to stop sending code / fixes :-), and other people will continue sending fixes, too. So either we'll have a 4.0.0-alpha.20180401, or a 4.0.0 without alpha, or a 4.0.1, or Ray sends a bunch of code which justifies a 4.1.0, ...
I would prefer that Ray give a clear statement about the next step for the 4.0 release.
@jbreiden Please check with Ray. Thanks!
I would prefer Ray to speak for himself, too! However, I don't think there will be large Tesseract changes from him in either short or medium term.
Zdenko, I also think we should finally release 4.0.0. It's time to get rid of the alpha status.
If you decide to release it soon, don't forget to first update ccutil/version.h.
Ha! Looks like they took 40f43111 after all, one day after deadline. https://launchpad.net/ubuntu/+source/tesseract
Jeff,
is there any info from Ray about 4.00 release? Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?
Zdenko
We should make the decision ourselves.
What about this proposal:
- 4.00-alpha.2+git.2219.40f43111.
- 4.0.0-beta.1.
- 4.0.0 30-60 days after beta1 (maybe with one more beta and one rc).
https://semver.org/ https://packages.ubuntu.com/bionic/tesseract-ocr
Mark any non-final 4.0.0 as 'pre-release'.
https://help.github.com/articles/creating-releases/
- If the release is unstable, select This is a pre-release to notify users that it's not ready for production.
For each (pre-)release, update ccutil/version.h.
https://github.com/tesseract-ocr/tesseract/blob/master/ccutil/version.h
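Since the proposal points at semver.org, it may help to see how the tag names suggested in this thread would sort under SemVer precedence. The Python sketch below is a simplified reading of those rules: it ignores build metadata (the part after `+`) and does not enforce every detail of the specification, such as the ban on leading zeros. `precedence_key` is a hypothetical helper.

```python
import re

# Simplified SemVer shape: MAJOR.MINOR.PATCH[-prerelease][+build]
SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?$"
)


def precedence_key(tag):
    """Return a sort key that orders tags by (simplified) SemVer precedence."""
    match = SEMVER.match(tag)
    if match is None:
        raise ValueError("not in MAJOR.MINOR.PATCH[-pre][+build] form: " + tag)
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    pre = match.group(4)
    if pre is None:
        # A final release sorts after every pre-release of the same core version.
        return (major, minor, patch, 1, ())
    # Numeric identifiers compare as numbers and sort before alphanumeric ones.
    ids = tuple(
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in pre.split(".")
    )
    return (major, minor, patch, 0, ids)


tags = ["4.0.0", "4.0.0-rc.1", "4.0.0-beta.1", "4.0.0-alpha.20180302", "4.0.1"]
print(sorted(tags, key=precedence_key))
```

Names such as 4.00-alpha.2+git.2219.40f43111 or 4.0.0beta do not match this pattern and raise `ValueError` here, which is one argument for settling on the MAJOR.MINOR.PATCH-stage.N form.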
is there any info from Ray about 4.00 release?
No info.
Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?
Millions of people will use commit 40f4311 because of Ubuntu, and I think the main benefit of a tag is to help understand bug reports coming from these users. There have been many good tag proposals in this thread from @amitdo and @stweil and @zdenop and @WilliamTambellini. I don't have a strong opinion about which one is best. If I was forced to choose, I'd probably tag commit 40f4311 with 4.0.0-beta.1. If that feels like too much commitment, then use a very specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense to apply the same tag to the fast training data at commit 0e00fe6.
Jeff, the traineddata files have a version string of 4.00.00alpha with a date (062917 if I remember correctly). tesseract also reports version of 4.00.00alpha. Will it be possible to change these in the Ubuntu 18.04 packages now?
No more changes possible. Everything will look exactly as described here: https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-369704920
done.
Zdenko
I suggest that we release 4.0.0 (final) by the end of April. About 2 weeks before this release, we should release 4.0.0-rc.1.
Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.
@zdenop Please add a new tag eg. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!