tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: Tesseract 4.0.0 – open tasks #1423

Closed stweil closed 6 years ago

stweil commented 6 years ago

I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.

These tasks are on my own list; whether we consider them important for the new release is up for discussion:

amitdo commented 6 years ago

> Add option to optionally select implementation for dot product (CPU, SSE, AVX, ...).

SSE and AVX are also done on CPU :)
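One way to read the request is a user override that falls back to auto-detection. A minimal shell sketch of that selection logic follows. The function name `select_dotproduct`, the implementation names `generic`, `sse` and `avx`, and the mapping from the `sse4_1` CPU flag to `sse` are assumptions made for illustration; none of them is an existing Tesseract option.

```shell
#!/bin/sh
# Hypothetical helper: print the name of a dot product implementation.
# $1 = space separated CPU feature flags (e.g. the "flags" line of /proc/cpuinfo)
# $2 = optional user override; an empty value means auto-detect
select_dotproduct() {
  flags=" $1 "
  override=$2
  # An explicit choice by the user always wins.
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
    return 0
  fi
  # Otherwise prefer the widest instruction set the CPU reports.
  case "$flags" in
    *" avx "*)    printf 'avx\n' ;;
    *" sse4_1 "*) printf 'sse\n' ;;
    *)            printf 'generic\n' ;;
  esac
}
```

On Linux the flags could be taken from `/proc/cpuinfo`, for example `select_dotproduct "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)" "$DOTPRODUCT"`, where `DOTPRODUCT` is a hypothetical environment variable holding the override.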

amitdo commented 6 years ago

> Remove deprecated code. This does not include OpenCL or the old Tesseract engine.

Adding a compile option NO_LEGACY_OCR_ENGINE would be nice.

amitdo commented 6 years ago

I'll do it.

Shreeshrii commented 6 years ago

> Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.

My suggestion would be to leave --list-langs as is, and add this as --list-langs-details, or as --list-lang-details for one language file based on its language code.

Shreeshrii commented 6 years ago

--list-langs should also display the directory it is using. This is useful when tessdata files are installed in multiple directories, e.g. by a PPA or Linux distribution package versus in a local build directory.
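Until something like this exists in the command itself, the idea can be approximated outside of Tesseract. The following is a minimal sketch; `list_langs_with_dirs` is a hypothetical helper, and it simply treats every `*.traineddata` file in a directory as one language without opening or parsing the file.

```shell
#!/bin/sh
# Hypothetical helper, not part of the tesseract CLI: for each directory given,
# print the directory followed by the languages found in it.
list_langs_with_dirs() {
  for dir in "$@"; do
    [ -d "$dir" ] || continue          # skip directories that do not exist
    printf 'Directory: %s\n' "$dir"
    for file in "$dir"/*.traineddata; do
      [ -e "$file" ] || continue       # the glob matched nothing
      printf '  %s\n' "$(basename "$file" .traineddata)"
    done
  done
}
```

A call such as `list_langs_with_dirs /usr/share/tesseract-ocr/4.00/tessdata ../tessdata_fast` would then show which directory provides which language.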

Shreeshrii commented 6 years ago

Re: tessdata: the configs and tessconfigs directories and pdf.ttf are needed in whichever directory is being used via TESSDATA_PREFIX or --tessdata-dir.

E.g. when doing LSTM training, the lstm.train config file is not found if one uses tessdata_best as the continue_from directory.

My workaround has been to copy these to both tessdata_fast and tessdata_best repos.
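The workaround can be sketched as a small shell function. The function name and the warning text are hypothetical; the three item names are the ones mentioned above.

```shell
#!/bin/sh
# Sketch of the workaround: copy the shared support files from one tessdata
# directory into another one that only holds traineddata files.
# $1 = source tessdata directory (contains configs/, tessconfigs/, pdf.ttf)
# $2 = destination directory, e.g. a tessdata_best checkout
copy_tessdata_support_files() {
  src=$1
  dst=$2
  for item in configs tessconfigs pdf.ttf; do
    if [ -e "$src/$item" ]; then
      # -R copies directories recursively and also works for the single file
      cp -R "$src/$item" "$dst/"
    else
      echo "warning: $src/$item not found" >&2
    fi
  done
}
```

For example, `copy_tessdata_support_files ./tessdata ../tessdata_best` would make lstm.train available when tessdata_best is used as the data directory, assuming `./tessdata` is the directory from the source tree.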

Shreeshrii commented 6 years ago

Add/implement install-langs.

jbreiden commented 6 years ago

A week with no API changes.

Shreeshrii commented 6 years ago

Add a simple bash script for building tesseract.

I use the following. It should probably also include commands that offer to download the osd and eng traineddata files for first-time users.

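#!/bin/bash
# Stop at the first failing step instead of installing a broken build.
set -e

# Configure and build the library and the command line tool.
./autogen.sh
./configure --disable-openmp --disable-graphics --disable-opencl
make
sudo make install
sudo ldconfig

# Build and install the training tools.
make training
sudo make training-install

# Refresh the googletest submodule, then run the unit tests.
rm -rf ./googletest
git submodule update --init
autoreconf -fiv
#export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
export TESSDATA_PREFIX=../tessdata_fast
make check
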
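The download step for first-time users could look roughly like the sketch below. It assumes the models are fetched from the tessdata_fast repository on GitHub and that its default branch is `main`; the function names and the two variables are hypothetical, and `curl` must be installed.

```shell
#!/bin/sh
# Assumption: models come from the tessdata_fast repository on GitHub.
# The branch name may differ, so both values can be overridden.
TESSDATA_REPO_URL=${TESSDATA_REPO_URL:-https://github.com/tesseract-ocr/tessdata_fast}
TESSDATA_BRANCH=${TESSDATA_BRANCH:-main}

# Print the download URL for one language code.
traineddata_url() {
  printf '%s/raw/%s/%s.traineddata\n' "$TESSDATA_REPO_URL" "$TESSDATA_BRANCH" "$1"
}

# Download the given languages into directory $1 unless they are already there.
download_traineddata() {
  target_dir=$1
  shift
  mkdir -p "$target_dir"
  for lang in "$@"; do
    if [ -f "$target_dir/$lang.traineddata" ]; then
      echo "$lang.traineddata already present, skipping"
    else
      # -f fails on HTTP errors, -L follows the redirect to the raw file
      curl -fL -o "$target_dir/$lang.traineddata" "$(traineddata_url "$lang")"
    fi
  done
}
```

A build script could end with `download_traineddata "$HOME/tessdata" osd eng` and then point TESSDATA_PREFIX at that directory.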
zdenop commented 6 years ago

I would add this:

amitdo commented 6 years ago

> A week with no API changes.

Mission impossible.


Edit: That was a joke.

zdenop commented 6 years ago

There was an (online) tool that monitors API changes (for tesseract), but I cannot find a link to it. Does somebody have it? Can somebody show the changes between 4.0.0-beta.1 and the current code?

Shreeshrii commented 6 years ago

Please see https://github.com/tesseract-ocr/tesseract/issues/793

The tracker is at https://abi-laboratory.pro/tracker/timeline/tesseract/. Currently it is tracking the stable release 3.05.01.

@zdenop Please tag another release for the 3.05 branch, since 3.05.01 had a couple of problems which have been fixed in later commits.

stweil commented 6 years ago

~The good news is that the latest Debian / Ubuntu tesseract-ocr does not include the development files, so there will not be any API between that version and the future 4.0.0 which we have to take care of.~

Sorry, I was wrong: there is libtesseract-dev.

Shreeshrii commented 6 years ago

@zdenop I suggest adding labels to issues, using the following proposed list of keywords, so that it is easy to find related issues and to see whether any critical issues are pending.

4.0.0: for the final release
4.0x: for 4.00.00alpha and 4.0.0-beta.1
3.0x: for 3.05/3.04

LSTM training
training: for 3.0x legacy tesseract training

Accuracy: for reports of incorrect recognition
Performance: for questions related to speed
Crashes: for asserts and program crashes

Build: related to compile and build from source

This is a suggested list.

amitdo commented 6 years ago

IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.

A new branch should be created for 4.0.0. Only commits that follow the above rules should be backported from master. 4.0.0 should have at least rc.1 before final release.

We can decide that 4.1.0 will be released 2-3 months after 4.0.0 (still with legacy?).

stweil commented 6 years ago

How do you define "significantly"? There are some changes with the latest Git master:

Would you suggest reverting these changes? They are major changes which require a step of the major version, so I think 4.0.0 is a good candidate to include those changes. Otherwise we would have to wait for 5.0.0.

I would even go further and fix potential name space problems with the 58 include files which are part of the Tesseract programming API in 4.0.0-beta.1, although that is a significant change, too.

amitdo commented 6 years ago

> How do you define "significantly"?

Basically, any bug fix is OK as long as it follows the two conditions I specified. No new features.

Shreeshrii commented 6 years ago

What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha.

I think our aim should be to get all significant changes included in final 4.0.0 and get it ready in time for Ubuntu 18.10. What are the deadlines for that?



amitdo commented 6 years ago

18.04 is much more significant because it's LTS - supported for 5 years. 18.10 will be supported for only 9 months. We should not care about it.

amitdo commented 6 years ago

> What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha.

We tagged it as 4.0.0-beta.1.

amitdo commented 6 years ago

Another option is to skip final 4.0.0 and go straight to 5.0.0.

Shreeshrii commented 6 years ago

As per Jeff, we can't make any changes to what is shipped for 18.04.

But we still have time to do another beta, rc-1 and final 4.0.0 release in time for 18.10.

I do not really know much about Linux releases, but my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.

@AlexanderP please explain whether the above is possible.


amitdo commented 6 years ago

@zdenop, your thoughts about these two options?

Shreeshrii commented 6 years ago

> What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha.

> We tagged it as 4.0.0-beta.1.

Yes, that tag is within github.

Please see the post by Jeff, where he has shown what tesseract -v will report for 18.04.

Here is the link:

https://github.com/tesseract-ocr/tesseract/issues/995#comment-369704920

amitdo commented 6 years ago

Jeff just said that the version in Ubuntu won't change in final 18.04.

We are talking about what we want to do in Tesseract's official GitHub repo. We are the upstream, not Ubuntu!

Shreeshrii commented 6 years ago

> IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.

I am trying to understand how 4.0.0 final release on github relates to Ubuntu 18.04, in light of the above.

I am missing your reasoning for why it should not significantly diverge.


amitdo commented 6 years ago

I want to hear @zdenop's and @jbreiden's opinions.

I think that as maintainers, they will understand (but not necessarily agree with) my proposal.

zdenop commented 6 years ago

First of all, I would like to know whether the final 4.0 release will be included in updates of Ubuntu (18.04)/Debian... If yes, then we should release 4.0 ASAP (e.g. fixes for issues will be accepted, but no other code changes).

Next, I would like to see a report like this to better understand the latest changes.

Then we can decide how 4.0 will be released:

  1. as a branch started from the 4.0.0-beta.1 tag (no changes in the master branch; only fixes will be ported to the 4.0 release branch)
  2. or from master (we accept all commits applied so far).

I do not expect to revert any commit in master.
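In git terms, option 1 could be sketched as follows. The two function names and the branch name used in the example call are hypothetical; only `git branch`, `git checkout` and `git cherry-pick -x` are assumed.

```shell
#!/bin/sh
# Sketch of option 1: start a release branch at the beta tag and backport fixes.

# Create branch $1 at tag $2 without touching master.
create_release_branch() {
  git branch "$1" "$2"
}

# Backport commit $2 onto branch $1; -x records the original commit id
# in the message of the new commit.
backport_fix() {
  git checkout "$1" && git cherry-pick -x "$2"
}
```

For example, `create_release_branch 4.0 4.0.0-beta.1` followed by `backport_fix 4.0 <commit>` for each bug fix taken from master. Nothing on master is reverted or rewritten by this.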

amitdo commented 6 years ago

> as branch started from 4.0.0-beta.1 tag (no changes in master branch - only fixes will be ported to 4.0 release branch)

> I do not expect to revert any commit in master.

Yes, what you wrote here is what I meant.

AlexanderP commented 6 years ago

> As per Jeff, we can't make any changes to what is shipped for 18.04.

> But we still have time to do another beta, rc-1 and final 4.0.0 release in time for 18.10.

> I do not really know much about Linux releases, but my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.

> @AlexanderP please explain whether the above is possible.

@Shreeshrii The update should go through without problems.

jbreiden commented 6 years ago

Please don't worry too much about Ubuntu, everything is going to be fine. I've had a crazy day today, but will have time tomorrow to discuss.

jbreiden commented 6 years ago

> First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian...

The version of Tesseract that ships with Ubuntu 18.04 will not change, unless there is a major security issue. See this chart for shipping Tesseract versions for different Ubuntu releases. https://launchpad.net/ubuntu/+source/tesseract

> my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.

Ubuntu users have many choices if they want a newer Tesseract. They can build from source. They can install from Alexander's PPA. There's something called a "snap" which I don't know too much about. Maybe other ways too.

> Shipping alpha/beta software in final LTS was/is a really bad idea. I bet it's against Ubuntu's policies.

This decision belongs to the Debian/Ubuntu package maintainers, which is Alexander and myself. I am a member of the Debian Project, and sponsored Alexander's excellent packaging work as official. I thought users would significantly benefit from the improved accuracy of LSTM Tesseract. I think (and hope) most developers will understand that the Tesseract API is still changing, and not have too much trouble.

> We are the upstream, not Ubuntu!

That's right! Don't feel constrained. It is perfectly okay for Tesseract to change API before final release. If the API changes, Ubuntu and other Linux distributions will deal with it, and it won't be too hard. For example, in Ubuntu, the only direct dependencies on libtesseract4 are gimagereader, libavfilter-extra6, libopenalpr2, libopencv-contrib3.2, and libsikulixapi-jni. These programs use just a tiny fraction of Tesseract's API. It will be up to Alexander and myself to make sure everything continues to work well together in Debian/Ubuntu both now and in the future.

stweil commented 6 years ago

Alexander and Jeff, I'll support you where needed, too, of course.

amitdo commented 6 years ago

Jeff, Alexander, I’m sorry that I caused offense.

jbreiden commented 6 years ago

@amitdo No offense taken. We are all on the same team.

zdenop commented 6 years ago

@stweil: Are you interested in warnings from VS2017? I was able to build tesseract with cmake, cppan and VS2017.

stweil commented 6 years ago

Are those warnings the same as the warnings from the Appveyor CI build? And did you compile using Visual Studio Community? One of my colleagues might be interested, as he does more programming with Tesseract on Windows. I'm more focused on Linux and only look at macOS and Windows from time to time.

zdenop commented 6 years ago

I just checked them and they seem to be the same.

amitdo commented 6 years ago

4.00-alpha was 'released' in November 2016.

I think we should release a final 4.0.0 soon.

@stweil, is it fine with you if we decide on releasing 4.0.0-rc.1 on May 15? After rc.1, no new features should go to the 4.0.x branch, only bug fixes.

4.0.0 (final) will be released 2-6 weeks after rc.1.

Shreeshrii commented 6 years ago

@jbreiden A number of training-related issues are caused by the lack of updated langdata. Ray mentioned a few days ago that the files are available in the Google repo and could be transferred after deleting extra files.

Is there any update on that?

I think the final release should include updated langdata also.

jbreiden commented 6 years ago

@Shreeshrii Can you point me at Ray's comment please?

Shreeshrii commented 6 years ago

https://github.com/tesseract-ocr/langdata/issues/83#comment-374460335

Shreeshrii commented 6 years ago

theraysmith commented 23 days ago:

> Hmm. Sorry. I thought I had done this in September. The Google repo is up-to-date apart from the redundant files that need to be deleted. I'll work with Jeff to get this done.

stweil commented 6 years ago

This issue is fine for discussions, but the overview gets a little bit lost. Therefore I just started a new page for the release planning (https://github.com/tesseract-ocr/tesseract/wiki/Planning) in the Tesseract wiki. Comments and contributions are welcome!

Shreeshrii commented 6 years ago

@stweil Thanks for adding the planning page. It is much easier to see the open tasks and plans on it.

Shreeshrii commented 6 years ago

Adding some more issues below which could be fixed for 4.0.0

stweil commented 6 years ago

Not to forget the endianness issue (see #518, #1525). For Linux distributions, the current status (big endian Tesseract 4.0 crashes) is not acceptable.

Update: The endianness issue is fixed now.
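For anyone who wants to check whether a given machine falls into the affected group, the byte order of the host can be determined with standard tools. This is only a host check under the assumption that `od` and `tr` are available; it does not reproduce the Tesseract crash, and `host_byte_order` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "little" or "big" depending on the byte order of the host.
# The two bytes 0x01 0x00 are read back as one 16-bit unsigned integer:
# 1 on a little endian host, 256 on a big endian host.
host_byte_order() {
  value=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
  case "$value" in
    1)   echo little ;;
    256) echo big ;;
    *)   echo unknown ;;
  esac
}
```

Calling `host_byte_order` on a typical x86 or ARM Linux system prints `little`; big endian builds of Linux distributions are the ones the comment above is concerned with.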

amitdo commented 6 years ago

@stweil, what should be our next step?

What about a timeline?