Closed stweil closed 6 years ago
Add an option to select the implementation for the dot product (CPU, SSE, AVX, ...).
SSE and AVX are also done on CPU :)
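To make the proposal a bit more concrete, here is a rough sketch of the selection logic such an option might follow. Everything in it is an assumption for illustration: the function name `select_dotproduct`, the values `auto`, `generic`, `sse` and `avx`, and the idea that the SSE variant needs the `sse4_1` CPU flag. It is not how Tesseract does it.

```bash
#!/bin/bash
# Hypothetical helper, not part of Tesseract.
# $1 = requested implementation: auto, generic, sse or avx (assumed names)
# $2 = space-separated CPU feature flags, e.g. the "flags" line of /proc/cpuinfo
select_dotproduct() {
  local requested="${1:-auto}"
  local flags=" ${2:-} "   # pad with spaces so whole words can be matched
  local has_sse=no has_avx=no

  # Assumption: the SSE variant needs SSE4.1, the AVX variant needs AVX.
  case "$flags" in *" sse4_1 "*) has_sse=yes ;; esac
  case "$flags" in *" avx "*) has_avx=yes ;; esac

  case "$requested" in
    auto)
      # Prefer the widest instruction set that the CPU reports.
      if [ "$has_avx" = yes ]; then echo avx
      elif [ "$has_sse" = yes ]; then echo sse
      else echo generic
      fi ;;
    generic)
      echo generic ;;
    sse)
      if [ "$has_sse" = yes ]; then echo sse
      else echo "sse requested but not supported" >&2; return 1
      fi ;;
    avx)
      if [ "$has_avx" = yes ]; then echo avx
      else echo "avx requested but not supported" >&2; return 1
      fi ;;
    *)
      echo "unknown implementation: $requested" >&2; return 1 ;;
  esac
}
```

On Linux it could be tried with `select_dotproduct auto "$(grep -m1 '^flags' /proc/cpuinfo)"`. Rejecting an explicit request that the CPU cannot satisfy, instead of silently falling back, is a design choice of this sketch only.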
Remove deprecated code. This does not include OpenCL or the old Tesseract engine.
Adding a compile option NO_LEGACY_OCR_ENGINE would be nice.
I'll do it.
Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.
My suggestion would be to leave --list-langs as is, and add this as --list-langs-details, or as --list-lang-details for one language file based on lang-code.
--list-langs should also display the directory it is using. This is useful when tessdata files are installed in multiple directories, e.g. by a PPA or Linux distribution vs. in a local build directory.
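Until something like this exists in the command itself, a small wrapper can show the idea. This is only a sketch: `list_langs_in_dir` is a made-up name, it assumes that TESSDATA_PREFIX points directly at the tessdata directory, and it only looks at `*.traineddata` files at the top level of that directory (no subdirectories, no parsing of the files).

```bash
#!/bin/bash
# Hypothetical helper, not part of Tesseract: print the tessdata directory
# in use, followed by the languages found in it.
# $1 = tessdata directory (assumed default: $TESSDATA_PREFIX, else ".")
list_langs_in_dir() {
  local dir="${1:-${TESSDATA_PREFIX:-.}}"
  local f

  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi

  echo "tessdata directory: $dir"
  for f in "$dir"/*.traineddata; do
    [ -e "$f" ] || continue          # the glob matched nothing
    basename "$f" .traineddata       # eng.traineddata -> eng
  done
}
```

Running `list_langs_in_dir ../tessdata_fast` and `list_langs_in_dir ../tessdata_best` side by side makes it obvious which directory provides which files.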
Re: tessdata: configs, tessconfigs and pdf.ttf are needed in the directory that is being used via TESSDATA_PREFIX or --tessdata-dir.
E.g. when doing LSTM training, the lstm.train config file is not found if one uses tessdata_best as the continue_from dir.
My workaround has been to copy these to both tessdata_fast and tessdata_best repos.
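The copy step of that workaround could be scripted roughly like this. It is a sketch, not an official tool: `copy_tessdata_support` is a hypothetical name, and it assumes the source directory already contains `configs`, `tessconfigs` and `pdf.ttf`.

```bash
#!/bin/bash
# Hypothetical helper: copy the support files (not the models) from one
# tessdata directory into another. Existing *.traineddata files are not touched.
# $1 = source tessdata directory, $2 = destination tessdata directory
copy_tessdata_support() {
  local src="$1" dst="$2" item

  if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
    echo "usage: copy_tessdata_support SRC_DIR DST_DIR" >&2
    return 1
  fi

  for item in configs tessconfigs pdf.ttf; do
    if [ -e "$src/$item" ]; then
      cp -R "$src/$item" "$dst/"     # -R so the two directories are copied whole
    else
      echo "skipping missing $src/$item" >&2
    fi
  done
}
```

For example, `copy_tessdata_support ../tessdata ../tessdata_best` followed by the same call for `../tessdata_fast`, assuming those checkouts sit next to each other.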
Add/implement install-langs.
A week with no API changes.
Add a simple bash script for building tesseract.
I use the following. It should probably also add commands that offer to download the osd and eng traineddata files for first-time users.
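#!/bin/bash
# Build and install Tesseract
./autogen.sh
./configure --disable-openmp --disable-graphics --disable-opencl
make
sudo make install
sudo ldconfig
# Build and install the training tools
make training
sudo make training-install
# Fetch a fresh googletest submodule and regenerate the build files for the unit tests
rm -rf ./googletest
git submodule update --init
autoreconf -fiv
# Point the unit tests at a tessdata directory
#export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
export TESSDATA_PREFIX=../tessdata_fast
make check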
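For the "offer to download osd and eng" part, something along these lines might do. It is a sketch under assumptions: the helper names are made up, and the download URL (the tessdata_fast repository on GitHub, branch `main`) is my guess at a sensible default that can be overridden through TESSDATA_URL_BASE.

```bash
#!/bin/bash
# Assumed download location; override TESSDATA_URL_BASE if the branch or repo differs.
TESSDATA_URL_BASE="${TESSDATA_URL_BASE:-https://github.com/tesseract-ocr/tessdata_fast/raw/main}"

# Hypothetical helper: build the download URL for one language code.
traineddata_url() {
  echo "$TESSDATA_URL_BASE/$1.traineddata"
}

# Hypothetical helper: for each language that is missing in the target
# directory, ask the user before downloading it.
# $1 = tessdata directory, remaining arguments = language codes
offer_traineddata() {
  local dir="$1" lang answer
  shift
  for lang in "$@"; do
    if [ -f "$dir/$lang.traineddata" ]; then
      continue                        # already installed, nothing to ask
    fi
    printf 'Download %s.traineddata into %s? [y/N] ' "$lang" "$dir"
    read -r answer
    case "$answer" in
      y|Y) curl -fL -o "$dir/$lang.traineddata" "$(traineddata_url "$lang")" ;;
    esac
  done
}
```

At the end of the build script this would be called as `offer_traineddata "$TESSDATA_PREFIX" eng osd`. Anything other than `y` or `Y` is treated as "no", so running it unattended does not download anything.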
I would add this:
A week with no API changes.
Mission impossible.
Edit: That was a joke.
There was an online tool that monitors API changes for Tesseract, but I cannot find a link to it. Does somebody have it? Can somebody show the changes between 4.0.beta1 and the current code?
Please see https://github.com/tesseract-ocr/tesseract/issues/793
The tracker is at https://abi-laboratory.pro/tracker/timeline/tesseract/ Currently it is tracking stable release 3.05.01
@zdenop Please tag another release for 3.05 branch since 3.05.01 had a couple of problems which have been fixed in later commits.
~The good news is that the latest Debian / Ubuntu tesseract-ocr does not include the development files, so there will not be any API between that version and the future 4.0.0 which we have to take care of.~
Sorry, I was wrong: there is libtesseract-dev.
@zdenop I suggest adding labels to issues with the following proposed list of keywords, so that it is easy to see related issues and see if there are any critical pending issues.
4.0.0 for the final release
4.0x for 4.00.00alpha and 4.0.0-beta.1
3.0x for 3.05/3.04
LSTM training
training for 3.0x legacy tesseract training
Accuracy for reports of incorrect recognition
Performance for questions related to speed
Crashes for asserts and program crashes
Build related to compile and build from source
This is a suggested list.
IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.
A new branch should be created for 4.0.0. Only commits that follow the above rules should be backported from master. 4.0.0 should have at least rc.1 before final release.
We can decide that 4.1.0 will be released 2-3 months after 4.0.0 (still with legacy?).
How do you define "significantly"? There are some changes with the latest Git master:
Type definitions (inT32, ...) and macros (MIN_INT32, ...) were removed.
Would you suggest reverting these changes? They are major changes which require a step of the major version, so I think 4.0.0 is a good candidate to include those changes. Otherwise we would have to wait for 5.0.0.
I would even go further and fix potential name space problems with the 58 include files which are part of the Tesseract programming API in 4.0.0-beta.1, although that is a significant change, too.
How do you define "significantly"?
Basically, any bug fix is OK as long as it follows the 2 conditions I specified. No new features.
What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
I think our aim should be to get all significant changes included in final 4.0.0 and get it ready in time for Ubuntu 18.10. What are the deadlines for that?
18.04 is much more significant because it's LTS - supported for 5 years. 18.10 will be supported for only 9 months. We should not care about it.
What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
We tagged it as 4.0.0-beta.1.
Another option is to skip final 4.0.0 and go straight to 5.0.0.
As per Jeff, we can't make any changes to what is shipped for 18.04.
But we still have time to do another beta, rc-1 and final 4.0.0 release in time for 18.10.
I do not really know much about Linux releases, but my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.
@AlexanderP please explain whether the above is possible.
@zdenop, your thoughts about these two options?
What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
We tagged it as 4.0.0-beta.1.
Yes, that tag is within github.
Please see the post by Jeff, where he has shown what tesseract -v will report for 18.04.
Here is the link:
https://github.com/tesseract-ocr/tesseract/issues/995#comment-369704920
Jeff just said that the version in Ubuntu won't change in final 18.04.
We are talking about what we want to do in Tesseract's official GitHub repo. We are the upstream, not Ubuntu!
IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.
I am trying to understand how 4.0.0 final release on github relates to Ubuntu 18.04, in light of the above.
I am missing your reasoning for why it should not significantly diverge.
I want to hear @zdenop's and @jbreiden's opinions.
I think that as maintainers, they will understand (but not necessarily agree with) my proposal.
First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian... If yes, then we should release 4.0 ASAP (e.g. fixes for issues will be accepted, but no other code changes).
Next, I would like to see a report like this to better understand the latest changes.
Then we can decide how 4.0 will be released:
I do not expect to revert any commit in master.
as branch started from 4.0.0-beta.1 tag (no changes in master branch - only fixes will be ported to 4.0 release branch)
I do not expect to revert any commit in master.
Yes, what you wrote here is what I meant.
As per Jeff, we can't make any changes to what is shipped for 18.04.
But we still have time to do another beta, rc-1 and final 4.0.0 release in time for 18.10.
I do not really know much about Linux releases, but my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.
@AlexanderP please explain whether the above is possible.
@Shreeshrii The update should go through without problems.
Please don't worry too much about Ubuntu, everything is going to be fine. I've had a crazy day today, but will have time tomorrow to discuss.
First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian...
The version of Tesseract that ships with Ubuntu 18.04 will not change, unless there is a major security issue. See this chart for shipping Tesseract versions for different Ubuntu releases. https://launchpad.net/ubuntu/+source/tesseract
my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.
Ubuntu users have many choices if they want a newer Tesseract. They can build from source. They can install from Alexander's PPA. There's something called a "snap" which I don't know too much about. Maybe other ways too.
Shipping alpha/beta software in final LTS was/is a really bad idea. I bet it's against Ubuntu's policies.
This decision belongs to the Debian/Ubuntu package maintainers, which is Alexander and myself. I am a member of the Debian Project, and sponsored Alexander's excellent packaging work as official. I thought users would significantly benefit from the improved accuracy of LSTM Tesseract. I think (and hope) most developers will understand that the Tesseract API is still changing, and not have too much trouble.
We are the upstream, not Ubuntu!
That's right! Don't feel constrained. It is perfectly okay for Tesseract to change API before final release. If the API changes, Ubuntu and other Linux distributions will deal with it, and it won't be too hard. For example, in Ubuntu, the only direct dependencies on libtesseract4 are gimagereader, libavfilter-extra6, libopenalpr2, libopencv-contrib3.2, and libsikulixapi-jni. These programs use just a tiny fraction of Tesseract's API. It will be up to Alexander and myself to make sure everything continues to work well together in Debian/Ubuntu both now and in the future.
Alexander and Jeff, I'll support you where needed, too, of course.
Jeff, Alexander, I’m sorry that I caused offense.
@amitdo No offense taken. We are all on the same team.
@stweil: Are you interested in warnings from VS2017? I was able to build tesseract with cmake, cppan and VS2017.
Are those warnings the same as the warnings from the Appveyor CI build? And did you compile using Visual Studio Community? One of my colleagues might be interested, as he does more programming with Tesseract on Windows. I'm more focused on Linux and only look at macOS and Windows from time to time.
I just checked them, and they seem to be the same.
4.00-alpha was 'released' in November 2016.
I think we should release a final 4.0.0 soon.
@stweil, is it fine with you if we decide on releasing 4.0.0-rc.1 on May 15? After rc.1, no new features should go to the 4.0.x branch, only bug fixes.
4.0.0 (final) will be released 2-6 weeks after rc.1.
@jbreiden A number of training-related issues are caused by the lack of updated langdata. Ray mentioned a few days back that the files are available in the Google repo and could be transferred after deleting extra files.
Is there any update on that?
I think the final release should include updated langdata also.
@Shreeshrii Can you point me at Ray's comment please?
theraysmith commented 23 days ago Hmm. Sorry. I thought I had done this in September. The Google repo is up-to-date apart from the redundant files that need to be deleted. I'll work with Jeff to get this done.
This issue is fine for discussions, but the overview gets a little bit lost. Therefore I just started a new page for the release planning (https://github.com/tesseract-ocr/tesseract/wiki/Planning) in the Tesseract wiki. Comments and contributions are welcome!
@stweil Thanks for adding the planning page. It is much easier to see the open tasks and plans on it
Adding some more issues below which could be fixed for 4.0.0
Not to forget the endianness issue (see #518, #1525). For Linux distributions, the current status (big endian Tesseract 4.0 crashes) is not acceptable.
Update: The endianness issue is fixed now.
@stweil, what should be our next step?
What about a timeline?
I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.
These tasks are on my own list and to be discussed whether we consider them important for the new release or not:
--version parameter for all command line commands.
Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.