tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tag a new version for LSTM 4.0 #995

Closed Shreeshrii closed 6 years ago

Shreeshrii commented 7 years ago

Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

@zdenop Please add a new tag eg. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

stweil commented 7 years ago

It would be good to decide about using semantic versioning soon. Maybe it can be used for the next tag.

Shreeshrii commented 7 years ago

I have not seen any comments against semver.

It may be good to set up some kind of automatic update that increases the PATCH version based on commit numbers, to reduce manual administrative updates.

@stweil From what I have read about semver, if you were to implement the zipped traineddata and related changes, it should cause a change in MINOR version.

So, with that should it be 4.1.0alpha ?

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,

MINOR version when you add functionality in a backwards-compatible manner, and

PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

egorpugin commented 7 years ago

The first version 4 release will be 4.0.0. What 4.1.0alpha are you talking about? We don't care about changes in dev branches.

stweil commented 7 years ago

We could tag the current release as a pre-release or as a release candidate. According to semver.org, it could be called something like 4.0.0-rc.1 (that's how semver.org named its own releases), 4.0.0-beta.1 or 4.0.0-beta.20170619.
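To make the candidate names concrete, here is a minimal Python sketch of the precedence rules from semver.org applied to the tags floated in this thread. The helper name `precedence_key` is hypothetical, the parser does no validation, and build metadata after `+` is simply dropped.

```python
def precedence_key(version):
    """Return a sort key that follows SemVer precedence (illustrative only)."""
    version = version.split("+", 1)[0]          # build metadata never affects precedence
    core, _, pre = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not pre:
        # A normal release outranks every pre-release of the same core version.
        return (major, minor, patch, 1, ())
    identifiers = []
    for ident in pre.split("."):
        if ident.isdigit():
            identifiers.append((0, int(ident), ""))   # numeric ids sort below alphanumeric ids
        else:
            identifiers.append((1, 0, ident))         # alphanumeric ids compare as ASCII text
    return (major, minor, patch, 0, tuple(identifiers))


tags = [
    "4.0.0",
    "4.0.0-rc.1",
    "4.0.0-beta.20170619",
    "4.0.0-alpha.20170711",
    "4.0.0-beta.1",
]
for tag in sorted(tags, key=precedence_key):
    print(tag)
```

With these rules a date used as a single numeric identifier sorts chronologically within the same stage, and every `alpha`, `beta` or `rc` tag sorts below the final 4.0.0.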

Shreeshrii commented 7 years ago

We don't care about changes in dev branches.

OK.

Still, it will be good to have new tags when changes are substantial enough from previous commits. For example,

That said, I have only done some cursory reading regarding semver. So, I am happy with whatever tag/version is used, as long as there is some demarcation.

The reason for asking is that people are using, or trying to use, the master branch (4.0/LSTM) and asking questions, but the version info only says -alpha or -dev, so it is difficult to figure out what the issue is without knowing which version is in use.
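As an illustration of why a recent tag helps with bug reports: `git describe` names a commit as the nearest tag, the number of commits since that tag, and an abbreviated hash. The sketch below only parses such a string. The function name and the sample value are made up for illustration and are not taken from the repository.

```python
import re
from typing import NamedTuple, Optional


class Described(NamedTuple):
    tag: str
    commits_since: int
    sha: Optional[str]


# Shape of `git describe` output when the commit is not exactly on a tag:
# <tag>-<commits since tag>-g<abbreviated hash>
_DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text):
    """Split a `git describe` string into tag, distance and hash."""
    text = text.strip()
    match = _DESCRIBE.match(text)
    if match is None:
        # The commit sits exactly on a tag, so there is no suffix.
        return Described(text, 0, None)
    return Described(match.group("tag"), int(match.group("count")), match.group("sha"))


# Hypothetical value: 366 commits after the old alpha tag.
print(parse_describe("4.00.00alpha-366-g1a2b3c4"))
```

A large commit count against an old tag is exactly the situation described above: the tag alone says little about which code a reporter is running.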

Shreeshrii commented 7 years ago

I vote for this format which includes date - easy to identify which version is more recent.

4.0.0-beta.20170619

Shreeshrii commented 7 years ago

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/V1tyGHIenbI/SUVuXheJAwAJ

An example of how 4.00.00alpha is NOT compatible with the current master branch, e.g. the --oem options.

amitdo commented 7 years ago

@theraysmith, can you give us an update on your work? When are we going to see it?

WilliamTambellini commented 7 years ago

Hi, same: can you give us an update on your work? When are we going to see 4.0 released?

amitdo commented 7 years ago

+1 for a new tag.

Since Ray does not reply, I suggest to still use 'alpha'.

4.0.0-alpha.YYYYMMDD

amitdo commented 7 years ago

@zdenop, can you do it, or at least add your comment here?

theraysmith commented 7 years ago

I'm about ready to update the traineddatas. I have a training run almost complete, and with accuracy that meets with my satisfaction. There are a few regressions, but not too serious. First though, I have to get some code reviewed in Google, and then make some commits to github to match the new traineddatas. Before that, there is the matter of a major pull...

Here's what's coming:

I have other stuff that is still incomplete, but that is a good list for now.

BTW, in case you hadn't noticed, there was a breaking change that made old lstmf files unusable. That was needed to fix LSTM for OSD. It has to know the language of each training sample. The new traineddatas will mostly be smaller than the older ones, as they won't contain the legacy components, and no bigram dawgs are needed.

WilliamTambellini commented 7 years ago

Superb. Anything we could do to help you ? Cheers.

Shreeshrii commented 7 years ago

@theraysmith Thanks for the update. Look forward to it. Any estimate of expected date?

@zdenop I think this is a good reason to freeze the 'alpha' state by tagging the repo with the current version as 4.0.0-alpha.YYYYMMDD, since Ray is going to be making major changes.

stweil commented 7 years ago

I'm about ready to update the traineddatas.

That's good news.

The above change makes open source training impossible.

If I got that right, it would be horrible. Being able to create new traineddata is essential for me.

zdenop commented 7 years ago

@Shreeshrii: I do not understand what you want. A tag will not freeze anything. A tag is just a specific point in history that marks something important (e.g. a new version). Tagging should be driven by a developer who knows the roadmap, not by users...

Shreeshrii commented 7 years ago

@zdenop

A tag is just a specific point in history that marks something important (e.g. a new version).

Exactly my point :-)

When Ray makes his next set of commits, that will change the codebase as well as traineddata substantially. I am sure it will be tagged by Ray at that time, probably as a beta or release candidate.

My request to you to tag current commit (as an example) is to mark a point in history where a lot of development has taken place since the original 4.00.00alpha tag. In fact, that original tag just marked the start of the 4.00.00alpha development and many bugs in that original tag (missing lstm.train file etc.) have been fixed later.

Also, if the new changes by Ray will not allow for open source training :-( then the current github version will be the one which allows users to do their own training. So, it is certainly deserving of a tag in my opinion :-)

theraysmith commented 7 years ago

Open source training: OK, I overstated it a bit. One of my commits will temporarily break the training process. After doing so, I will correct the documentation and add the new tool (which I have already written) as quickly as possible after.

To help: No more breaking commits! If it doesn't produce perfect results on phototest, it broke something! Cutting down on the code cleanup while I am working on it will also help. When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there. Since I plan to commit new copies of the training data (unicharsets, wordlists, training text etc) then at that point they will match

Dates: I was going to get started this week, but now I have to debug my pull from github, which has broken tests (of the legacy engine), so that will take time to fix. I'm hoping it's simple, but it is bizarre. Even when it is fixed, there are 1500 lines of change from github for someone here to review. I really want to get 4.00 finished (in beta) in the next 5-6 weeks.
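The "perfect results on phototest" rule can be turned into a simple regression check. This is a minimal sketch, assuming the OCR output and a reference transcription are already available as strings. How they are produced (for example by running tesseract on the phototest image) is left out, and the function name is hypothetical.

```python
import difflib
from typing import List


def ocr_regressions(expected: str, actual: str) -> List[str]:
    """Return unified-diff lines between reference text and OCR output.

    An empty list means the output matches, ignoring trailing whitespace.
    """
    expected_lines = [line.rstrip() for line in expected.strip().splitlines()]
    actual_lines = [line.rstrip() for line in actual.strip().splitlines()]
    return list(
        difflib.unified_diff(
            expected_lines, actual_lines, "expected", "actual", n=0, lineterm=""
        )
    )


diff = ocr_regressions("The quick\nbrown fox\n", "The qu1ck\nbrown fox")
if diff:
    print("\n".join(diff))
```

Any non-empty diff after a commit would count as "it broke something" under the rule above.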

Shreeshrii commented 7 years ago

When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there.

What kind of expertise do you need regarding the Indic scripts?

theraysmith commented 7 years ago

The code determines what makes a valid/invalid sequence of unicodes in the script, for instance, is it allowed to have two matras in a row? It gets more difficult with questions over what category the additional characters are.

Shreeshrii commented 7 years ago

No, it is not valid to have any two matras in a row - Devanagari 093E-094C.

However, these can be followed by Anusvar, Chandrabindu or Visarga i.e. 0901-0903

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg. 0951, 0952, 1CDA etc

However, I have seen samples in legacy fonts where a number of separate matras are used to create another one eg. using unicode points as example 093E followed by 0947 to create 094b - ा े to make ो

Similarly, in legacy fonts, half letters (a letter followed by virama) may be followed by aa maatraa to create the complete letter in cases such as ga, sha etc., i.e. 0936 + 094D + 093E to create 0936 for sha.

It is possible that some converters from legacy font to unicode retain these errors.

Also, in case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly use matra, vedic accent and combining mark which will lead to dotted circle. eg. अंशाः॑ vs अंशा॑ः
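The Devanagari ordering rules described in this comment can be sketched as a small checker. This is only an illustration of the rules as stated here, not the validation code Ray refers to. The function names are hypothetical, and the Vedic accent set contains just the three code points given as examples above.

```python
MATRAS = range(0x093E, 0x094D)            # U+093E..U+094C, dependent vowel signs
NASAL_OR_VISARGA = range(0x0901, 0x0904)  # U+0901..U+0903
VEDIC_ACCENTS = {0x0951, 0x0952, 0x1CDA}  # examples only, not a complete list


def rank(ch):
    """Position of a mark in the expected order; 0 for any other character."""
    cp = ord(ch)
    if cp in MATRAS:
        return 1
    if cp in NASAL_OR_VISARGA:
        return 2
    if cp in VEDIC_ACCENTS:
        return 3
    return 0


def find_violations(text):
    """Return (index, reason) pairs for marks that break the stated order."""
    problems = []
    last = 0
    for i, ch in enumerate(text):
        r = rank(ch)
        if r == 0:
            last = 0            # a base letter or other character starts a new run
            continue
        if r == 1 and last == 1:
            problems.append((i, "two matras in a row"))
        elif r < last:
            problems.append((i, "mark out of order"))
        last = r
    return problems


correct = "\u0905\u0902\u0936\u093e\u0903\u0951"    # matra, visarga, Vedic accent
incorrect = "\u0905\u0902\u0936\u093e\u0951\u0903"  # matra, Vedic accent, visarga
print(find_violations(correct))
print(find_violations(incorrect))
```

The two test strings are the pair quoted above, written as escapes so the order of the code points is visible. Text converted from legacy fonts, such as 093E followed by 0947, is reported as two matras in a row.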

For a sample of Vedic Sanskrit and its ground truth, see https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.tif https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.txt

Will your new sanskrit traineddata be able to OCR this?

amitdo commented 7 years ago

The new traineddatas will mostly be smaller than the older ones, as they won't contain the legacy components, and no bigram dawgs are needed.

Will you remove the code of the legacy engine in this round?

theraysmith commented 7 years ago

On Wed, Jul 12, 2017 at 9:39 PM, Shreeshrii notifications@github.com wrote:

No, it is not valid to have any two matras in a row - Devanagari 093E-094C.

However, these can be followed by Anusvar, Chandrabindu or Visarga i.e. 0901-0903

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg. 0951, 0952, 1CDA etc

However, I have seen samples in legacy fonts where a number of separate matras are used to create another one eg. 093E followed by 0947 to create 094b

These are specifically dis-allowed by unicode, but the rules seem to be very script-specific, and not very consistently documented in the unicode standard. I don't think the rules are addressed properly for all scripts.

Similarly in legacy fonts, half letters (letter followed by virama) may be followed by aa maatraa to create the complete letter in cases such as ga, sha etc. i.e. 0936 + 094D + 093E to create 0936 for sha

It is possible that some converters from legacy font to unicode retain these errors.

Also, in case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly use matra, vedic accent and combining mark which will lead to dotted circle. eg. अंशाः॑ vs अंशा॑ः

The code aims to dis-allow text designed for such legacy fonts. The documentation that I have found is very good for Devanagari, but lacking for some of the other scripts. For instance, there is a big table in the unicode standard for Myanmar, ( http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover any of the extension Myanmar characters, and isn't explicit about whether the table represents a specific valid order or not. The existence of a lot of legacy Myanmar text on the web that is designed for non-compliant fonts doesn't help make it easier to determine whether the filter is correct.

theraysmith commented 7 years ago

Removing the legacy engine code is still an open question. I have limited time to spend on it (and am therefore resistant to delaying tactics such as changing types in the dead code to POSIX). Whether enough uses of Tesseract can be covered by the new engine is still being debated, and the new models that I have need to be evaluated before enough of the community is convinced. I accept the requirement to add one or more new characters without the need for full retraining, and will not delete the legacy code until that need is addressed. (I think it can be done.) The legacy code is also used by the OSD model, so deleting it is blocked until there is a good enough replacement.

Shreeshrii commented 7 years ago

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

That does not sound right. Please see https://en.wikipedia.org/wiki/Malayalam_script#Anusvaram

I did a search on ംം (two anusvarams in malayalam script) and most of them show up in the search result in pdfs.

FYI, PDFs created from documents with text in Unicode fonts for complex scripts do not store the Unicode text correctly. Devanagari text copied from these PDFs is not correct; I assume the same holds for Malayalam and other Indian scripts, and that might be causing this double anusvara problem.

Newer PDFs created in a special manner, e.g. with 'actual text' using xelatex, are OK (e.g. http://sanskritdocuments.org/doc_devii/annapurna.pdf), but those created by various other software are not (http://www.sanskritweb.net/sansdocs/nala-d.pdf).

@jbreiden can give you the technical reasoning for this.

Google search does show pdfs as part of the search results, so there is some internal OCR (is it tesseract???) being done on the pdfs, books etc as part of the search process. But it may not be fully correct.

So for the corpus for training, I would suggest to avoid text taken from pdfs (in case it is being used).

Shreeshrii commented 7 years ago

@theraysmith Regarding Malayalam, double anusvara

Please see http://unicode.org/charts/PDF/U0D00.pdf http://www.alanwood.net/unicode/malayalam.html http://www.omniglot.com/language/numbers/malayalam.htm

zero in Malayalam script - pujyam looks very much like the sign for anusvar.

Also, there are different anusvars shown in unicode chart--

0D00 $ഀ MALAYALAM SIGN COMBINING ANUSVARA ABOVE
0D02 $ം MALAYALAM SIGN ANUSVARA
• used in Prakrit language texts to indicate gemination of the following consonant
0D3B $഻ MALAYALAM SIGN VERTICAL BAR VIRAMA
0D3C $഼ MALAYALAM SIGN CIRCULAR VIRAMA

I will look up more info and post under an issue in langdata

theraysmith commented 7 years ago

Direct from the unicode standard: Anusvara. The anusvara can be seen multiple times after vowels, whether independent letters or dependent vowel signs, as in ഈംംംം <0D08, 0D02, 0D02, 0D02, 0D02>. Vowel signs can also be seen after digits, as in 355ാം <0033, 0035, 0035, 0D3E, 0D02>. More generally, rendering engines should be prepared to handle Malayalam letters (including vowel letters), digits (both European and Malayalam), dashes, U+00A0 no-break space and U+25CC dotted circle as base characters for the Malayalam vowel signs, U+0D4D malayalam sign virama, U+0D02 malayalam sign anusvara, and U+0D03 malayalam sign visarga. They should also be prepared to handle multiple combining marks on those bases.

Is it wrong?
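For reference, the two sequences quoted from the standard can be inspected with Python's `unicodedata` module. This is a small sketch with hypothetical helper names. It only reports what the Unicode character database says about each code point and counts the combining marks at the end of a string.

```python
import unicodedata

SEQ_REPEATED_ANUSVARA = "\u0d08\u0d02\u0d02\u0d02\u0d02"   # <0D08, 0D02, 0D02, 0D02, 0D02>
SEQ_DIGITS_WITH_SIGNS = "355\u0d3e\u0d02"                  # <0033, 0035, 0035, 0D3E, 0D02>


def describe(text):
    """List (code point, general category, name) for every character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def trailing_marks(text):
    """Count the combining marks (categories Mn or Mc) at the end of text."""
    count = 0
    for ch in reversed(text):
        if unicodedata.category(ch) not in ("Mn", "Mc"):
            break
        count += 1
    return count


for row in describe(SEQ_REPEATED_ANUSVARA) + describe(SEQ_DIGITS_WITH_SIGNS):
    print(*row)
print(trailing_marks(SEQ_REPEATED_ANUSVARA), trailing_marks(SEQ_DIGITS_WITH_SIGNS))
```

A filter that allows at most one anusvara per base, or that only accepts letters as bases, would reject both of these sequences even though the standard describes them as occurring in practice.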

theraysmith commented 7 years ago

OK, I have pushed this week's changes:

Fixes to pull from github. There were bugs introduced and required code deleted. Also reformatted/modified according to Google code standards.

Major new normalization/text cleanup code in training/validat* The best help with this would be expertise in the various scripts, as previously discussed.

Deleted some code from the LSTM recognizer that was old and unused. (Backwards compatible change.)

Part 1 of the changes required to move the unicharset and recoder so they are stored in the traineddata and therefore accessible.

I have not searched through my emails to find the relevant issues to update them yet. The traineddatas and training source data are not yet updated. That is probably a while away yet, so the issue about the unicharset and recoder is not yet fully resolved anyway. The training process shouldn't be broken by these changes yet, I hope, but the documentation is no longer accurate. If you run a new training or incremental/fine tuning training, the new output files will be a traineddata directly, not an LSTM traineddata component. That output traineddata should contain some version string and separate lstm unicharset/recoder.

The next step is to change the lstmtraining program to accept a traineddata instead of a unicharset, and add a tool to generate the traineddata, then update the documentation to match.

theraysmith commented 7 years ago

Actually, I take that back. I don't think the output from --stop_training is different to what it was before. It is still an LSTM traineddata component.

Shreeshrii commented 7 years ago

Ray, You are right. Looks like Malayalam does have different rules, including repeated vowels.

Please see section 8.4.3 in http://thottingal.in/documents/Fontbook.pdf by @santhoshtr.

In samvruthokaram (◌ു്), virama is applied to a vowel sign.

Another exception is ാം. This combination of a long vowel sign and anusvara is used to denote "nth", as in 16ാം or 16-ാം, meaning 16th.

Repeated vowel signs are used to denote elongation of a vowel's pronunciation.

Requesting Santhosh Thottingal @santhoshtr to comment on multiple anusvaras.

Shreeshrii commented 7 years ago

See https://github.com/tesseract-ocr/langdata/issues/35#issuecomment-320330996

for Ray's comments about next set of changes

Shreeshrii commented 7 years ago

See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-dev/_s0TOmDlEAs/uRJ-Ozi8AAAJ for updates. Messages from Jeff Breidenbach:


Aug 28, 2017

Alexander Pozdnyakov has done a really good job packing Tesseract in his Personal Package Archive (PPA). I think it is getting to be time for wider usage, so I'm working with him to promote these to official packages. First step is Debian Experimental. That's a good place to work out problems, and hopefully something can be ready for real users within a few weeks.


Sep 7, 2017

We will have three sets of .traineddata files on GitHub in three separate repositories. Most users will want LSTM Fast and that is what will be shipped as part of Linux distributions. LSTM Best is for people willing to trade a lot of speed for slightly better accuracy. It is also better for certain retraining scenarios for advanced users. The third set is for the legacy recognizer.


Sep 15, 2017

Populated the new repositories, and removed the LSTM files from tessdata. I'm sure documentation needs updating.

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/tessdata_best https://github.com/tesseract-ocr/tessdata_fast and https://github.com/tesseract-ocr/tessdata
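
As a rough illustration of how the three sets map to the three repositories, the sketch below returns the repository URL for a set name. The function name `tessdata_repo` and the labels `fast`, `best` and `legacy` are made up for this example; only the URLs come from the thread.

```shell
#!/bin/bash
# Print the GitHub repository that hosts a given traineddata set.
# The set names "fast", "best" and "legacy" are labels chosen for this sketch.
tessdata_repo() {
  case "$1" in
    fast)   echo "https://github.com/tesseract-ocr/tessdata_fast" ;;
    best)   echo "https://github.com/tesseract-ocr/tessdata_best" ;;
    legacy) echo "https://github.com/tesseract-ocr/tessdata" ;;
    *)      echo "unknown traineddata set: $1" >&2; return 1 ;;
  esac
}
```

A possible use is `git clone "$(tessdata_repo fast)"`, followed by pointing `tesseract --tessdata-dir` at the cloned directory.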

stweil commented 7 years ago

2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.

@theraysmith, thanks for providing "fast" now. Are you planning to release free documentation / tools for everybody to produce "fast" data? I noticed that apart from the LSTM model the rest of the traineddata files for "best" and "fast" are identical. Wouldn't it save space and make the handling easier if both variants were in the same traineddata container (this requires an option to select the desired one, of course) instead of having two parallel sets?

jbreiden commented 7 years ago

Sorry, I don't follow. Which parts are identical?

$ du -sh best fast
1.7G    best
657M    fast

amitdo commented 7 years ago

Jeff, see this comment: https://github.com/tesseract-ocr/tesseract/issues/1131#issuecomment-329764356

amitdo commented 7 years ago

1.7G best 657M fast

Jeff, playing with the numbers? :-) [He changed the numbers in his comment]

roozgar commented 7 years ago

But the sizes are very different and don't follow a consistent pattern:

tessdata_fast/eng.traineddata 3.9 MB
tessdata_best/eng.traineddata 14.7 MB

tessdata_fast/ara.traineddata 1.4 MB
tessdata_best/ara.traineddata 12 MB

What can affect the size of the fast traineddata?

stweil commented 7 years ago

@roozgar, it is possible to extract the parts of a traineddata file using combine_tessdata -u traineddata_file output_path_prefix. Usually the largest parts are the LSTM model and the word list, but not all languages have a huge word list like eng.traineddata or Latin.traineddata.
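
A minimal sketch of that inspection, assuming `combine_tessdata` is on the PATH and that `-u` writes one file per component named `<output_path_prefix><component>`. The helper names `rank_components` and `unpack_and_rank` are invented for this example.

```shell
#!/bin/bash
# List files whose names start with the given prefix, largest first,
# as "<bytes><TAB><path>" lines.
rank_components() {
  local prefix="$1" f
  for f in "${prefix}"*; do
    if [ -f "$f" ]; then
      printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    fi
  done | sort -rn
}

# Unpack a traineddata file, then rank its components by size.
unpack_and_rank() {
  local traineddata="$1" prefix="$2"
  combine_tessdata -u "$traineddata" "$prefix" >/dev/null || return 1
  rank_components "$prefix"
}
```

For example, `unpack_and_rank tessdata_fast/eng.traineddata /tmp/eng.` would show which components dominate the file size for that language.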

jbreiden commented 7 years ago

@amitdo I can't seem to write a single comment without editing it three times to fix mistakes. @roozgar Models using integer arithmetic (tessdata_fast) are smaller than ones using floating point.

amitdo commented 7 years ago

@amitdo I can't seem to write a single comment without editing it three times to fix mistakes.

LOL. It happens to me too. I keep discovering mistakes after I post a comment.

Shreeshrii commented 7 years ago

@stweil You had asked somewhere about tools for converting to fast/integer models... Can't find that comment to reply to. The training wiki has the answer ...

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line

stop_training bool false Convert the training checkpoint in --continue_from to a recognition model.
convert_to_int bool false With stop_training, convert to 8-bit integer for greater speed, with slightly less accuracy.
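
To make the two flags concrete, here is a small sketch that only assembles the command line and prints it. `--stop_training`, `--convert_to_int` and `--continue_from` are the flags quoted above; `--traineddata` and `--model_output` are assumed from the same wiki page, and the function name `convert_checkpoint_cmd` is hypothetical. Paths containing spaces are not handled.

```shell
#!/bin/bash
# Print the lstmtraining command that converts a checkpoint to a recognition
# model; pass "int" as the fourth argument to request the 8-bit integer form.
# Assumption: --traineddata and --model_output are taken from the training wiki.
convert_checkpoint_cmd() {
  local checkpoint="$1" base_traineddata="$2" output="$3" mode="${4:-float}"
  local cmd="lstmtraining --stop_training --continue_from $checkpoint --traineddata $base_traineddata --model_output $output"
  if [ "$mode" = "int" ]; then
    cmd="$cmd --convert_to_int"
  fi
  printf '%s\n' "$cmd"
}
```

Printing the command rather than running it lets the result be reviewed first, for example `convert_checkpoint_cmd my_checkpoint eng.traineddata eng_fast.traineddata int`.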

stweil commented 7 years ago

Thank you!

WilliamTambellini commented 7 years ago

Hi everybody, so what tasks remain before Tesseract 4.0.0 can be released?

Shreeshrii commented 6 years ago

Please see https://groups.google.com/d/msgid/tesseract-dev/2703d7a2-44e4-493c-a2fe-86891e2f0933%40googlegroups.com for comments from Jeff regarding the Debian and Ubuntu releases.

Shreeshrii commented 6 years ago

Copied part of msg from @jbreiden

"To give a small update, a Dec 15 git snapshot is now shipping as part of Debian Unstable and Debian Testing. I expect it to be part of Ubuntu 18.04 (releasing in April 2018) but it has not yet been integrated there. Thank you again to Alexander for doing 99% of the work with his PPA.

If I am reading these survey numbers right, Tesseract is installed on 8% of Debian systems, and executed recently on 2% of them. There are now 347 packages that depend on Tesseract, with 6 of them being direct dependencies.

https://qa.debian.org/popcon.php?package=tesseract

If anyone notices any problems with any of these packages, this is a very good time to speak up."

amitdo commented 6 years ago

$ tesseract --oem 0 phototest.tif - -
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

~~The right usage is: tesseract --oem 0 -l osd phototest.tif -~~

Edit: I confused 'oem' with 'psm'.

Shreeshrii commented 6 years ago

@amitdo The 'fast' and 'best' traineddata files do not contain legacy model. Hence --oem 0 and --oem 2 will not work.

#!/bin/bash
# Time one image against the three traineddata sets with the same --oem value.
# The first run below used --oem 1, the second --oem 2.
oem=2
tessdata_root=/mnt/c/Users/User/shree
for img_file in ./Cap*.png; do
  echo "**************************** ${img_file} oem ${oem}**********************************"
  time tesseract --tessdata-dir "${tessdata_root}/tessdata_best/" "${img_file}" "${img_file%.*}-eng-best" --oem "${oem}" --psm 6 -l eng
  time tesseract --tessdata-dir "${tessdata_root}/tessdata_fast/" "${img_file}" "${img_file%.*}-eng-fast" --oem "${oem}" --psm 6 -l eng
  time tesseract --tessdata-dir "${tessdata_root}/tessdata/" "${img_file}" "${img_file%.*}-eng" --oem "${oem}" --psm 6 -l eng
done

root@All-in-1-Touch:/mnt/c/Users/User/shree# bash ./tess.sh
**************************** ./Capture.png oem 1**********************************
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m4.469s
user    0m11.375s
sys     0m0.406s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m2.209s
user    0m3.797s
sys     0m0.234s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m3.785s
user    0m8.219s
sys     0m0.531s

root@All-in-1-Touch:/mnt/c/Users/User/shree# bash ./tess.sh
**************************** ./Capture.png oem 2**********************************
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

real    0m0.621s
user    0m0.078s
sys     0m0.297s
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

real    0m0.425s
user    0m0.031s
sys     0m0.125s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m3.772s
user    0m7.969s
sys     0m0.578s

Shreeshrii commented 6 years ago

@jbreiden

I was suggesting an error/warning message when --oem 0 or 2 are used with 'best' or 'fast' traineddata, which is converse of the following:

https://github.com/tesseract-ocr/tesseract/blob/dc8745e6fd4c6c070076c44565924faa0d0643a7/ccmain/tessedit.cpp#L187

https://github.com/tesseract-ocr/tesseract/blob/dc8745e6fd4c6c070076c44565924faa0d0643a7/ccmain/tessedit.cpp#L196

      tprintf("Error: LSTM requested, but not present!! Loading tesseract.\n");
      tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
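
Until the engine itself prints such a message, the converse check could be approximated outside Tesseract. The sketch below is not the proposed C++ change; it is a wrapper-side check that assumes the traineddata was first unpacked with `combine_tessdata -u`, that a component file named `inttemp` indicates a legacy model, and that one named `lstm` indicates an LSTM model. The function name `check_oem` is hypothetical.

```shell
#!/bin/bash
# Return 0 if the requested --oem value can work with the components that
# combine_tessdata -u unpacked under the given prefix, else print a warning.
# Assumption: "inttemp" marks a legacy model and "lstm" marks an LSTM model.
check_oem() {
  local oem="$1" prefix="$2" legacy=no lstm=no
  [ -f "${prefix}inttemp" ] && legacy=yes
  [ -f "${prefix}lstm" ] && lstm=yes
  case "$oem" in
    0) [ "$legacy" = yes ] && return 0 ;;
    1) [ "$lstm" = yes ] && return 0 ;;
    2) [ "$legacy" = yes ] && [ "$lstm" = yes ] && return 0 ;;
    3) { [ "$legacy" = yes ] || [ "$lstm" = yes ]; } && return 0 ;;
  esac
  echo "Warning: --oem $oem is not supported by these traineddata components (legacy=$legacy, lstm=$lstm)" >&2
  return 1
}
```

With a 'fast' or 'best' file unpacked to `/tmp/eng.`, `check_oem 2 /tmp/eng.` would be expected to warn, while `check_oem 1 /tmp/eng.` would pass.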

amitdo commented 6 years ago

The 'fast' and 'best' traineddata files do not contain legacy model. Hence --oem 0 and --oem 2 will not work.

I know, I even added this info to the wiki a while ago.

Still,

tesseract --psm 0 -l osd phototest.tif -

should work.