tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Wordlists and training texts contain lots of errors #1

Open stweil opened 6 years ago

stweil commented 6 years ago

A short test with codespell (which only finds the most common typos for English) found more than 1000 errors in eng.wordlist.
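
For reference, a count like that can be reproduced by running codespell directly on the wordlist and counting the reported lines. A minimal sketch, assuming codespell is installed and eng.wordlist is in the current directory:

```python
# Count codespell findings in a wordlist (sketch; assumes codespell is installed
# and eng.wordlist is in the current directory).
import subprocess

result = subprocess.run(
    ["codespell", "eng.wordlist"],   # codespell accepts plain text files as arguments
    capture_output=True,
    text=True,
)
# codespell reports one "file:line: word ==> suggestion" line per finding
findings = [line for line in result.stdout.splitlines() if "==>" in line]
print(f"{len(findings)} suspicious entries reported by codespell")
```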

The German wordlist deu.wordlist contains the well-known B / ß confusion and also other errors.

The training texts also contain similar errors. In addition, I noticed many foreign (Turkish?) words in the German text.

Are such errors critical for the trained model which is based on that data?

amitdo commented 5 years ago

The word lists and training texts were generated using a web crawler. Some filtering was done as a post-processing step.

So the undesirable effects you mentioned are to be expected.

stweil commented 5 years ago

Using a web crawler on German texts will normally not find words like "drauBen" (instead of "draußen"), unless you crawl OCR results which were made with English language settings. It looks like Ray crawled Google Books. What happens if Google learns from Google? At some point there will be lots of evidence that "drauBen" is correct. :-) Searching for "drauBen" (with Google Search, of course) already finds texts outside of Google Books, but those were maybe generated by Google Translate.

So using a web crawler is fine as long as it only crawls more reliable content (German text corpora, German Wikipedia, German newspapers, German books from Wikisource or Project Gutenberg, ...).
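
This particular class of error can be surfaced without a full spellcheck by a heuristic scan for a capital B sandwiched between lowercase letters, which is almost never correct in German words. A rough sketch; the file name and the regex are assumptions:

```python
# Heuristic scan for "drauBen"-style B/ß confusions in a German wordlist
# (sketch; the file name is an assumption).
import re

# A capital B between two lowercase letters is almost never correct in German.
suspicious = re.compile(r"[a-zäöüß]B[a-zäöü]")

with open("deu.wordlist", encoding="utf-8") as f:
    hits = [w.strip() for w in f if suspicious.search(w)]

for word in hits[:20]:                      # print a sample for manual review
    print(word, "->", word.replace("B", "ß"))
print(f"{len(hits)} suspicious entries found")
```

Hits would still need manual review before correcting or removing them, since foreign names can legitimately contain an inner capital B.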

amitdo commented 5 years ago

https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

> theraysmith commented on Jan 23, 2017
>
> The text corpus is from all the www, taken several years ago, plus more recent data from wiki-something. The text is divided by language automatically, so there is a separate stream for each of the Devanagari-based languages (as there is for the Latin-based languages) and clipped to 1GB for each language. For each language, the text is frequency counted and cleaned by multiple methods, and sometimes this cleaning is too stringent automatically, or not stringent enough, so forbidden_characters and desired_characters are used as a guide in the cleanup process. There are other lang-specific numbers like a 1-in-n discard ratio for the frequency. For some languages, the amount of data produced at the end is very thin.
>
> The unicharset is extracted from what remains, and the wordlist that is published in langdata. For the LSTM training, I resorted to using Google's parallel infrastructure to render enough text in all the languages. However much or little corpus text there is, the rendering process makes 50000 chunks of 50 words to render in a different combination of font and random degradation, which results in 400000-800000 rendered textlines. The words are chosen to approximately echo the real frequency of conjunct clusters (characters in most languages) in the source text, while also using the most frequent words.
>
> This process is all done without significant manual intervention, but counts of the number of generated textlines indicates when it has gone badly, usually due to a lack of fonts, or a lack of corpus text. I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation.
>
> Community input is therefore extremely useful, and usually results in edits to forbidden_characters and desired_characters, which in turn guides the filtration process. Community-provided corpus text would be useful for languages that have very little or no training data, given appropriate copyright/licensing clearance.
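
The actual cleanup tools are Google-internal, but the character-based part of the process described above can be approximated with a simple filter. In the sketch below, the file names, the character sets standing in for forbidden_characters and desired_characters, and the reading of the "1-in-n" frequency discard are all assumptions, not the real pipeline:

```python
# Rough approximation of the described filtration step; NOT the actual pipeline.
# File names, character sets and the discard interpretation are assumptions.
from collections import Counter

FORBIDDEN = set("<>|{}")   # stand-in for the language's forbidden_characters
DESIRED = set("äöüßÄÖÜ")   # stand-in for desired_characters (must end up in the unicharset)
DISCARD_N = 10             # one possible reading of the "1-in-n" frequency discard

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for word in line.split():
            if FORBIDDEN & set(word):
                continue               # drop words containing forbidden characters
            counts[word] += 1

# Keep only words that occur at least DISCARD_N times.
wordlist = sorted(w for w, c in counts.items() if c >= DISCARD_N)

with open("filtered.wordlist", "w", encoding="utf-8") as out:
    out.write("\n".join(wordlist) + "\n")

# The unicharset is extracted from what remains, plus the desired characters.
unicharset = sorted(set("".join(wordlist)) | DESIRED)
print(f"{len(wordlist)} words kept, {len(unicharset)} characters in the unicharset")
```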

amitdo commented 5 years ago

> wiki-something

Wikipedia? Other Wikimedia wikis?

wrznr commented 5 years ago

> Community-provided corpus text would be useful for languages

Let's say we provide corpus text. Is there only the slightest chance that retraining *.tessdata files is going to happen? Does anyone even know the necessary commands for rebuilding the models provided in the tessdata repos?

zdenop commented 5 years ago

IMO (I have not tried it yet) it should be possible at least for LSTM: see the training-from-scratch wiki. Experience from training the legacy engine (tesseract 3.x) was that nobody was able to match Google's trained data results for standard fonts, so I would not invest time in retraining the legacy part (unless you have a very specific font for which the current data gives bad results).
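
For what it's worth, that wiki page boils down to two steps: generating .lstmf line data with tesstrain.sh and then running lstmtraining from scratch with a --net_spec. The sketch below only wraps those two calls in Python; all paths, the fonts directory, the iteration count, and the net_spec (copied from the wiki's English example) are placeholders, not the settings Google used:

```python
# Sketch of the two training-from-scratch steps from the Tesseract training wiki.
# All paths and parameters are placeholders; adjust them for the target language.
import subprocess

LANG = "deu"
OUT = "train_output"

# 1) Render text lines with text2image and build the starter traineddata (.lstmf data).
subprocess.run([
    "tesstrain.sh",
    "--lang", LANG,
    "--linedata_only",
    "--noextract_font_properties",
    "--fonts_dir", "/usr/share/fonts",
    "--langdata_dir", "langdata_lstm",
    "--tessdata_dir", "tessdata",
    "--output_dir", OUT,
], check=True)

# 2) Train an LSTM network from scratch on the generated line data.
#    The O1c value of the net_spec must match the unicharset size of the language.
subprocess.run([
    "lstmtraining",
    "--traineddata", f"{OUT}/{LANG}/{LANG}.traineddata",
    "--net_spec", "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]",
    "--model_output", f"{OUT}/base",
    "--train_listfile", f"{OUT}/{LANG}.training_files.txt",
    "--max_iterations", "10000",
], check=True)
```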

wrznr commented 5 years ago

Thanks for your estimation. I guess reproducing the current models would be very useful before trying to improve them. I'll give it a try. And yes, I am only interested in LSTM training.

stweil commented 5 years ago

My own experience with legacy training is different. It was quite easy to train a usable Fraktur model (frk.traineddata), but so far I have not succeeded in training a similar LSTM model from scratch.

Legacy training only requires a selection of good fonts and a short training text which includes all glyphs, so it is sufficient to make an artificial text listing those glyphs.
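
To illustrate that last point, such an artificial text can be generated mechanically from the glyph set. A minimal sketch; the character set and the line layout are arbitrary choices:

```python
# Generate an artificial training text that lists every glyph of a character set
# (sketch for legacy-style training; charset and formatting are arbitrary).
charset = ("abcdefghijklmnopqrstuvwxyzäöüß"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ"
           "0123456789.,;:!?()-\"'")

chunk = 20  # glyphs per line, to keep the rendered lines short
lines = [" ".join(charset[i:i + chunk]) for i in range(0, len(charset), chunk)]

with open("artificial_training.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")
```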

wrznr commented 5 years ago

Just to make sure: by reproducing, I mean more or less exactly reproducing the current state of the stack models.

stweil commented 5 years ago

I am afraid that reproducing the current models won't be possible, maybe not even with Google-internal information. If the text used for training was extracted from Internet sources (it looks like that), then that extraction cannot be reproduced. The original extracted text would be needed, as well as how it was distributed over the trained fonts and which parameters were used for text2image. If the distribution was random, it can only be reproduced if it used pseudo-randomness and the random sequence is reproducible.
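
To illustrate that last condition: a reproducible distribution needs at least a fixed seed for the pseudo-random generator, e.g. when assigning text chunks to fonts. The font list and chunks below are made up:

```python
# Reproducible pseudo-random assignment of text chunks to fonts (illustration only;
# the font list and chunks are made up).
import random

fonts = ["Latin Modern Roman", "DejaVu Serif", "FreeSerif"]
chunks = [f"chunk_{i:05d}" for i in range(10)]

rng = random.Random(42)   # fixed seed: the same assignment on every run
assignment = {chunk: rng.choice(fonts) for chunk in chunks}
print(assignment)
```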

Most of the current models have known deficits, so maybe it is not a great loss if they cannot be reproduced exactly. The important thing is finding a way to get new models from scratch without those deficits, but with comparable or better quality, and with a 100 % defined training process.

zdenop commented 5 years ago

Just to be clear regarding my statement about the legacy engine: Fraktur fonts belong to the special fonts.

amitdo commented 5 years ago

Another issue is that some of the fonts they used for training are not open source fonts and cost some $$.

stweil commented 5 years ago

@wrznr, I think that Ray's statement is the best piece of information which we currently have on the training done by Google.

> The text corpus is from all the www, taken several years ago, plus more recent data from wiki-something. The text is divided by language automatically, so there is a separate stream for each of the Devanagari-based languages (as there is for the Latin-based languages) and clipped to 1GB for each language.

A 1 GB text file for a single language which was taken from "all the www" is not only too large to be easily handled, but will also contain lots of copyrighted text. That might be a major reason why such files could not be shared.

wrznr commented 5 years ago

@stweil I missed that piece of information. Thanks. I always thought that the training texts would be part of the data repos. If this is not the case, I really think we should make an effort and come up with re-trainable alternatives. Wikipedia could be a good source for the texts.

stweil commented 5 years ago

The small training texts in the data repos were sufficient for the legacy model. I have no idea how the larger training texts in langdata_lstm were used at Google, but obviously they are much less than a gigabyte.

Wikipedia can contribute training text, but those texts use modern language and are not formatted like printed books. Wikisource offers older texts, and other projects (like Project Gutenberg) also offer the typical book layout. I expect a higher quality from those sources than from a more random www sample. Maybe we can also use other large existing text corpora.

wollmers commented 3 years ago

Just my 2 cents as a comment on what the basic language models should be:

1) modern language, let's define it for German as 1950 or later

Personally, I gave up the idea of distinguishing orthographies at the boundaries of 1750, 1830, 1875, 1901 and 1996. Now I just divide my corpora into periods of 50 years like 1800-1849, 1850-1899, etc. (a small sketch follows below); it is always possible to combine them into longer periods.

Modern, because I assume the majority of users need modern language. Archives and libraries have other requirements and can help themselves.
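
The 50-year period label mentioned under 1) can be computed mechanically; a tiny sketch:

```python
# Bucket a publication year into a 50-year period label such as "1800-1849".
def period(year: int, width: int = 50) -> str:
    start = year - year % width
    return f"{start}-{start + width - 1}"

print(period(1823))  # 1800-1849
print(period(1875))  # 1850-1899
```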

2) training text

Of all the available corpora I know, https://wortschatz.uni-leipzig.de/de/download provides random "proper" sentences in different sizes and domains (e.g. news, web, Wikipedia). For German there are up to 300 M sentences, which is IMHO not very handy to process. The license is friendly:

> All corpora provided for download are licensed under CC BY.

To show what 1 M sentences mean:

deu-at_web-public_2019_1M

                           TOTAL         UNIQUE
words:                  18180427         900958 # tokens, i.e. including punctuation tokens
chars:                 100837015            636 # graphemes, but only a few with more than 1 codepoint
bigrams:                84990490           9719
trigrams:               70289490          91441
word size avg.:             5.55

Thus 1 M sentences need ~100 MB. In line with Zipf's law, the average of ~18 tokens per sentence is very constant across the German corpora. The size of the alphabet (unique chars/graphemes) varies a lot, because some corpora also include non-Latin scripts like Greek, Arabic, Hebrew, Cyrillic, Chinese, and emoticons.
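
Numbers like those in the table above can be reproduced with a small script. The sketch below assumes a plain UTF-8 file with one sentence per line (the Leipzig downloads may additionally carry a leading sentence id per line, which would have to be stripped first) and only covers tokens, characters and average word size:

```python
# Compute basic corpus statistics from a one-sentence-per-line UTF-8 file
# (rough sketch; the file name is a placeholder, bigrams/trigrams are omitted).
from collections import Counter

tokens = Counter()
chars = Counter()

with open("sentences.txt", encoding="utf-8") as f:
    for line in f:
        for tok in line.split():
            tokens[tok] += 1
            chars.update(tok)

total_tokens = sum(tokens.values())
total_chars = sum(chars.values())
print(f"words:          {total_tokens:>12}  unique: {len(tokens):>8}")
print(f"chars:          {total_chars:>12}  unique: {len(chars):>8}")
print(f"word size avg.: {total_chars / total_tokens:12.2f}")
```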

BTW: None of the corpora I know is free of spelling errors. Even the DTA still has errors like Dundes- -> Bundes-, -uug -> -ung, and many long-s/f mismatches. In a crawled corpus there would be even more errors.

I am not sure whether size matters for training, i.e. whether there would be a gain in accuracy using 1 GB of text versus 100 MB, or whether accuracy would degrade. Other works using CTC/(B)LSTM show a stagnation with increasing dictionary sizes up to 90 K words (morphemes or surface forms). HMMs degrade early, but exactly this was the reason to use CTC.

3) character set

IMHO the current character set of deu.traineddata is too small. See https://github.com/tesseract-ocr/langdata_lstm/issues/45 for the missing bullet character. Some of the more frequent characters, like EM DASH, should be included. Letters outside the official alphabet a-z äöüß A-Z ÄÖÜ should also be allowed, to cover foreign words or names, if they appear in the training texts and are part of Latin-1 or Latin-2.
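
One way to back such proposals with numbers is to count which characters occur in the training text but are outside the currently supported set. A rough sketch; the file name and the KNOWN set below are simplifications, not the real unicharset:

```python
# List frequent characters in the German training text that are outside a given
# character set (sketch; the file name and KNOWN set are simplifications).
from collections import Counter

KNOWN = set("abcdefghijklmnopqrstuvwxyzäöüß"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ"
            "0123456789 .,;:!?()-\"'")

counts = Counter()
with open("deu/deu.training_text", encoding="utf-8") as f:
    for line in f:
        counts.update(line.rstrip("\n"))

missing = [(ch, n) for ch, n in counts.most_common() if ch not in KNOWN]
for ch, n in missing[:25]:
    print(f"U+{ord(ch):04X}  {ch!r}  {n}")
```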