tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

German Fraktur #59

Open amitdo opened 7 years ago

amitdo commented 7 years ago

From https://github.com/tesseract-ocr/tesseract/issues/40

@stweil commented

Are there also new data files planned for old German (deu_frak)? I was surprised that the default English model with LSTM could recognize some words.

@theraysmith commented

I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde Shoppe). I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Olde Shoppe for English?

stweil commented

Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly even with the English data Tesseract was able to recognize at least some words written in Fraktur.

There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and it basically started the modern German language (High German), which is still in use today.

@jbaiter commented

I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I have yet to publish it, but if you have somewhere I could send or upload it, I'd be glad to.

theraysmith commented

The md file documents the training process in tutorial detail, but line boxes and transcriptions sound perfect!

300k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first. For now it might be best to hang on for the instructions.

jbaiter commented

The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.

Related: https://github.com/tesseract-ocr/tessdata/issues/49

amitdo commented 7 years ago

@jbaiter,

I suggest you upload the textual part to a GitHub repo. Add the CC0 license info and mention the source of the data (which books and/or newspapers were used and who transcribed them).

Hopefully, Ray will use it to train a new (LSTM) deu_frak traineddata.

theraysmith commented 7 years ago

I found a problem with the synthetic training pipeline. The Fraktur fonts were only about 1% of the training data, even for the frk language. This will be fixed in my next training run, which I hope to start this week (as I have been hoping for the past 4 weeks).

I'm also going to fix the single char/single word issue that was raised as an objection to deleting the legacy engine.

There will also be major changes to the Indic training data, but I have no idea whether it will affect the accuracy, as it still doesn't work properly...

I now have a lot more training data for even the languages where before I said I didn't have much.

stweil commented 7 years ago

Ray, could you please have a look at the questions I sent to the tesseract-dev forum regarding the quality of the training data and the characters used for training, ideally before you start a new training run? See also issue #55 (which also applies in similar form to any other European language, even eng: all those languages currently use incomplete character sets).

Shreeshrii commented 7 years ago

I have done a legacy training run using the existing deu_frak box/tiff pairs, a few box/tiff pairs from eMOP, and some synthetic tiff/box pairs rendered from fonts.

The traineddata is attached. @stweil, you can check how it compares to the old deu_frak as well as to your own training trials.

deu_frak.zip

stweil commented 7 years ago

At first glance there is no clear winner. Your deu_frak.traineddata improves the recognition of some characters/words, but it also produces words which exist in the German language but don't match the image. Some of my experiments with legacy training based on frk gave similar results; two of them look better. I'll continue those tests and report more precise data later.

stweil commented 7 years ago

P.S. I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98% were reported for OCR of Fraktur. All my current results are far from that precision.

Shreeshrii commented 7 years ago

produces words which exist in the German language but don't match the image.

Yes, I had noticed that with the Devanagari script and the legacy traineddata as well.

Could it be related to the dictionary/wordlists/dawgs?

Hope you are able to get improved accuracy with your training for Fraktur.
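One way to check whether the dictionaries are the cause is to disable the word dawgs at run time and compare the output. A minimal sketch (the image and model names are placeholders; load_system_dawg and load_freq_dawg are standard Tesseract config variables):

    # Recognize without the system word list and frequent-word dawg,
    # so dictionary correction cannot substitute plausible German words.
    tesseract page.png page_nodict -l deu_frak \
      -c load_system_dawg=0 -c load_freq_dawg=0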

amitdo commented 7 years ago

I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98% were reported for OCR of Fraktur. All my current results are far from that precision.

Lies, damned lies, and statistics

Don't believe it unless you can test it yourself on a large and diverse dataset.

Shreeshrii commented 7 years ago

German American Newspapers

https://collection1.libraries.psu.edu/cdm/search/collection/frak/searchterm/newspapers

amitdo commented 7 years ago

I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98% were reported for OCR of Fraktur. All my current results are far from that precision.

https://arxiv.org/pdf/1701.07395.pdf

They report a 97% character accuracy rate after training on 400 lines with ocropy.

Shreeshrii commented 7 years ago

This is training specific to one book/font. Tesseract does generalized training with many fonts.

Have you had any success running ocropus?

amitdo commented 7 years ago

This is training specific to one book/font. Tesseract does generalized training with many fonts.

So does ocropy. But you can improve the results if you train for a specific book. I would use the generic trained data and build upon it.

stweil commented 7 years ago

Building upon existing trained data is currently not possible because that data does not include all needed characters, and adding characters is unsupported with LSTM.

amitdo commented 7 years ago

Still, you can make generic trained data yourself with all the characters you want from a large set of digital fonts, and then fine-tune with 100-400 lines from a book/newspaper. This is relevant to both Tesseract and ocropus.
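On the Tesseract side, such a generic model from digital fonts is usually built by rendering synthetic line data with the tesstrain.sh wrapper from the training tools. A rough sketch only; the font name, directories, and paths are placeholders and will differ per setup:

    # Render synthetic training lines for a Fraktur-capable model from installed fonts
    training/tesstrain.sh --fonts_dir /usr/share/fonts \
      --fontlist "UnifrakturMaguntia" \
      --lang deu --linedata_only --noextract_font_properties \
      --langdata_dir ../langdata --tessdata_dir ./tessdata \
      --output_dir ~/frak_training

The resulting line data can then be used for the fine-tuning step discussed later in this thread.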

Shreeshrii commented 7 years ago

Amit, the LSTM process for training from scanned images has not been defined yet.

Shreeshrii commented 7 years ago

https://github.com/jze/ocropus-model_fraktur

This is a character model for recognizing Fraktur with OCRopus. On test data from a book that was not used in the training process, it yields an excellent error rate of 0.296%. It is slightly better than the 'standard' Fraktur model, which has an error rate of 0.466%.

amitdo commented 7 years ago

In the case you mention, the difference is insignificant. I have read reports of a much larger difference.

theraysmith commented 7 years ago

After a lot of work, and a very long delay, the new training is almost ready to go. Just waiting for rendering to finish...

Fixes in this round:

  1. Utilizes a new crawl of the web for ~60 languages that had the least training data, plus ~15 new languages that we didn't have before. This provides much more training data, with better estimates of what is in the character set, and better wordlists. I've just checked over this thread, the thread on tesseract-dev, and issue #55, and all the requested missing characters will be in.

  2. frk, enm, frm, ita_old, and spa_old will all have a much better response to Fraktur, and probably a worse response to non-Fraktur. Previously there was a bug and <1% of the training images were Fraktur; now it will be more like 75%.

  3. New and improved text filters for languages that use a "virama" character. The training data for all the Indic languages is thus much cleaner, but until it is trained, I have no idea of the effect on accuracy.

  4. Single character/grapheme and single word entries are added to the training sets, which should improve accuracy on shorter lines.

I've also added an experiment to throw all the Latin languages together into a single engine (actually a separate model for each of 36 scripts). If that works, it will solve the problem of reading "Citroën" in German and picking up the e umlaut. The downside is that this model has almost 400 characters in it, despite carefully keeping out the long-tail graphics characters. Even if it does work, it will be slower, but possibly not much slower than running 2 languages. It will have about 56 languages in it. I have some optimism that this may work, ever since I discovered that the vie LSTM model gets the phototest.tif image 100% correct.

I'm also experimenting with a new osd model. It has to be replaced to eliminate the old engine.
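For comparison, running two existing models together is already possible by combining language codes on the command line, which is the baseline the speed estimate above refers to. A trivial sketch; the image name is a placeholder:

    # Recognize with two models at once (German plus English)
    tesseract citroen.png out -l deu+eng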

amitdo commented 7 years ago

Ray, Thanks for your hard work and thanks for this update!

Shreeshrii commented 7 years ago

​Thanks for the update and your work on this, Ray.

Just checking whether this new training will also address:

  1. Devanagari transliterated in the Roman script with accents, e.g. http://www.claysanskritlibrary.org/excerpts/CSLFrontMatter.pdf

  2. Correct handling of superscripts, TM and other signs

  3. Traineddata for MICR

  4. Traineddata for Seven Segment (or 14 segment) Display

  5. Allow for whitelisting/blacklisting to ensure only numeric results.

I look forward to testing with the newer code and Indic traineddata.

theraysmith commented 7 years ago

On Wed, Mar 29, 2017 at 9:32 PM, Shreeshrii notifications@github.com wrote:

​Thanks for the update and your work on this, Ray.

Just checking whether this new training will also address:

  1. Devanagari transliterated in Roman script with accents eg. http://www.claysanskritlibrary.org/excerpts/CSLFrontMatter.pdf

Will probably be handled by the 'Latin' language.

  2. Correct handling of superscripts, TM and other signs

Beyond the scope of this change. Sub/superscripts are much harder to deal with, as they have to be trained, and that means incorporating them correctly into the training path and working out how to pass the information back out of the line recognizer to the output. At the moment it seems the iterator supports discovery of sub/superscripts, but there is no output renderer that handles it. (Not even hOCR?) TM is also difficult, as it conflicts with the needs of the fi/fl ligatures, which should not appear in the output. Question: for which languages/scripts is it desirable to support sub/superscripts?

  3. Traineddata for MICR

Beyond the scope of this change.

  4. Traineddata for Seven Segment (or 14 segment) Display

Beyond the scope of this change.

  5. Allow for whitelisting/blacklisting to ensure only numeric results.

A simple code change not related to training.
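For reference, the usual way to restrict output to digits is the tessedit_char_whitelist variable. A sketch; at the time of this thread the whitelist was, as far as I know, only honoured by the legacy engine, hence --oem 0, and the image name is a placeholder:

    # Restrict recognition to digits via the character whitelist (legacy engine)
    tesseract meter.png out --oem 0 -c tessedit_char_whitelist=0123456789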

Shreeshrii commented 7 years ago

Ray, Thanks for your prompt response.

  1. I hope you have also noted that the language code frk is for Frankish, which is not the same as German Fraktur. It may be helpful to update the langdata and add deu_frak, dan_frak etc., or at least a generic frak similar to latn and deva.

  2. I am trying to do fine-tuning for a seven-segment display, using eng.traineddata as the base and training text in about 10 SSD fonts with numbers and CAPITAL letters. Is that the recommended strategy, or would replacing a layer give better results?

Also, should any kind of wordlist/dictionary be included for what may be random combinations of letters and numbers?

  3. Regarding superscripts/subscripts etc., I can point out three cases based on the languages I know.

a. English: books, theses, etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in the Latin script. Usually these will be at the end of words.

b. Tamil: Sanskrit texts transliterated in the Tamil script use superscript/subscript 2, 3, 4 (sometimes 1 as well) to distinguish between different sounds (to represent the Sanskrit alphabet, which does not have a direct mapping in the Tamil script). These can actually appear in the middle of Tamil words.

c. Hindi, Sanskrit, and other Indian languages: Hindi books, theses, etc. use superscripts for referring to footnotes (similar to English above). The difference is that in some cases these use the Latin digits 0-9 and in other cases Devanagari digits (for Hindi, Sanskrit, etc.). Unicode has superscripts 0-9 for the Latin script but not for the Devanagari script. I would suggest support for the Latin-script superscript numbers.

Scanned pages with Devanagari superscripts should also be mapped to the Latin-script superscript numbers. The same goes for other Indian languages.

  4. TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output.

Is this controlled via the normalized form in the unicharset? Can different processing be applied based on the normalized form there?

thanks!

stweil commented 7 years ago

language code frk is for Frankish, which is not the same as German Fraktur

As the current data tried to implement German Fraktur, renaming frk to deu_frak might be the simplest fix for the moment.

English: books, theses, etc. have a number of footnotes referred to in the text with superscripts. I guess this will apply to all languages written in the Latin script. Usually these will be at the end of words.

At least it applies to German. There are also superscripts after punctuation characters at the end of sentences.

Should all superscripts be handled in the same way, or do we need different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³?

Shreeshrii commented 7 years ago

See page 3 in http://sanskritdocuments.org/doc_ganesha/gaNanAyak8-ta.pdf for superscripts usage in Tamil.

Sample of subscript numbers usage in Tamil - http://srivaishnavam.com/stotras/sristuti_tamil.pdf

Shreeshrii commented 7 years ago

Please see https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

Unicode has subscripted and superscripted versions of a number of characters including a full set of Arabic numerals.

The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.

Shreeshrii commented 7 years ago

Should all superscripts be handled in the same way, or do we need different handling for those superscripts which have a special UTF-8 code like ¹, ² or ³?

All superscripts have a special UTF-8 code, though in different ranges. Not all fonts support all of the superscripts and subscripts.

http://www.alanwood.net/unicode/latin_1_supplement.html

http://www.alanwood.net/unicode/superscripts_and_subscripts.html

amitdo commented 7 years ago

I opened a new issue for 'Superscripts & subscripts' at #62

Shreeshrii commented 7 years ago

language code frk is for Frankish, which is not the same as German Fraktur

As the current data tried to implement German Fraktur, renaming frk to deu_frak might be the simplest fix for the moment.

I agree.

amitdo commented 7 years ago

I opened a new issue for 'Correct handling of TM sign' at #63

theraysmith commented 7 years ago

Thanks for opening the new issues #62 and #63. I will continue to think about the best approach. I tried to include TM in the current round of training, but it is too infrequent to have made the cut line. I will have to add it to the desired_characters list.

Where is frk documented as Frankish? It does NOT occur in my usual reference: https://www.loc.gov/standards/iso639-2/php/code_list.php

We have inconsistent naming for the old versions of European languages: enm, frm, ita_old, spa_old, frk. How would it suit to have a generic "Fraktur" language that covers all of these, and trained with ~50% Fraktur fonts and 50% the other 4500 Latin fonts?

stweil commented 7 years ago

How would it suit to have a generic "Fraktur" language that covers all of these, and trained with ~50% Fraktur fonts and 50% the other 4500 Latin fonts?

That might be very interesting (similar to the generic model for modern European languages). It would still be possible to replace the language specific parts of the traineddata file.

There is also a large number of publications (mostly Latin, but also English, German, Spanish, French, and Italian) from the 16th to the 18th century which use "normal" Latin fonts (Antiqua) with additional characters and ligatures not found in modern texts. In particular, the long s character (ſ) is very frequent in those old texts. For those, a generic training might also help.

amitdo commented 7 years ago

Where is frk documented as Frankish? It does NOT occur in my usual reference: https://www.loc.gov/standards/iso639-2/php/code_list.php

Library of Congress is the ISO 639-2 Registration Authority. SIL International is the ISO 639-3 Registration Authority.

http://www-01.sil.org/iso639-3/ http://www-01.sil.org/iso639-3/codes.asp?order=639_3&letter=f http://www-01.sil.org/iso639-3/documentation.asp?id=frk

amitdo commented 7 years ago

ISO 15924 has 'Latf' for 'Latin, Fraktur' variant. https://en.wikipedia.org/wiki/ISO_15924:Latf

amitdo commented 7 years ago

BCP 47 language tags are used in HTML and XML, and should also be used in hOCR.

https://kba.github.io/hocr-spec/1.2/#sec-lang https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes https://tools.ietf.org/html/bcp47 http://www.iana.org/assignments/language-subtag-registry

amitdo commented 7 years ago

Another link:

http://www.langtag.net/

This Web site is a work of the IETF LTRU working group.

stweil commented 7 years ago

@theraysmith, can you already estimate when the updated traineddata files will be available on GitHub?

stweil commented 7 years ago

New traineddata is now available.

I've also added an experiment to throw all the Latin languages together into a single engine.

The new best/Fraktur.traineddata is such a composition of Latin languages. Ray, it basically works well (much better than any previous Tesseract model for Fraktur), also on text parts written in Antiqua, but it still does not include the paragraph character: § is detected as S or $. How do you test the completeness of the characters in your training set?
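One way to check whether a character made it into a shipped model is to unpack the traineddata and search its unicharset. A sketch; the exact name of the unpacked unicharset component may differ between model versions:

    # Unpack the traineddata components and look for § in the LSTM unicharset
    combine_tessdata -u Fraktur.traineddata Fraktur.
    grep -n '§' Fraktur.lstm-unicharset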

amitdo commented 7 years ago

I was at an OCR workshop in Würzburg for the last two days, where character recognition rates of up to 98% were reported for OCR of Fraktur. All my current results are far from that precision.

Stefan, do you reach 98% character accuracy now, with the best traineddata?

stweil commented 7 years ago

:-). Sure, but only with selected images. In addition to the problems noted above, best/Fraktur tends to confuse upper case S with lower case s, resulting in something like stefan instead of Stefan which is not nice (and surprising because the Fraktur character images for S and s are totally different). best/frk does not show that effect.

stweil commented 7 years ago

Ray, you mentioned that training additional characters is possible now. How can I add the missing § myself?

amitdo commented 7 years ago

best/Fraktur tends to confuse upper case S with lower case s, resulting in something like stefan instead of Stefan which is not nice (and surprising because the Fraktur character images for S and s are totally different). best/frk does not show that effect.

Fraktur was trained with English too, so I guess it learns it from there.

amitdo commented 7 years ago

How can I add the missing § myself?

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
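In outline, the procedure behind that link extracts the LSTM model from the existing traineddata and continues training on a small amount of line data that contains the new character. A rough sketch, not the exact commands from the wiki; the paths, the starter traineddata with a §-extended unicharset, and the iteration count are placeholders:

    # Extract the LSTM model from the existing best traineddata
    combine_tessdata -e Fraktur.traineddata Fraktur.lstm

    # Continue training on lines containing §, using a starter traineddata whose
    # unicharset already includes the new character
    lstmtraining --continue_from Fraktur.lstm \
      --old_traineddata Fraktur.traineddata \
      --traineddata frak_plus_section/Fraktur.traineddata \
      --train_listfile frak.training_files.txt \
      --model_output finetune/frak --max_iterations 400

    # Pack the fine-tuned checkpoint back into a usable traineddata file
    lstmtraining --stop_training \
      --continue_from finetune/frak_checkpoint \
      --traineddata frak_plus_section/Fraktur.traineddata \
      --model_output Fraktur_section.traineddata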

stweil commented 7 years ago

Thanks for the link – I'll try that and report the result.

amitdo commented 6 years ago

https://github.com/jbaiter/archiscribe-corpus

stweil commented 6 years ago

@amitdo, thank you for the link. @jbaiter, thank you for uploading the data. Can you estimate the quality of the data? From our own experience I know that getting very high-quality ground truth is difficult. My first random look at your data landed on a line with 38 characters, while the original line has 39: the transcription missed a clearly visible 'e'.

jbaiter commented 6 years ago

@stweil It's crowdsourced (currently primarily by me...) and thus does not really claim to be highly accurate at all. There is, however, an editing interface available at https://archiscribe.jbaiter.de ("Bisherige Transkriptionen anzeigen", i.e. "show previous transcriptions") where you can correct mistakes like that. Of course, pull requests are always welcome!

soloturn commented 5 years ago

@stweil @zdenop @amitdo How can I help here? I tried to convert "Stolz und Vorurtheil" (Pride and Prejudice) from here: https://de.wikisource.org/wiki/Jane_Austen

What parameters would you use to get a reasonable output?

stweil commented 5 years ago

The results with Tesseract and the Fraktur models (-l frk or -l script/Fraktur) on an image without preprocessing are worse than the existing OCR.

To get better results with Tesseract, you could try these things:

soloturn commented 5 years ago

Cool, Stefan, thanks for the link!! I will try to put this into Wikisource to get it corrected. Can you post a full tesseract (or other) command line to get the text out of this PDF, please? I am a beginner and failed at it ...

stweil commented 5 years ago

If you are only interested in the text, you can get it directly – no need to extract it from the PDF.
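If one does want to run Tesseract over the scans themselves, a minimal pipeline might look like this (a sketch, assuming poppler's pdftoppm and the script/Fraktur model are installed; the file names and the 300 DPI value are placeholders):

    # Convert the PDF pages to 300 DPI PNG images
    pdftoppm -r 300 -png stolz_und_vorurtheil.pdf page

    # Recognize each page with the Fraktur script model and collect the plain text
    for img in page-*.png; do
      tesseract "$img" "${img%.png}" -l script/Fraktur
    done
    cat page-*.txt > stolz_und_vorurtheil.txt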