unicharset_extractor's output is broken for a particular case

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Generate a tif/box pair with text2image for the attached text file, using any 
suitable font that contains musical symbols (Bravura, Euterpe, GNU FreeSerif).

2. Run unicharset_extractor on the resulting box file and examine the resulting 
unicharset file.

What is the expected output? A valid unicharset file.

What do you see instead? A broken unicharset file containing the obviously wrong 
words JOINED and BROKEN (see attachment).

What version of the product are you using? On what operating system?
The latest git clone, commit 4c7c960bfd57c5863fe639afab801080d9ef8bbe
Ubuntu 14.04 LTS

I don't have any clue what it all means but I cannot proceed with further 
training due to this issue.

Original issue reported on code.google.com by maximums...@googlemail.com on 22 Feb 2015 at 11:28

GoogleCodeExporter commented 9 years ago

Issue 1279 has been merged into this issue.

Original comment by zde...@gmail.com on 22 Apr 2015 at 8:17

GoogleCodeExporter commented 9 years ago

First: I don't think Tesseract is going to be particularly suitable for OMR; 
for one thing, OMR systems usually have a staff line removal process that 
Tesseract doesn't have. You might have better luck with OpenOMR 
(https://sourceforge.net/projects/openomr/) or Audiveris 
(https://audiveris.kenai.com/)

Second: I'm not sure what the significance of Joined and Broken is, but I 
think they need to be there. I created a traineddata file last week, and 
couldn't proceed without them.

Original comment by joregan on 13 May 2015 at 4:26

GoogleCodeExporter commented 9 years ago

1) It was not my intention to (mis)use Tesseract for OMR tasks. Our project - 
Audiveris - uses Tesseract for recognizing textual items. Musical scores often 
contain text strings with musical symbols inside. In the attached example there 
is a quarter note in the middle of a string. Other text strings containing 
musical symbols are often guitar chords, repeat indications etc.

Currently, running Tesseract on images containing musical symbols produces 
wrong characters. I would like to fix it by adding recognition of musical 
symbols to the OCR engine.

2) Could someone kindly explain to me what these "Joined" and "Broken" entries 
mean? Is this an error or expected behaviour? I wasn't able to find any 
documentation. It looks like I need to dig deeply into the (mostly 
undocumented) source code. 

My interpretation is that Tesseract OCR currently supports only a small subset 
of the Unicode charset. The musical symbols do not seem to be supported, hence 
these "Joined" and "Broken" words.

Thanks in advance for your clarification.
Max

Original comment by maximums...@googlemail.com on 13 May 2015 at 8:37

Attachments:

Moderato.tiff

GoogleCodeExporter commented 9 years ago

Aaah, ok. Years of seeing the weird things people ask about on the mailing list 
have made me a little skeptical, I guess :)

They are special characters for internal use. In ccutil/unicharset.cpp, there's:

// List of strings for the SpecialUnicharCodes. Keep in sync with the enum.
const char* UNICHARSET::kSpecialUnicharCodes[SPECIAL_UNICHAR_CODES_COUNT] = {
    " ",
    "Joined",
    "|Broken|0|1"
};

but I can't see anything in particular beyond that. I've asked Ray, hopefully 
he'll get a chance to answer.
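
To make it easier to tell these reserved entries apart from real characters when reading a unicharset file, here is a minimal standalone sketch. It is not Tesseract code: it only copies the three strings from the array above into a separate list and checks an entry against them. The names kReservedUnichars and IsReservedUnichar are made up for this illustration.

```cpp
#include <string>

// Illustrative copy of the three strings listed in
// UNICHARSET::kSpecialUnicharCodes (ccutil/unicharset.cpp).
// kReservedUnichars is a hypothetical name, not part of Tesseract.
const char* const kReservedUnichars[] = {
    " ",
    "Joined",
    "|Broken|0|1",
};

// Returns true if `unichar` matches one of the reserved strings exactly.
// IsReservedUnichar is a hypothetical helper, not a Tesseract function.
bool IsReservedUnichar(const std::string& unichar) {
  for (const char* reserved : kReservedUnichars) {
    if (unichar == reserved) return true;
  }
  return false;
}
```

With this helper, IsReservedUnichar("Joined") and IsReservedUnichar("|Broken|0|1") are true, while an ordinary entry such as "a" is not matched. If the quoted array is what the extractor writes out, then seeing these strings in the generated unicharset is consistent with them being internal entries rather than misrecognized text.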

If I were to hazard a guess -- and please, bear in mind that it's just a guess 
-- I would say that Joined is probably for the case of letters that are smudged 
(to not have to have ligatures for every combination), and Broken|0|1 is maybe 
to have a placeholder when most of a letter is faded. 

Tesseract does only support a small subset of Unicode; the aim is to get good 
coverage for a particular language, though bearing in mind that foreign words 
(such as names) do appear. It helps to cut down on a lot of ambiguities to not 
have Cyrillic characters for a language that uses Latin letters and vice versa, 
for example.

Aside from the special characters mentioned here, there's very little 
hard-coded character treatment, and it's growing smaller all the time. 

The thing that will really have an impact on the results is that your 
unicharset uses the default kerning information. If you can locate a good set 
of fonts that feature these characters, we can extract better information from 
them, and that will give better results.

Original comment by joregan on 13 May 2015 at 10:36

oliveiracwb / tesseract-ocr

unicharset_extractor's output is broken for a particular case #1426