oliveiracwb / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Music symbols #606

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the attached image through Tesseract

What is the expected output? What do you see instead?
Expected result: "♫ ♪ Music playing ♫ ♪"
Actual result: "D J' Music playing D J'"

What version of the product are you using? On what operating system?
3.01

These symbols are pretty common in subtitles, so it would be brilliant if you 
could add these characters!

Original issue reported on code.google.com by nikse.dk@gmail.com on 20 Jan 2012 at 11:15

GoogleCodeExporter commented 9 years ago
Successfully generated output similar to musicsymols.png; see the attached files.
Tested with version 3.02 on Windows XP (SP3). The traineddata file is also attached.
-sriranga(79yrs)

Original comment by withbles...@gmail.com on 26 Feb 2012 at 2:26

GoogleCodeExporter commented 9 years ago
I have a similar need when processing text lines from within the Audiveris OMR 
program.
Audiveris can process the staves and music symbols itself, but it delegates the 
transcription of text lines to Tesseract. Unfortunately, some text lines 
contain a music character. For example (see also the 2 attached files):
- A tempo indication is often written as (J = 69), where 'J' should be the 
quarter-note sign.
- A guitar chord name with a flat alteration, like Abm for A flat minor, where 
'b' should actually be the flat sign.
We have just switched from Tesseract V2 to V3.02. How could we use the training 
features of Tesseract to recognize these musical symbols? We only need a very 
limited number: quarter, eighth, flat, nothing more.
Could withblessings@gmail.com give us a hand based on his example?
Thanks
/Hervé (owner of open source Audiveris)

Original comment by herve.bi...@gmail.com on 16 Jun 2012 at 3:39
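
For reference, the symbols named in this thread (quarter note, eighth note, flat, plus the beamed eighth notes from the original report) all have code points in Unicode's Miscellaneous Symbols block. Below is a minimal Python sketch, not part of the original comment, that prints them using only the standard `unicodedata` module. The `SYMBOLS` table and the `describe` helper are illustrative names, not anything provided by Tesseract or Audiveris.

```python
import unicodedata

# Illustrative table: the music characters discussed in this thread.
SYMBOLS = {
    "quarter note": "\u2669",
    "eighth note": "\u266a",
    "beamed eighth notes": "\u266b",
    "flat sign": "\u266d",
}


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return "U+{:04X} {}".format(ord(char), unicodedata.name(char))


for label, char in SYMBOLS.items():
    print(label, "->", describe(char))
```

A list like this may help when deciding which characters a training text for a custom language has to contain.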

GoogleCodeExporter commented 9 years ago
Hervé,
Please see http://unicode.org/charts/PDF/U1D100.pdf. It appears that music 
fonts covering the Unicode chart are not available. Training requires music 
fonts, just as English training requires English fonts. If you can supply 
music fonts, I will try to generate a traineddata file for the fonts you 
provide.
sriranga(79yrs)

Original comment by withbles...@gmail.com on 17 Jun 2012 at 6:29

GoogleCodeExporter commented 9 years ago
I recently discovered Musica, a free music font that complies with Unicode 
values.
I'm using it to train Tesseract on musical symbols (work still in progress).
See http://users.teilar.gr/~g1951d/ and click on the Musica link.
/Hervé

Original comment by herve.bi...@gmail.com on 26 Jun 2012 at 7:18

GoogleCodeExporter commented 9 years ago
IMO music symbols are not very common.
I would suggest creating a custom "language", as Google did for mathematical 
symbols (see the equ package) for tesseract-ocr 3.02. Version 3.02 introduced 
simultaneous multi-language recognition, so once you have created 
music.traineddata you can run something like this:
  tesseract andantino.png andantino -l eng+music

Original comment by zde...@gmail.com on 5 Nov 2012 at 9:43
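
The `eng+music` invocation above can also be driven from a script. Below is a minimal Python sketch, assuming a `tesseract` executable on the PATH and a `music.traineddata` file that you have created and installed yourself (no such file ships with Tesseract). `build_tesseract_command` and `run_ocr` are hypothetical helper names; the only Tesseract option used is the `-l` flag shown in the comment.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for a multi-language Tesseract run.

    Languages are joined with '+', as in the 'eng+music' example.
    """
    if not languages:
        raise ValueError("at least one language is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run Tesseract; assumes every listed traineddata file is installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually execute it.
    print(" ".join(build_tesseract_command("andantino.png", "andantino", ["eng", "music"])))
```

Keeping the command construction separate from the execution makes it easy to check the language string without having Tesseract installed.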

GoogleCodeExporter commented 9 years ago
OK, it sounds like a solution with eng+music :)

Original comment by nikse.dk@gmail.com on 6 Nov 2012 at 6:06

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 15 Nov 2012 at 8:46

GoogleCodeExporter commented 9 years ago
sriranga... I don't suppose you still have the files used in the 
"combine_tessdata" command?

Original comment by nikse.dk@gmail.com on 7 Feb 2013 at 6:04

GoogleCodeExporter commented 9 years ago
Sorry, I have deleted everything stored on that drive to make space
for other items.
However, the traineddata file I uploaded under comment no. 1 can
still be used.

Original comment by withbles...@gmail.com on 8 Feb 2013 at 7:01

GoogleCodeExporter commented 9 years ago
OK, thx for the info.

I managed to get it working a bit. Now I just need a few more fonts and also 
italic. I'm getting there, but training Tesseract is not super easy...

Original comment by nikse.dk@gmail.com on 8 Feb 2013 at 7:31