rakuri255 / UltraSinger

AI-based tool that extracts vocals, lyrics, and pitch from music to autogenerate UltraStar Deluxe, MIDI, and notes files. It automates tapping, adding text, and pitching vocals, and creates karaoke files.
MIT License

Questionable results when hyphenating #105

Open DoubleDee73 opened 7 months ago

DoubleDee73 commented 7 months ago

Sometimes the output of the automatic hyphenation leaves a bit to be desired.

Examples:

bohning commented 7 months ago

That’s why I switched to dictionary files for UltraStar Creator: https://github.com/UltraStar-Deluxe/UltraStar-Creator/tree/master/syllabification.

As a side note, we’re talking about syllabification (splitting into singable syllables) rather than hyphenation (splitting written words at line breaks).
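The dictionary approach mentioned above can be sketched as a plain lookup with a fallback. This is only an illustration: the `SYLLABLE_DICT` entries, the `syllabify` function, and the in-memory format are assumptions made for this example, and the actual UltraStar Creator dictionary files may be structured differently.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical entries: lowercase word -> singable syllables.
# The real dictionary files may use a different format.
SYLLABLE_DICT: Dict[str, List[str]] = {
    "differently": ["dif", "fer", "ent", "ly"],
    "karaoke": ["ka", "ra", "o", "ke"],
}


def _cut_like(word: str, syllables: List[str]) -> List[str]:
    """Cut `word` at the same offsets as `syllables`, keeping its casing."""
    parts, pos = [], 0
    for syllable in syllables:
        parts.append(word[pos:pos + len(syllable)])
        pos += len(syllable)
    return parts


def syllabify(
    word: str,
    dictionary: Dict[str, List[str]],
    fallback: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Look the word up in the dictionary; otherwise use the fallback.

    With no dictionary entry and no fallback, the word is returned unsplit.
    """
    entry = dictionary.get(word.lower())
    # Only trust the entry if its syllables cover the word exactly.
    if entry is not None and sum(len(s) for s in entry) == len(word):
        return _cut_like(word, entry)
    if fallback is not None:
        return fallback(word)
    return [word]


print(syllabify("Differently", SYLLABLE_DICT))  # ['Dif', 'fer', 'ent', 'ly']
print(syllabify("unknownword", SYLLABLE_DICT))  # ['unknownword']
```

A pattern-based hyphenator could be passed as `fallback`, so the dictionary only overrides the words it actually knows.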

rakuri255 commented 6 months ago

OK, something is broken. Thanks @DoubleDee73 for the examples.

UltraSinger actually already uses syllables and not simple hyphenation: hyphenator.Syllables(cleaned_string). The funny thing is that it returns different results depending on the language, and yet they are all wrong.

assert hyphenation("differently", Hyphenator("de_AT")) == ["dif", "fer", "ent", "ly"]
Expected :['dif', 'fer', 'ent', 'ly']
Actual :['dif', 'ferent', 'ly']
assert hyphenation("differently", Hyphenator("en_US")) == ["dif", "fer", "ent", "ly"]
Expected :['dif', 'fer', 'ent', 'ly']
Actual :['dif', 'fer', 'ently']
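A check like the one above can be generalized into a small harness that reports every mismatch instead of stopping at the first failed assertion. The sketch below is hypothetical: `find_mismatches` and the stub splitters are names invented for this example, and the stubs simply replay the two results reported above rather than calling PyHyphen.

```python
from typing import Callable, Dict, List, Tuple

Splitter = Callable[[str], List[str]]

# Expected syllables, taken from the assertions above.
EXPECTED: Dict[str, List[str]] = {
    "differently": ["dif", "fer", "ent", "ly"],
}


def stub_splitter(canned: Dict[str, List[str]]) -> Splitter:
    """Build a stand-in splitter that replays canned results (no PyHyphen)."""
    def split(word: str) -> List[str]:
        return canned.get(word, [word])
    return split


# Stand-ins reproducing the outputs reported for each language.
reported_de_at = stub_splitter({"differently": ["dif", "ferent", "ly"]})
reported_en_us = stub_splitter({"differently": ["dif", "fer", "ently"]})


def find_mismatches(
    splitter: Splitter, expected: Dict[str, List[str]]
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (word, expected, actual) for every word split incorrectly."""
    mismatches = []
    for word, want in expected.items():
        got = splitter(word)
        # A valid split must rejoin to the original word.
        if "".join(got) != word:
            raise ValueError(f"split of {word!r} does not rejoin: {got}")
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


for name, splitter in [("de_AT", reported_de_at), ("en_US", reported_en_us)]:
    for word, want, got in find_mismatches(splitter, EXPECTED):
        print(f"{name}: {word!r} expected {want}, got {got}")
```

To test the real library, a function wrapping the actual hyphenator call would replace the stubs.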

I need to check what the PyHyphen integration is actually doing there. It should be using the hyphenation data from LibreOffice.

@bohning thanks for the list. I will try to use it if I can't fix PyHyphen.

@mindtakerr thanks for the info about the howmanysyllables website. It makes checking easy and shows how syllables are actually formed.

rakuri255 commented 6 months ago

PyHyphen uses C code in the background to create syllables. It is not written in a maintenance-friendly way, and I think it makes a few mistakes.

In addition, the hyphenation pattern data from LibreOffice is converted from TeX data and also appears to be outdated.