r9y9 / nnmnkwii

Library to build speech synthesis systems designed for easy and fast prototyping.
https://r9y9.github.io/nnmnkwii/latest/

Improved support for labels #33

Closed karandwivedi42 closed 6 years ago

karandwivedi42 commented 7 years ago

Hi

Thanks for writing this useful library! I have been trying it for a few days and felt the need for better support for non-HTS labels.

It would be good to have something like this: https://github.com/facebookresearch/loop/blob/master/utils.py#L143 which does not depend on label files and uses nltk's cmudict to generate phonemes.

I can contribute if you guide me.

My current workaround is that I use merlin's scripts to generate test and train labels to use with your code.

r9y9 commented 7 years ago

Hi, thank you very much for your feedback. Yes, support for non-HTS labels would be great to have.

> https://github.com/facebookresearch/loop/blob/master/utils.py#L143 which does not depend on label files and uses nltk's cmudict to generate phonemes.

It seems that there's no nltk in utils.py? Could you elaborate on what you want?

FWIW, the reason I started writing support for HTS labels is that the merlin frontend assumes its input is HTS-style labels.

karandwivedi42 commented 7 years ago

Thanks!

In facebookresearch/loop, generate.py accepts any user sentence as input and then uses nltk to generate the phonemes.

karandwivedi42 commented 7 years ago

Also, I am curious how loop is able to generate good output even though it does not involve durations at any point.

r9y9 commented 7 years ago

Okay, I see. In that case, isn't nltk enough? It is just 10 lines of code. Also, I tend to think the library should be language-independent, whereas text2phone is highly language-dependent (it needs a phoneme dictionary). What do you think?
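For reference, a minimal sketch of the dictionary-lookup approach being discussed. It is not code from loop or from this library. `TOY_DICT`, `text_to_phonemes`, and the `<unk>` placeholder are names made up for illustration; the toy dictionary only mimics the shape of `nltk.corpus.cmudict.dict()`, which maps a lowercase word to a list of pronunciations.

```python
import re

# Toy stand-in for the CMU Pronouncing Dictionary (hypothetical, two entries).
# With nltk installed and the cmudict corpus downloaded, the real dictionary
# has the same shape:
#   from nltk.corpus import cmudict
#   pron_dict = cmudict.dict()
TOY_DICT = {
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "world": [["W", "ER1", "L", "D"]],
}


def text_to_phonemes(text, pron_dict, unknown="<unk>"):
    """Return a flat list of phoneme symbols for an English sentence.

    Words missing from the dictionary are replaced by `unknown`; this is the
    limitation of a pure lookup approach.
    """
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = []
    for word in words:
        pronunciations = pron_dict.get(word)
        if pronunciations:
            # Assumption: always take the first listed pronunciation.
            phonemes.extend(pronunciations[0])
        else:
            phonemes.append(unknown)
    return phonemes


phones = text_to_phonemes("Hello, world!", TOY_DICT)
print(phones)
```

Because the lookup is purely dictionary-based, the same function would work for another language only if a comparable pronunciation dictionary exists for it.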

karandwivedi42 commented 7 years ago

Yes, it is. I don't have much experience in speech, so I don't know if nltk (or a similar library) supports other languages.

I am still trying to understand how facebook's loop uses text and audio features. I think that the attention mechanism allows it to work without having forced alignment, which is why the dataset gives phonemes as input and audio_features as the target even though they have different shapes.

 ('phonemes', (21,)),
 ('audio_features', (279, 63)),

This type of processing completely removes the need to include merlin/hts/htk/sptk (we can use pyworld for audio feature extraction and synthesis, and nltk for text phonemes).

This sort of pipeline serves a somewhat different purpose (seq-to-seq models) from the ones in your notebooks/merlin (which have a one-to-one mapping between input and output), but I am sure it would be a good addition to your library, as both loop and parrot use something similar.

What do you think?

r9y9 commented 7 years ago

As far as I understand, loop uses raw text features, similar to Tacotron. The attention mechanism learns the alignment between raw text and audio features.

> This type of processing completely removes the need to include merlin/hts/htk/sptk (we can use pyworld for audio feature extraction and synthesis, and nltk for text phonemes).

You are right. I completely agree.

> This sort of pipeline serves a somewhat different purpose from the ones in your notebooks/merlin, but I am sure it would be a good addition to your library, as both loop and parrot use something similar.

I plan to take the end-to-end speech synthesis paradigm into account in the design (see #9, #3 for reference), so contributions for it are very welcome! Personally, from my experience working on Tacotron (https://github.com/r9y9/tacotron_pytorch), I didn't think there was any must-have functionality we should add, but I should probably think it over more carefully and also look at existing code bases, as you pointed out. Thank you!
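To illustrate why the shape mismatch mentioned earlier (21 phonemes against 279 audio frames) is not a problem, here is a small sketch of a soft alignment computed by attention. It uses generic scaled dot-product attention for brevity and is not taken from loop's code; the state vectors are random placeholders and the dimension of 8 is an arbitrary assumption.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(queries, keys):
    """Return a len(queries) x len(keys) weight matrix; each row sums to 1."""
    scale = math.sqrt(len(keys[0]))
    weights = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in keys
        ]
        weights.append(softmax(scores))
    return weights


rng = random.Random(0)
dim = 8                              # arbitrary feature size for the sketch
num_phonemes, num_frames = 21, 279   # sequence lengths from the example above

# Random placeholders standing in for encoder (text) and decoder (audio) states.
phoneme_states = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_phonemes)]
decoder_states = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]

# One row per audio frame, one column per phoneme.
alignment = dot_product_attention(decoder_states, phoneme_states)
print(len(alignment), len(alignment[0]))  # prints "279 21"
```

Each output frame gets its own distribution over the input phonemes, so no duration model or forced alignment is needed to pair the two sequences; in a trained model the weights are learned rather than random.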

karandwivedi42 commented 7 years ago

I agree that nltk's cmudict can only convert words in its dictionary, which is very limiting. However, it removes the festival dependency, which is a big plus. Is there any other way to convert text to phonemes without needing to use festival?

r9y9 commented 7 years ago

If you just need a character-level numeric representation of text, and not the structural information that festival can annotate, maybe https://github.com/keithito/tacotron/tree/master/text would be enough?

In [1]: from text import sequence_to_text, text_to_sequence

In [2]: sequence = text_to_sequence("Hello world", ["english_cleaners"])

In [3]: print(sequence)
[35, 32, 39, 39, 42, 64, 50, 42, 45, 39, 31, 1]

In [4]: print(sequence_to_text(sequence))
hello world~

EDIT: oops, sorry, I may have misunderstood your question. Tacotron uses a char-level representation, but loop uses a phoneme-level representation. The code above won't work for phonemes.