lucidrains / spear-tts-pytorch

Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in Pytorch
MIT License
249 stars 18 forks source link

I have bad results of backtraslation #14

Open nassimabenammar opened 9 months ago

nassimabenammar commented 9 months ago

Hello, I am testing your code and I am confused with the results of the backtranslation. Let me describe the configuration of TextToSemantic model, and pretraining and backtranslation trainers.

Preprocessing

Speech to speech trainer :

Results : loss 1.2, 200k steps

Backtranslation trainer :

text_to_semantic_model = TextToSemantic( dim = 256, num_text_token_ids = 79, text_pad_id = 0, num_semantic_token_ids = wav2vec.codebook_size, semantic_pad_id = 0, source_depth = 6, target_depth = 6, heads = 8, dim_head = 64, attn_dropout = 0.5, ff_mult = 2, ff_dropout = 0.5 )

Results : Reference text (audio 9s) : If you sit back a little from the table, and lay the mirror, face upwards, upon your lap, you can see, as you deal, every card that you give to your adversary.

After 47900 steps --> accuracy 92% Backtranslated text : YO be ro cal po cadyo gar frelaroro be ad fre t tle thea t te t fre.— Backtranslated tokens : [38, 28, 1, 43, 46, 1, 59, 56, 1, 44, 42, 53, 1, 57, 56, 1, 44, 42, 45, 66, 56, 1, 48, 42, 59, 1, 47, 59, 46, 53, 42, 59, 56, 59, 56, 1, 43, 46, 1, 42, 45, 1, 47, 59, 46, 1, 61, 1, 61, 53, 46, 1, 61, 49, 46, 42, 1, 61, 1, 61, 46, 1, 61, 1, 47, 59, 46, 9, 79]

After 10800 steps --> accuracy 89% 200 tokens

Backtranslated text : "Sh sitl fro del a tl tabl a tel tabe, a yo tabe ad yo te ad yo gil tel tabl, ad yo tel ad yo table able t yo te te tel tel te te tel te te te te te te tel the tele tel th the tele the yo del table.—

After 6200 steps --> accuracy 82% Backtranslated text : She sit Bacolitle from table, as you del, as you deal, hard you deal, as you del, hard you deal, as you del, and you del, hard you del, and you gave, you deal, and you deal, and you gave table, and you deal at you gaver at resary.—

Reference text (audio 4s) : This was not, as it may seem, merely a theory tinged with sarcasm.

After 10800 steps --> accuracy 89% Backtranslated text : Th was not ason.—

After 4600 steps --> accuracy 79% This was not, as it may seem, merely a thery, tinged with sarcasm.—

I have different issues :

I need your help to understand what's wrong with my training configuration.

lucasnewman commented 8 months ago

Can you share training/validation loss curves or the actual training code? It's hard to know the exact issue from the description.

A couple of initial thoughts: 1) Your learning rate is pretty warm and you may be overfitting depending on your dataset size, try dropping it an order of magnitude during the backtranslation task. 2) You might consider trimming/mapping some of the rarely-occurring graphemes in your list to reduce the prediction space for the task and balance the classes more evenly.