I have bad results of backtranslation

Hello, I am testing your code and I am confused by the results of the backtranslation. Let me describe the configuration of the TextToSemantic model and of the pretraining and backtranslation trainers.

Preprocessing

target_sample_hz : 24_000
Wav2vec : hubert_base_ls960_L9_km500.bin
data_max_length_seconds = 10
SemanticDataset with max_length= int(data_max_length_seconds * target_sample_hz),

Speech to speech trainer :

Batchsize 64
Grad accum every : 1
Learning rate : initial_lr=1e-5, lr = 1e-3,

Results : loss 1.2, 200k steps

Backtranslation trainer :

grapheme_dict = {' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '/': 10, ':': 11, ';': 12, '?': 13, 'A': 14, 'B': 15, 'C': 16, 'D': 17, 'E': 18, 'F': 19, 'G': 20, 'H': 21, 'I': 22, 'J': 23, 'K': 24, 'L': 25, 'M': 26, 'N': 27, 'O': 28, 'P': 29, 'Q': 30, 'R': 31, 'S': 32, 'T': 33, 'U': 34, 'V': 35, 'W': 36, 'X': 37, 'Y': 38, 'Z': 39, '[': 40, ']': 41, 'a': 42, 'b': 43, 'c': 44, 'd': 45, 'e': 46, 'f': 47, 'g': 48, 'h': 49, 'i': 50, 'j': 51, 'k': 52, 'l': 53, 'm': 54, 'n': 55, 'o': 56, 'p': 57, 'q': 58, 'r': 59, 's': 60, 't': 61, 'u': 62, 'v': 63, 'w': 64, 'x': 65, 'y': 66, 'z': 67, '{': 68, '}': 69, '¯': 70, 'æ': 71, 'è': 72, 'é': 73, 'ê': 74, 'ñ': 75, 'ò': 76, 'ô': 77, 'œ': 78, '—': 79}
Batchsize 32
Grad accum every : 2
Learning rate : initial_lr=1e-5, lr = 1e-3,
Restore_optimizer False

text_to_semantic_model = TextToSemantic( dim = 256, num_text_token_ids = 79, text_pad_id = 0, num_semantic_token_ids = wav2vec.codebook_size, semantic_pad_id = 0, source_depth = 6, target_depth = 6, heads = 8, dim_head = 64, attn_dropout = 0.5, ff_mult = 2, ff_dropout = 0.5 )
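One thing that may be worth checking in this configuration: grapheme_dict uses ids 1 to 79 and 0 is reserved for padding, so 80 distinct ids are in use, while num_text_token_ids is set to 79. The sketch below only does that arithmetic. It assumes (this is not confirmed from the library) that valid text ids are expected to lie in the range 0 to num_text_token_ids - 1; the variable names ids_needed and fits are introduced here for illustration.

```python
import string

# Rebuild the grapheme_dict from the post: ids 1..79, with 0 left for padding.
chars = (
    " !\"'(),-./:;?"
    + string.ascii_uppercase
    + "[]"
    + string.ascii_lowercase
    + "{}¯æèéêñòôœ—"
)
grapheme_dict = {ch: i for i, ch in enumerate(chars, start=1)}

text_pad_id = 0
num_text_token_ids = 79  # value passed to TextToSemantic in the post

max_id = max(grapheme_dict.values())

# ASSUMPTION: the model expects text ids in 0 .. num_text_token_ids - 1.
# Under that assumption, the vocabulary must cover max_id + 1 ids (pad included).
ids_needed = max_id + 1
fits = ids_needed <= num_text_token_ids

print(f"largest grapheme id: {max_id}")
print(f"ids needed (including pad): {ids_needed}")
print(f"fits in num_text_token_ids={num_text_token_ids}: {fits}")
```

If the assumption holds, the highest id (79) would fall outside the range, or could collide with any special id the model places at that index. If the library handles the offset internally, this check is harmless.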

Results :

Reference text (audio 9s) : If you sit back a little from the table, and lay the mirror, face upwards, upon your lap, you can see, as you deal, every card that you give to your adversary.

After 47900 steps --> accuracy 92%

Backtranslated text : YO be ro cal po cadyo gar frelaroro be ad fre t tle thea t te t fre.—

Backtranslated tokens : [38, 28, 1, 43, 46, 1, 59, 56, 1, 44, 42, 53, 1, 57, 56, 1, 44, 42, 45, 66, 56, 1, 48, 42, 59, 1, 47, 59, 46, 53, 42, 59, 56, 59, 56, 1, 43, 46, 1, 42, 45, 1, 47, 59, 46, 1, 61, 1, 61, 53, 46, 1, 61, 49, 46, 42, 1, 61, 1, 61, 46, 1, 61, 1, 47, 59, 46, 9, 79]
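For reference, the mapping from the token ids above back to text can be reproduced by inverting grapheme_dict. This is a minimal sketch, not the decoding code used by the library; the helper name decode is introduced here for illustration, and it simply skips the pad id 0.

```python
import string

# Rebuild the grapheme_dict from the post: ids 1..79, with 0 left for padding.
chars = (
    " !\"'(),-./:;?"
    + string.ascii_uppercase
    + "[]"
    + string.ascii_lowercase
    + "{}¯æèéêñòôœ—"
)
grapheme_dict = {ch: i for i, ch in enumerate(chars, start=1)}
id_to_grapheme = {i: ch for ch, i in grapheme_dict.items()}


def decode(token_ids, pad_id=0):
    """Map ids back to characters, dropping padding (hypothetical helper)."""
    return "".join(id_to_grapheme[t] for t in token_ids if t != pad_id)


# Token ids of the 47900-step sample, copied from the post.
tokens = [38, 28, 1, 43, 46, 1, 59, 56, 1, 44, 42, 53, 1, 57, 56, 1, 44, 42,
          45, 66, 56, 1, 48, 42, 59, 1, 47, 59, 46, 53, 42, 59, 56, 59, 56, 1,
          43, 46, 1, 42, 45, 1, 47, 59, 46, 1, 61, 1, 61, 53, 46, 1, 61, 49,
          46, 42, 1, 61, 1, 61, 46, 1, 61, 1, 47, 59, 46, 9, 79]

decoded = decode(tokens)
print(decoded)
```

Every backtranslated sample in this post ends with id 79. This sketch cannot tell whether that is a genuinely predicted character or a special id (for example an end-of-sequence marker) that happens to share the index, but it may be worth checking.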

After 10800 steps --> accuracy 89% 200 tokens

Backtranslated text : "Sh sitl fro del a tl tabl a tel tabe, a yo tabe ad yo te ad yo gil tel tabl, ad yo tel ad yo table able t yo te te tel tel te te tel te te te te te te tel the tele tel th the tele the yo del table.—

After 6200 steps --> accuracy 82%

Backtranslated text : She sit Bacolitle from table, as you del, as you deal, hard you deal, as you del, hard you deal, as you del, and you del, hard you del, and you gave, you deal, and you deal, and you gave table, and you deal at you gaver at resary.—

Reference text (audio 4s) : This was not, as it may seem, merely a theory tinged with sarcasm.

After 10800 steps --> accuracy 89%

Backtranslated text : Th was not ason.—

After 4600 steps --> accuracy 79%

Backtranslated text : This was not, as it may seem, merely a thery, tinged with sarcasm.—

I see several issues :

The accuracy increases continuously but does not reflect the quality of these samples.
Also, even for the 4s sample, the result is acceptable at some steps (4600, 8200) and bad at steps in between (5700, 6200).
Even though the model has been trained on wav files with a maximum length of 10 seconds, samples of 7 to 10 seconds always have bad backtranslation results, whatever the accuracy.
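To put a number on the gap between the reported accuracy and the sample quality, one option is to score the generated text directly against the reference with a character error rate (CER). This is a metric swapped in for illustration, not something the trainer reports. The sketch assumes the trailing final character of each sample is not part of the transcription and strips it; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


# 4s sample from the post. ASSUMPTION: the trailing final character is a
# decoding artefact, so it is stripped before scoring.
reference = "This was not, as it may seem, merely a theory tinged with sarcasm."
samples = {
    4600: "This was not, as it may seem, merely a thery, tinged with sarcasm.—",
    10800: "Th was not ason.—",
}

scores = {step: cer(reference, text.rstrip("—")) for step, text in samples.items()}
for step, score in sorted(scores.items()):
    print(f"step {step}: CER = {score:.3f}")
```

On these two samples the CER is lower at 4600 steps than at 10800 steps, the opposite of the accuracy trend. If the trainer's accuracy is computed with the ground-truth prefix fed at every position (an assumption about the trainer), it can keep rising while free-running generation gets worse, so tracking a generation-based score on a few held-out samples may help pick checkpoints.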

I need your help to understand what's wrong with my training configuration.

lucidrains / spear-tts-pytorch
