huggingface/transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

SpeechT5 TTS wrong attention mask handling when padding the input text #25908

Closed: Spycsh closed this issue 10 months ago

Spycsh commented 1 year ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

This can be quickly reproduced on a laptop, but I think the platform is irrelevant.

  1. git clone https://github.com/Spycsh/minimal-speecht5-pad-bug.git
  2. conda create -n minimal-speecht5-bug (and activate it)
  3. conda install python==3.10
  4. pip install -r requirements.txt
  5. python main.py

Then you will find an output.wav that is correctly generated from the input text. Next, comment out line 12 and uncomment line 14, which pads the input to a fixed length (with pad_token_id=1), and generate the audio again. This time the output audio is very bad and includes many trailing words that do not belong to the original text. You can change the input text in the code to see that this happens generally.
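
Concretely, the contrast is between these two processor calls (the same ones used in the batched test later in this thread):

# line 12: no padding, produces correct audio
inputs = processor(text=text, return_tensors="pt")
# line 14: pad to a fixed length of 128, produces broken audio
inputs = processor(text=text, padding='max_length', max_length=128, return_tensors="pt")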

Expected behavior

As mentioned above, the output should be identical and correct with or without padding: the padded tokens should not be treated as real input tokens (i.e. the attention mask should handle them). I think this is a bug in https://github.com/huggingface/transformers/blob/main/src/transformers/models/speecht5/modeling_speecht5.py#L2542, where the padded-input case is not taken into account.

A quick fix is to replace encoder_attention_mask = torch.ones_like(input_values) with encoder_attention_mask = (1 - (input_values == 1).int()) in the _generate_speech method.
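
For illustration, a minimal sketch of the difference (the token ids here are made up; pad_token_id=1 as in the reproduction above):

import torch

input_values = torch.tensor([[4, 8, 15, 16, 1, 1, 1]])  # hypothetical ids, right-padded with pad_token_id=1

# current behavior: attend to every position, including the padding
encoder_attention_mask = torch.ones_like(input_values)  # tensor([[1, 1, 1, 1, 1, 1, 1]])

# quick fix: mask out the positions that hold the pad token
encoder_attention_mask = 1 - (input_values == 1).int()  # tensor([[1, 1, 1, 1, 0, 0, 0]])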

I am actually using this padding to satisfy a "static" input shape for acceleration on Intel Habana Gaudi2 (an Intel accelerator), and it truly performs better with the padding added. This fix may not matter to normal users who do not need "static"-shape inputs, but I think it would make the model more robust across usages.

Please let me know if there are any mistakes. If you think it's truly an issue, I'd love to make a PR to fix it :)

ArthurZucker commented 1 year ago

Feel free to open a PR, I think that a minimal change like this could be good! cc @sanchit-gandhi

sanchit-gandhi commented 1 year ago

Would it be cleaner if we just passed return_attention_mask=True to the feature extractor:

inputs = feature_extractor(raw_speech, sampling_rate=sampling_rate, return_attention_mask=True, return_tensors="pt", padding=True)

And then forwarded this attention mask to the ._generate_speech method:

waveform = model.generate(inputs.input_values, attention_mask=inputs.attention_mask)

IMO this is more robust than masking everywhere we have zeros in the input waveform.

The thing we then need to check is that we get consistent generations when we do batched generation, i.e. if we pass two audios of different length to the model, does it mask the inputs correctly during generation?
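
Something like the following sketch of that check (reusing the names from the snippet above; speech_a and speech_b stand for two raw audios of different length):

# pad the two audios together and generate in one batch
inputs = feature_extractor([speech_a, speech_b], sampling_rate=sampling_rate, return_attention_mask=True, return_tensors="pt", padding=True)
batched = model.generate(inputs.input_values, attention_mask=inputs.attention_mask)

# generate the shorter audio on its own (no padding) and check it matches the batched output
alone = model.generate(feature_extractor(speech_a, sampling_rate=sampling_rate, return_tensors="pt").input_values)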

Think this would make for a nice PR! cc @ylacombe since you've worked on SpeechT5 generate recently - could you guide this integration?

ylacombe commented 1 year ago

Yes, with pleasure! @Spycsh, would you like to work on this PR? Let me know if you have any questions!

Spycsh commented 1 year ago

Hi @ylacombe, thanks for the guidance. Yes, I've made a PR and here it is: https://github.com/huggingface/transformers/pull/25943.

I think I only made a minimal change to the modeling_speecht5.py code so that the attention mask works correctly when the input text is padded. I did not use a feature_extractor as sanchit suggested above, because _generate_speech and generate_speech expect users to pass only the input_ids, not inputs with an attention mask. If we run the feature_extractor before invoking model.generate_speech, we have to change the generate_speech parameters; and if we run it after model.generate_speech, we only have the input_ids and no longer have the needed raw_speech, or any record of whether it was padded with padding=True.

Similarly, for batched generation, I wrote the following code as a test (a simple update to https://github.com/Spycsh/minimal-speecht5-pad-bug):

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
speaker_embedding = torch.load("speaker_embeddings/spk_embed_default.pt")

def text2speech(text, output_audio_path):
    # The original call, which correctly converts the text to audio:
    # inputs = processor(text=text, return_tensors="pt")
    # The padded call, which produces wrong audio because the attention mask
    # is not handled correctly in modeling_speecht5.py:
    inputs = processor(text=text, padding='max_length', max_length=128, return_tensors="pt")
    print(inputs["input_ids"].shape)    # torch.Size([1, 128])
    batched_inputs = torch.cat((inputs["input_ids"], inputs["input_ids"]))
    print(batched_inputs.shape) # torch.Size([2, 128])

    with torch.no_grad():
        spectrogram = model.generate_speech(batched_inputs, speaker_embedding)
        speech = vocoder(spectrogram)
    sf.write(output_audio_path, speech.cpu().numpy(), samplerate=16000)

if __name__ == "__main__":
    text = "I have a dream."
    text2speech(text, "output.wav")

It runs into the error ValueError: Attention mask should be of size (1, 1, 1, 256), but is torch.Size([2, 1, 1, 128]) at https://github.com/huggingface/transformers/blob/main/src/transformers/models/speecht5/modeling_speecht5.py#L1001.

I am not sure whether batched generation works correctly in the original implementation of SpeechT5; I could not find an example of it in https://huggingface.co/blog/speecht5.

To make my PR more robust, I changed encoder_attention_mask = (1 - (input_values==1).int()) to encoder_attention_mask = (1 - (input_values==model.config.pad_token_id).int()).

That's all of my changes to the code. Thanks for any further suggestions, in case I missed something or there are better solutions :)

sanchit-gandhi commented 1 year ago

Hey @Spycsh - we should update the .generate method to accept an attention mask as one of the keyword arguments. This attention mask will indicate to the model where the inputs have been padded, and thus where to ignore them when running inference. The user can then pass, for example:
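
A minimal sketch of that call (using the generate_speech call from the reproduction above, with the mask taken straight from the processor; the exact signature is settled in the PR):

inputs = processor(text=text, padding='max_length', max_length=128, return_tensors="pt")
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embedding, attention_mask=inputs["attention_mask"])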

Spycsh commented 1 year ago

Hi @sanchit-gandhi , I added that in the latest commit in https://github.com/huggingface/transformers/pull/25943. We could continue this discussion in that PR.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sanchit-gandhi commented 10 months ago

Closed by #25943.