huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

attention_mask #62

Open netagl opened 4 months ago

netagl commented 4 months ago

Hi, I have an attention_mask shape mismatch problem in the cross attention.

Can you please explain this line: `requires_attention_mask = "encoder_outputs" not in model_kwargs`?

Why does it come after this check, `if "encoder_outputs" not in model_kwargs:`, where encoder_outputs are created and added to model_kwargs:

        model_kwargs = self._prepare_text_encoder_kwargs_for_generation(
            inputs_tensor,
            model_kwargs,
            model_input_name,
            generation_config,
        )
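
To make the ordering I am asking about concrete, here is a toy illustration with a plain dict (no model involved, just the boolean logic as I read the code):

    # toy illustration of the ordering (plain dict, no model)
    model_kwargs = {}

    if "encoder_outputs" not in model_kwargs:
        # in generate(), this branch is where encoder_outputs get created and stored
        model_kwargs["encoder_outputs"] = object()  # stand-in for the real encoder outputs

    # by the time this line runs, the key always exists, so the flag is always False
    requires_attention_mask = "encoder_outputs" not in model_kwargs
    print(requires_attention_mask)  # prints: False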

Is the attention mask needed for the cross attention layer in the generation part? This mismatch problem occurs only in the generator; train & eval are fine.

Thanks!

netagl commented 4 months ago

I think it might be related to this:

        encoder_attention_mask = _prepare_4d_attention_mask(
            encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
        )

The comment there says:

`[bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]`

but `_prepare_4d_attention_mask` returns src_seq_len as 1. Is this what you meant? Because it does not play well with the cross attention check:

    if attention_mask is not None:
        if attention_mask.size() != (bsz, 1, tgt_len, src_len):
            raise ValueError(
                f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
            )
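
For reference, a standalone call gives me the documented 4D shape, which is why the runtime mismatch confuses me. A small self-contained check (in my environment the helper lives in transformers.modeling_attn_mask_utils):

    import torch
    from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask

    bsz, src_len, tgt_len = 2, 20, 1  # tgt_len is 1 per decoding step during generation
    mask_2d = torch.ones(bsz, src_len, dtype=torch.long)  # [bsz, seq_len]

    mask_4d = _prepare_4d_attention_mask(mask_2d, torch.float32, tgt_len=tgt_len)
    print(mask_4d.shape)  # torch.Size([2, 1, 1, 20]) -> [bsz, 1, tgt_seq_len, src_seq_len]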

I would appreciate some help, @ylacombe.

ylacombe commented 3 months ago

Hey @netagl, thanks for your message! I'm not sure I understand your issue; could you send a code snippet that reproduces it?

The attention mask is needed in the cross attention layer if you have a batch of samples; otherwise you don't need to pass it to the model!
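
Something along these lines for a batch, following the README-style usage (the checkpoint name and texts below are just placeholders, adapt them to your setup):

    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    repo_id = "parler-tts/parler_tts_mini_v0.1"  # placeholder checkpoint
    model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)

    descriptions = ["A calm female voice.", "A fast, energetic male voice."]
    prompts = ["Hello there!", "How are you doing today?"]

    # padding a batch is exactly the case where the attention masks matter
    desc = tokenizer(descriptions, return_tensors="pt", padding=True)
    prompt = tokenizer(prompts, return_tensors="pt", padding=True)

    generation = model.generate(
        input_ids=desc.input_ids,
        attention_mask=desc.attention_mask,
        prompt_input_ids=prompt.input_ids,
        prompt_attention_mask=prompt.attention_mask,
        do_sample=True,
    )
    print(generation.shape)  # generated audio, one row per batch item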

kdcyberdude commented 3 months ago

@netagl, is your `audio_encoder_per_device_batch_size` set to 1?