huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Issue related to dtype with F.conv1d in Whisper evaluation #30673

Closed moncefbenaicha closed 4 months ago

moncefbenaicha commented 6 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

I use the standard Whisper fine-tuning pipeline, similar to the one @sanchit-gandhi published at https://huggingface.co/blog/fine-tune-whisper.

The problem arises when I use the argument predict_with_generate=True. The script crashes during evaluation, specifically in the encoder forward pass, and raises this exception:

return F.conv1d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

The error never shows up during training steps, only when evaluation starts or when trainer.evaluate() is called.

I did some debugging to check the dtypes and devices right before F.conv1d is called, and this is what I got:

No exception:

- weight.dtype: torch.bfloat16
- bias.dtype: torch.bfloat16
- input.device: device(type='cuda', index=0)
- weight.device: device(type='cuda', index=0)

Raises RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same:

- weight.dtype: torch.bfloat16
- bias.dtype: torch.bfloat16
- input.device: device(type='cuda', index=0)
- weight.device: device(type='cuda', index=0)
- bias.device: device(type='cuda', index=0)
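
For context, here is a minimal standalone sketch (assuming a CUDA device; the layer sizes are illustrative) showing that a bfloat16 Conv1d fed a float32 input raises exactly this error, and that casting the input resolves it:

import torch

# Minimal sketch: bf16 conv weights/bias + float32 input reproduces the error
conv = torch.nn.Conv1d(in_channels=80, out_channels=384, kernel_size=3, padding=1)
conv = conv.to(device="cuda", dtype=torch.bfloat16)

x = torch.randn(1, 80, 3000, device="cuda")  # float32 by default, like the processor output

try:
    conv(x)  # dispatches to F.conv1d(input, weight, bias, ...) internally
except RuntimeError as e:
    print(e)  # Input type (float) and bias type (c10::BFloat16) should be the same

out = conv(x.to(torch.bfloat16))  # casting the input to bf16 makes the call succeed
print(out.dtype)  # torch.bfloat16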

Expected behavior

Consistent behavior between the training and evaluation passes.

LysandreJik commented 6 months ago

cc @sanchit-gandhi @ylacombe

moncefbenaicha commented 6 months ago

Update

A temporary workaround is to cast the input features to bfloat16 and disable flash_attention:

batch = self.processor(
    audio=audio_arrays,
    sampling_rate=16000,
    padding="max_length",
    return_tensors="pt",
)
# The processor returns float32 features; cast them to match the bf16 model weights
batch["input_features"] = batch["input_features"].to(dtype=torch.bfloat16)
ylacombe commented 6 months ago

Hey @moncefbenaicha, your temporary solution is actually the right one, since the processor only outputs torch.float32 arrays!

However, I do believe it should work with Flash Attention. Did you get an error using bfloat16 and FA?
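
For what it's worth, a tiny sketch to verify the float32 claim above (the checkpoint name is a placeholder, not necessarily the one used in this issue):

import numpy as np
from transformers import WhisperProcessor

# The processor's input_features come back as float32 regardless of the model's dtype
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
audio = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
batch = processor(audio=audio, sampling_rate=16000, return_tensors="pt")
print(batch["input_features"].dtype)  # torch.float32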

sanchit-gandhi commented 5 months ago

Hey @moncefbenaicha - it would be great to see:

  1. How you're instantiating the model with .from_pretrained. Specifically, what arguments you're passing for attn_implementation and torch_dtype, and whether you're moving the model to a torch device manually
  2. The training args you're using. Specifically, what you set for fp16, bf16, fp16_full_eval and bf16_full_eval

Passing bf16_full_eval=True might be of interest to you if you're casting the model weights to bf16 manually.
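
For reference, a minimal sketch of the settings being asked about (the checkpoint name and chosen values are placeholders, not the reporter's actual configuration):

import torch
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

# Placeholder configuration illustrating the arguments mentioned above
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",                   # placeholder checkpoint
    torch_dtype=torch.bfloat16,               # load the weights directly in bf16
    attn_implementation="flash_attention_2",  # or "sdpa" / "eager"
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    bf16=True,                  # bf16 mixed precision during training steps
    bf16_full_eval=True,        # run evaluation fully in bf16 instead of fp32
    predict_with_generate=True,
)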

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.