huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

speed up whisper compile time #34313

Open jsoto-gladia opened 2 days ago

jsoto-gladia commented 2 days ago

Feature request

After torch-compiling the Whisper text decoder, the inference time is impressively low. Thank you for the work! However, the warm-up time is very long, since compilation has to run through every generated sequence length (up to the maximum of 448 tokens).

How can I reduce this time? I have looked into storing the compiled model with PyTorch, but that does not seem to be supported. I have also tried compiling with torch_tensorrt, but I get the error `EncoderDecoderCache encountered in the dynamo_compile input parsing`.

Motivation

The start-up time can take around 10 minutes for a large model.

Your contribution

Happy to do a PR, but I need guidance.

LysandreJik commented 1 day ago

cc @ylacombe @eustlb

KyleBrian commented 10 hours ago

> How can I reduce this time? I have looked into storing the compiled model with PyTorch, but that does not seem to be supported. I have also tried compiling with torch_tensorrt, but I get the error `EncoderDecoderCache encountered in the dynamo_compile input parsing`.

Explore alternative methods to serialize the model, such as exporting it with TorchScript or ONNX, and inspect the model architecture to ensure compatibility with TensorRT.
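For reference, a minimal sketch of the ONNX route via Optimum's ONNX Runtime integration (assuming `optimum[onnxruntime]` is installed; the model name and output directory are just placeholders):

```python
# Sketch only: export Whisper to ONNX instead of relying on torch.compile.
# This sidesteps the CUDA-graph warm-up entirely, at the cost of giving up
# torch.compile's kernel fusion.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_name = "openai/whisper-tiny"  # placeholder checkpoint
processor = WhisperProcessor.from_pretrained(model_name)
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(model_name, export=True)

# The exported model can be saved and reloaded later without re-tracing.
ort_model.save_pretrained("whisper-tiny-onnx")
```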

eustlb commented 10 hours ago

Hey,

Indeed, when using torch.compile modes that capture CUDA graphs (`reduce-overhead` and `max-autotune`), we currently face the issue that a graph is captured for each generated sequence length. As you pointed out, for this reason it is necessary to warm up with `min_new_tokens=max_new_tokens=448`.
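For illustration, a minimal sketch of that warm-up step (assuming `model` has already been compiled with a static cache and `input_features` is a preprocessed mel spectrogram on the same device, as in the snippet further down the thread):

```python
# Sketch: one-off warm-up so CUDA graphs are captured for every decode length.
_ = model.generate(
    input_features,
    min_new_tokens=448,  # force generation to run through all 448 positions
    max_new_tokens=448,
)
# Subsequent generate() calls of any length can then reuse the captured graphs.
```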

You can cache the compiled FX graphs, which makes the warm-up of the default compile mode about 3x faster. However, there is still no way to cache the captured CUDA graphs across processes, so this long warm-up is required each time you rerun the Python process.
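A minimal sketch of enabling that FX graph cache (these are the documented Inductor switches; the cache directory is just an example path):

```python
# Sketch: persist Inductor's compiled FX graphs across Python processes.
import os

# Environment switches read by Inductor at compile time.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/whisper_inductor_cache"  # example path

# The same toggle is also exposed as an in-process config flag.
import torch._inductor.config as inductor_config
inductor_config.fx_graph_cache = True

# Note: this only caches compiled FX graphs; the CUDA graphs captured by
# mode="reduce-overhead" still have to be re-captured in every new process.
```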

This is a known issue that has already been looked into.

jsoto-gladia commented 6 hours ago

Hello, thanks for your response. I am observing the described behaviour; however, I see that the shapes at each iteration are always the same (except for the first one). Why am I still getting a warm-up?

    step 1: value.last_hidden_state     -> torch.Size([1, 1500, 384])
            value.self_attention_cache  -> torch.Size([1, 6, 203, 64])
            value.cross_attention_cache -> torch.Size([1, 6, 1500, 64])
            decoder_input_ids shape     -> torch.Size([1, 3])
            cache_position shape        -> torch.Size([3])
            decoder_attention_mask: None, attentions: None

    step 2: value.last_hidden_state     -> torch.Size([1, 1500, 384])
            value.self_attention_cache  -> torch.Size([1, 6, 203, 64])
            value.cross_attention_cache -> torch.Size([1, 6, 1500, 64])
            decoder_input_ids shape     -> torch.Size([1, 1])
            cache_position shape        -> torch.Size([1])
            decoder_attention_mask: None, attentions: None

    step 3: value.last_hidden_state     -> torch.Size([1, 1500, 384])
            value.self_attention_cache  -> torch.Size([1, 6, 203, 64])
            value.cross_attention_cache -> torch.Size([1, 6, 1500, 64])
            decoder_input_ids shape     -> torch.Size([1, 1])
            cache_position shape        -> torch.Size([1])
            decoder_attention_mask: None, attentions: None

```python
import torch
from transformers import (
    WhisperForConditionalGeneration,
    WhisperTokenizerFast,
    set_seed,
)

torch_device = "cuda"
set_seed(0)

# load model
model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(
    model_name, attn_implementation="sdpa", torch_dtype=torch.float16
)
tokenizer = WhisperTokenizerFast.from_pretrained(model_name)
model.to(torch_device)

# compile the forward pass with CUDA-graph capture and a static KV cache
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.cache_implementation = "static"

# `_load_datasamples` is a helper from the transformers test suite
input_speech = self._load_datasamples(1)
```
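One way to check what is still being (re)compiled despite the stable shapes is to turn on torch.compile's recompilation logging; a minimal sketch, assuming the setup above:

```python
# Sketch: ask torch.compile to report why it recompiles between generate() calls.
import torch

# Equivalent to launching the process with TORCH_LOGS="recompiles".
torch._logging.set_logs(recompiles=True)

# Re-running model.generate(...) will now log each recompilation together with
# the guard that failed (e.g. the shape change on the first decoding step).
```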

jsoto-gladia commented 6 hours ago

I tried to explore exporting the compiled graphs, but the issue is still open: https://github.com/pytorch/pytorch/issues/101107
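For completeness, a minimal sketch of what `torch.export` can already serialize today (with a placeholder toy module, not the Whisper decoder): the exported graph can be saved and reloaded across processes, but the CUDA graphs captured by `reduce-overhead` cannot.

```python
# Sketch: torch.export serializes a traced graph to disk and reloads it in
# another process; it does not carry over Inductor's captured CUDA graphs.
import torch

class Toy(torch.nn.Module):  # placeholder module for illustration only
    def forward(self, x):
        return torch.relu(x)

ep = torch.export.export(Toy(), (torch.randn(2, 3),))
torch.export.save(ep, "toy.pt2")

ep_loaded = torch.export.load("toy.pt2")
print(ep_loaded.module()(torch.randn(2, 3)).shape)
```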