Open jsoto-gladia opened 2 days ago
cc @ylacombe @eustlb
> How can I reduce this time? (I have looked into storing the compiled model with PyTorch, but it does not seem to be supported.) (I tried compiling with torch_tensorrt, but I get the error "EncoderDecoderCache encountered in the dynamo_compile input parsing".)

Explore alternative ways to serialize the model, such as exporting it with TorchScript or ONNX, and inspect the model architecture to ensure compatibility with TensorRT.
Hey,
Indeed, when using torch.compile modes that capture CUDA graphs (`reduce-overhead` and `max-autotune`), we currently face the issue that a graph is captured for each generated sequence length. As you pointed out, this makes it necessary to warm up with `min_new_tokens=max_new_tokens=448`.
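A minimal sketch of such a warm-up call, assuming a `generate`-style callable that accepts `min_new_tokens` and `max_new_tokens`. The helper name `warm_up` and the `input_features` variable in the usage comment are hypothetical and do not come from the thread:

```python
import time
from typing import Any, Callable

MAX_NEW_TOKENS = 448  # decoding limit mentioned in this thread


def warm_up(generate_fn: Callable[..., Any], *args: Any,
            max_new_tokens: int = MAX_NEW_TOKENS, **kwargs: Any) -> float:
    """Run one full-length generation and return the elapsed seconds.

    Pinning min_new_tokens to max_new_tokens keeps decoding from stopping
    early, so every decoding step up to the limit runs during warm-up.
    """
    start = time.perf_counter()
    generate_fn(*args, min_new_tokens=max_new_tokens,
                max_new_tokens=max_new_tokens, **kwargs)
    return time.perf_counter() - start


# Hypothetical usage with the compiled model from this thread:
# seconds = warm_up(model.generate, input_features)
```

Returning the elapsed time makes it easy to compare the warm-up cost between a cold start and a run that reuses cached compilation artifacts.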
You can cache the compiled FX graphs, which makes the default compile mode warmup about 3x faster. However, there’s still no way to cache CUDA graphs captured between processes, so this long warmup is required each time you rerun the Python process.
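One way to turn on the FX graph cache is through TorchInductor's environment variables. The sketch below assumes a recent PyTorch 2.x release that honors `TORCHINDUCTOR_FX_GRAPH_CACHE` and `TORCHINDUCTOR_CACHE_DIR`. The helper name and the example directory are hypothetical, and setting the variables before importing `torch` is the cautious choice:

```python
import os


def enable_fx_graph_cache(cache_dir: str) -> None:
    """Ask TorchInductor to persist compiled FX graphs in cache_dir.

    Call this before `import torch` so the settings are picked up.
    This does not persist CUDA graphs; those are still captured per process.
    """
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = cache_dir


# Hypothetical usage (the path is only an example):
# enable_fx_graph_cache("/tmp/inductor_cache_whisper")
# import torch
```

Pointing the cache at a directory that survives restarts (for example a mounted volume) is what lets a new Python process reuse the compiled graphs.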
This is a known issue that has been looked into:
https://github.com/pytorch/pytorch/issues/128424
I’ll dive deeper to see if there’s a solution we can apply here. Feel free to explore it further as well!
Hello, thanks for your response. I am observing the behaviour you describe. However, the shapes at each iteration are always the same (except for the first one), so why am I still getting a warm-up?
step 1: value.last_hidden_state -> torch.Size([1, 1500, 384]) value.self_attention_cache -> torch.Size([1, 6, 203, 64]) value.cross_attention_cache -> torch.Size([1, 6, 1500, 64]) decoder_input_ids shape: torch.Size([1, 3]) cache_position shape: torch.Size([3]) 'decoder_attention_mask': None attentions=None
step 2: value.last_hidden_state -> torch.Size([1, 1500, 384]) value.self_attention_cache -> torch.Size([1, 6, 203, 64]) value.cross_attention_cache -> torch.Size([1, 6, 1500, 64]) decoder_input_ids shape: torch.Size([1, 1]) cache_position shape: torch.Size([1]) 'decoder_attention_mask': None attentions=None
step 3: value.last_hidden_state -> torch.Size([1, 1500, 384]) value.self_attention_cache -> torch.Size([1, 6, 203, 64]) value.cross_attention_cache -> torch.Size([1, 6, 1500, 64]) decoder_input_ids shape: torch.Size([1, 1]) cache_position shape: torch.Size([1]) 'decoder_attention_mask': None attentions=None
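```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizerFast, set_seed

torch_device = "cuda"
set_seed(0)

# load model
model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(
    model_name, attn_implementation="sdpa", torch_dtype=torch.float16)
tokenizer = WhisperTokenizerFast.from_pretrained(model_name)
model.to(torch_device)

# compile the forward pass with CUDA graphs and use a static KV cache
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.cache_implementation = "static"

# _load_datasamples is a helper on the surrounding test class (not shown here)
input_speech = self._load_datasamples(1)
```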
I tried to explore exporting the compiled graphs, but that issue is still open: https://github.com/pytorch/pytorch/issues/101107
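For the question above about identical shapes still triggering a warm-up, one possible diagnostic is to turn on torch.compile logging and check whether the slow steps coincide with recompilations. This is a sketch and not a confirmed explanation of the behaviour: it assumes a PyTorch 2.x release that reads the `TORCH_LOGS` environment variable, and the helper name is hypothetical.

```python
import os


def enable_compile_diagnostics() -> None:
    """Request recompilation and graph-break logs from torch.compile.

    Set this before `import torch`; it is equivalent to launching the
    script with TORCH_LOGS="recompiles,graph_breaks" in the shell.
    """
    os.environ["TORCH_LOGS"] = "recompiles,graph_breaks"


enable_compile_diagnostics()
# import torch  # import only after the variable is set
```

These logs only report Dynamo recompilations and graph breaks. If none show up during the slow steps, that would be consistent with the time going into CUDA graph capture rather than into recompilation.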
Feature request
After compiling the whisper.text_decoder model with torch.compile, the inference time is extremely low. Thank you for the work! However, the warm-up time is very long, since it needs to go through all logits (up to a maximum of 448).
How can I reduce this time? (I have looked into storing the compiled model with PyTorch, but it does not seem to be supported.) (I tried compiling with torch_tensorrt, but I get the error "EncoderDecoderCache encountered in the dynamo_compile input parsing".)
Motivation
Startup can take around 10 minutes for a large model.
Your contribution
Happy to open a PR, but I need guidance.