huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Medusa models seem to be slower than the original base models #1641

Closed · infinitylogesh closed this 2 months ago

infinitylogesh commented 3 months ago

System Info

Thank you for adding support for Medusa. In my comparison of Medusa models against their original base models with TGI, the base models turned out to be faster.

I tested the models below:

[Screenshot: speed comparison of the tested Medusa and base models]

Reproduction

Command used:

docker run --gpus all --shm-size 1g -p 8081:80 ghcr.io/huggingface/text-generation-inference:1.4.3 --model-id text-generation-inference/Mistral-7B-Instruct-v0.2-medusa --num-shard 1 

Hardware:

1xH100
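For reference, a minimal sketch of how the comparison can be quantified: time a generation request against the running container and compute tokens per second. It assumes the container from the command above is listening on localhost:8081 and uses TGI's /generate endpoint; the prompt, token counts, and number of runs are arbitrary choices for illustration.

import time
import requests

# Assumes the TGI container started above is listening on localhost:8081.
URL = "http://localhost:8081/generate"

def tokens_per_second(prompt: str, max_new_tokens: int = 128) -> float:
    """Time a single /generate call and return decode throughput."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "details": True},
    }
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    generated = resp.json()["details"]["generated_tokens"]
    return generated / elapsed

if __name__ == "__main__":
    prompt = "Explain speculative decoding in one paragraph."
    # Warm-up request so caches do not skew the first timing.
    tokens_per_second(prompt, max_new_tokens=16)
    rates = [tokens_per_second(prompt) for _ in range(5)]
    print(f"mean throughput: {sum(rates) / len(rates):.1f} tokens/s")

Running the same script against a container serving the base Mistral-7B-Instruct-v0.2 on the same GPU gives a like-for-like comparison. Separately, it may be worth confirming in the launcher logs that speculation is actually active for the Medusa model; the launcher also exposes a --speculate argument controlling the number of speculative tokens.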

Expected behavior

Medusa models should be faster than the original, non-Medusa base models.
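Worth noting for anyone triaging this: Medusa only wins when the extra work per forward pass (the additional heads plus verification of the candidate tree) is paid back by accepted draft tokens. A rough back-of-the-envelope model, illustrative only and not TGI's internal accounting:

# Illustrative model of Medusa's net speedup (not TGI's internal accounting).
# Each Medusa step costs more than a plain decode step, but can emit
# several tokens at once if the draft heads' predictions are accepted.

def medusa_speedup(mean_accepted: float, step_overhead: float) -> float:
    """mean_accepted: average tokens emitted per Medusa step (>= 1.0).
    step_overhead: cost of one Medusa step relative to one plain step."""
    return mean_accepted / step_overhead

# If ~2.5 tokens are accepted per step at 1.3x step cost, Medusa wins:
print(medusa_speedup(2.5, 1.3))   # ~1.92x faster
# With poor acceptance (e.g. off-distribution prompts) it loses:
print(medusa_speedup(1.1, 1.3))   # ~0.85x, i.e. slower than the base model

So a measured slowdown is consistent with low acceptance rates on the prompts being tested, not necessarily a bug in the serving path.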

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.