microsoft / vidur

A large-scale simulation framework for LLM inference
MIT License
271 stars 41 forks source link

Check whether the calculation of the 'add_time' is incorrect? #34

Open yipengLeo opened 1 month ago

yipengLeo commented 1 month ago

I found the_get_block_execution_time() function in the vidur/entities/execution_time.py path computes 'add_time' only once.

def _get_block_execution_time(self) -> float:
    return (
        self._get_attention_layer_execution_time()
        + self._get_mlp_layer_execution_time()
        + self._add_time
    )

But in other places, for example, under the vidur/metrics/metrics_store.pypath. 'add_time' is calculated twice

self._push_metric(
OperationMetrics.ADD, batch_id, execution_time.add_time * 2
)

Is this a bug?

AgrawalAmey commented 1 month ago

@yipengLeo thanks for pointing this out, there should be two add operations per block in llama models as far as i remember, let me look into this

nitinkedia7 commented 1 month ago

Hi @yipengLeo, there is indeed a bug here. However, the impact is relatively little as the add_time is a small percentage of the time.

For all the models supported right now except phi-2, there are two add operations in each layer. (only phi-2 has post_attn_norm: false in its model config) 1st add: https://github.com/microsoft/vidur/blob/main/vidur/profiling/mlp/mlp_impl.py#L172 2nd add: https://github.com/microsoft/vidur/blob/main/vidur/profiling/mlp/mlp_impl.py#L179

So, if we ignore phi-2, it would be better to add add_time twice. A proper solution would be to consider these slight differences between model architectures. Please feel free to raise a PR for fixing this!