mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Parallel computations using multiple streams? #2332

Open taegeonum opened 1 month ago

taegeonum commented 1 month ago

❓ General Questions

Hello, in the Phi model, the attention and MLP blocks can be executed in parallel because they have no dependency on each other. In the following code, `self.mixer` and `self.mlp` could run concurrently.

    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int):
        residual = hidden_states
        hidden_states = self.ln(hidden_states)

        with tp.shard_bias(self.mixer.out_proj, self.tensor_parallel_shards), tp.shard_bias(
            self.mlp.fc2, self.tensor_parallel_shards
        ):
            # Both branches consume the same normalized hidden_states and do
            # not read each other's output, so neither depends on the other.
            attn_outputs = self.mixer(hidden_states, paged_kv_cache, layer_id)
            feed_forward_hidden_states = self.mlp(hidden_states)

        # The branch outputs are only combined here, with the residual.
        hidden_states = self._apply_parallel_residual(
            attn_outputs, feed_forward_hidden_states, residual
        )

        return hidden_states
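
For reference, here is a minimal, generic PyTorch sketch of what such overlap looks like with two CUDA streams. This is not MLC LLM's or TVM's API, and the module names and shapes are illustrative assumptions, not taken from the Phi implementation:

    import torch

    # Stand-ins for self.mixer and self.mlp (assumed shapes, not Phi's real blocks).
    attn = torch.nn.Linear(4096, 4096).cuda()
    mlp = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(8, 4096, device="cuda")

    s_attn = torch.cuda.Stream()
    s_mlp = torch.cuda.Stream()

    # Both side streams must wait for the default stream that produced `x`.
    s_attn.wait_stream(torch.cuda.current_stream())
    s_mlp.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s_attn):
        attn_out = attn(x)
    with torch.cuda.stream(s_mlp):
        mlp_out = mlp(x)

    # Re-join: the default stream waits for both branches before combining them.
    cur = torch.cuda.current_stream()
    cur.wait_stream(s_attn)
    cur.wait_stream(s_mlp)
    out = attn_out + mlp_out + x  # parallel residual, as in the Phi block

Note that whether the two branches actually overlap on the GPU depends on each one leaving SMs free; large GEMMs tend to saturate the device, in which case the two streams mostly serialize anyway.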

Questions

Is it possible to execute these two blocks in parallel, e.g. on separate CUDA streams? Does MLC LLM (or the TVM compiler) currently support specifying streams for independent operators?

tqchen commented 1 month ago

This is a good question. It might be possible; however, Phi is a small model, so the impact may not be very noticeable. As of now we haven't tried multi-stream execution, but updating the compiler to enable manual stream specification could be possible.
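
One rough way to check whether overlap could pay off on a given model size is to time each branch alone with CUDA events: if each branch is far from saturating the GPU, multi-stream overlap might help, otherwise the streams will mostly serialize. A sketch, again in generic PyTorch rather than MLC's stack, with assumed layer sizes (Phi-2's hidden size is 2560, but the dims here are only illustrative):

    import torch

    def time_ms(fn, iters=100, warmup=10):
        # Average per-call GPU time in milliseconds, measured with CUDA events.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(warmup):
            fn()
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    attn = torch.nn.Linear(2560, 2560).cuda()
    mlp = torch.nn.Sequential(
        torch.nn.Linear(2560, 10240), torch.nn.GELU(), torch.nn.Linear(10240, 2560)
    ).cuda()
    x = torch.randn(1, 2560, device="cuda")  # batch-1 decode step: small kernels

    print("attn-only :", time_ms(lambda: attn(x)))
    print("mlp-only  :", time_ms(lambda: mlp(x)))
    print("sequential:", time_ms(lambda: (attn(x), mlp(x))))
    # If `sequential` is close to the sum of the two branch times, the kernels
    # are not overlapping today, and the best case for multi-stream execution
    # is roughly max(attn, mlp) instead of attn + mlp.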