mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Parallel computations using multiple streams? #2332

Open taegeonum opened 1 month ago

taegeonum commented 1 month ago

❓ General Questions

Hello, in the Phi model, the attention and MLP blocks can be executed in parallel because they have no dependency on each other. In the following code, `self.mixer` and `self.mlp` could run concurrently.

    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int):
        residual = hidden_states
        hidden_states = self.ln(hidden_states)

        with tp.shard_bias(self.mixer.out_proj, self.tensor_parallel_shards), tp.shard_bias(
            self.mlp.fc2, self.tensor_parallel_shards
        ):
            # Both branches consume the same normalized hidden_states and do
            # not read each other's output, so neither depends on the other.
            attn_outputs = self.mixer(hidden_states, paged_kv_cache, layer_id)
            feed_forward_hidden_states = self.mlp(hidden_states)

        # The branch outputs are only combined here, with the residual.
        hidden_states = self._apply_parallel_residual(
            attn_outputs, feed_forward_hidden_states, residual
        )

        return hidden_states
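
For reference, here is a minimal, generic PyTorch sketch of what such overlap looks like with two CUDA streams. This is not MLC LLM's or TVM's API, and the module names and shapes are illustrative assumptions, not taken from the Phi implementation:

    import torch

    # Stand-ins for self.mixer and self.mlp (assumed shapes, not Phi's real blocks).
    attn = torch.nn.Linear(4096, 4096).cuda()
    mlp = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(8, 4096, device="cuda")

    s_attn = torch.cuda.Stream()
    s_mlp = torch.cuda.Stream()

    # Both side streams must wait for the default stream that produced `x`.
    s_attn.wait_stream(torch.cuda.current_stream())
    s_mlp.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s_attn):
        attn_out = attn(x)
    with torch.cuda.stream(s_mlp):
        mlp_out = mlp(x)

    # Re-join: the default stream waits for both branches before combining them.
    cur = torch.cuda.current_stream()
    cur.wait_stream(s_attn)
    cur.wait_stream(s_mlp)
    out = attn_out + mlp_out + x  # parallel residual, as in the Phi block

Note that whether the two branches actually overlap on the GPU depends on each one leaving SMs free; large GEMMs tend to saturate the device, in which case the two streams mostly serialize anyway.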

Questions

Is it possible to execute these two blocks in parallel, e.g. on separate CUDA streams? Does MLC LLM (or the TVM compiler) currently support specifying streams for independent operators?

tqchen commented 1 month ago

This is a good question. It might be possible; however, Phi is a small model, so the impact may not be very noticeable. As of now we haven't tried multi-stream execution, but updating the compiler to enable manual stream specification could be possible.
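
One rough way to check whether overlap could pay off on a given model size is to time each branch alone with CUDA events: if each branch is far from saturating the GPU, multi-stream overlap might help, otherwise the streams will mostly serialize. A sketch, again in generic PyTorch rather than MLC's stack, with assumed layer sizes (Phi-2's hidden size is 2560, but the dims here are only illustrative):

    import torch

    def time_ms(fn, iters=100, warmup=10):
        # Average per-call GPU time in milliseconds, measured with CUDA events.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(warmup):
            fn()
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    attn = torch.nn.Linear(2560, 2560).cuda()
    mlp = torch.nn.Sequential(
        torch.nn.Linear(2560, 10240), torch.nn.GELU(), torch.nn.Linear(10240, 2560)
    ).cuda()
    x = torch.randn(1, 2560, device="cuda")  # batch-1 decode step: small kernels

    print("attn-only :", time_ms(lambda: attn(x)))
    print("mlp-only  :", time_ms(lambda: mlp(x)))
    print("sequential:", time_ms(lambda: (attn(x), mlp(x))))
    # If `sequential` is close to the sum of the two branch times, the kernels
    # are not overlapping today, and the best case for multi-stream execution
    # is roughly max(attn, mlp) instead of attn + mlp.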