Open · valer1435 opened 1 day ago
How is tensor parallelism implemented in the code? @RepoPilotAssistant
Tensor parallelism in the code is implemented by distributing the model's tensors across multiple devices or GPUs to parallelize computation. This is achieved using functions like `inplace_all_reduce` and `outplace_all_reduce`, which handle the distribution and aggregation of tensors across a group. Additionally, methods such as `_register_group` and `_get_unique_name` manage the registration and naming of groups, ensuring that tensors are appropriately distributed and gathered based on their device type (CPU or GPU).
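A minimal sketch of the pattern those ops wrap, written against plain `torch.distributed` rather than vLLM's custom ops (the function and variable names below are illustrative, not vLLM's): each GPU computes a partial matmul result, and an all-reduce over the tensor-parallel group sums the partials into the full output.

```python
import torch
import torch.distributed as dist

def row_parallel_matmul(x_shard, weight_shard, tp_group=None):
    # Each rank holds a slice of the activations and the matching slice of
    # the weight along the reduction dimension, so the local matmul yields
    # only a partial sum of the true output.
    partial = x_shard @ weight_shard
    # Summing the partials across the tensor-parallel group gives every rank
    # the full output (cloning first mirrors the "outplace" flavor, which
    # leaves the partial result untouched).
    out = partial.clone()
    dist.all_reduce(out, op=dist.ReduceOp.SUM, group=tp_group)
    return out
```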
How do I implement a custom model with tensor parallelism? @RepoPilotAssistant
To implement a custom model with tensor parallelism in the vLLM framework, you need to follow these steps:

- Use `VocabParallelEmbedding` for the embedding layer and `ParallelLMHead` for the output LM head.
- For the linear layers, choose `ReplicatedLinear`, `RowParallelLinear`, `ColumnParallelLinear`, `MergedColumnParallelLinear`, or `QKVParallelLinear` based on your needs.
- Implement the `load_weights` method in your `*ForCausalLM` class to load weights from HuggingFace's checkpoint file and assign them to the corresponding layers. Handle the separated weight matrices for `MergedColumnParallelLinear` and `QKVParallelLinear` layers.
- Register your `*ForCausalLM` class in `_VLLM_MODELS` in `vllm/model_executor/models/registry.py`.

Alternatively, for out-of-tree model integration, you can skip steps 1 and 5 and directly register your model using the `ModelRegistry.register_model` method.
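As a hedged illustration of the layer choices above, here is a sketch of a decoder MLP block built from vLLM's parallel linear layers. The class name `MyMLP` and the sizes are invented for the example, and the exact import path may vary between vLLM versions.

```python
from torch import nn
from vllm.model_executor.layers.linear import (
    MergedColumnParallelLinear,
    RowParallelLinear,
)

class MyMLP(nn.Module):
    """Illustrative MLP block for a custom *ForCausalLM model."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # gate_proj and up_proj are fused into one column-parallel layer, so
        # load_weights must map both HuggingFace weight names onto this single
        # module (the "separated weight matrices" case mentioned above).
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2, bias=False)
        self.act = nn.SiLU()
        # The down projection reduces over the sharded dimension, so it is
        # row-parallel and all-reduces its output across the GPUs.
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        gate_up, _ = self.gate_up_proj(x)   # parallel layers return (out, bias)
        gate, up = gate_up.chunk(2, dim=-1)
        out, _ = self.down_proj(self.act(gate) * up)
        return out
```

For the out-of-tree path, registration is typically a one-liner along the lines of `ModelRegistry.register_model("MyModelForCausalLM", MyModelForCausalLM)` after `from vllm import ModelRegistry`.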
Please explain in detail how chunked prefill works @RepoPilotAssistant
Chunked prefill is an experimental feature in vLLM that allows large prefill requests to be divided into smaller chunks, which can then be batched together with decode requests. It can be enabled by specifying `--enable-chunked-prefill` on the command line or by setting `enable_chunked_prefill=True` in the LLM constructor. When enabled, the scheduler prioritizes decode requests, batching all pending decode requests before scheduling any prefill. If a prefill request cannot fit into the available token budget (`max_num_batched_tokens`), it is chunked. This approach improves inter-token latency (ITL) during generation and achieves better GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. Performance can be tuned by adjusting `max_num_batched_tokens`: the default value of 512 is optimized for ITL on A100 GPUs, and for better throughput it is recommended to set `max_num_batched_tokens` greater than 2048.
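A small usage sketch, assuming the offline `LLM` API; the model name and the 4096-token budget below are illustrative choices, not values taken from this thread beyond the "greater than 2048 for throughput" guideline.

```python
from vllm import LLM, SamplingParams

# Enable chunked prefill and raise the per-step token budget for throughput.
llm = LLM(
    model="facebook/opt-125m",       # illustrative model choice
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,     # > 2048, trading some ITL for throughput
)

outputs = llm.generate(
    ["Explain how chunked prefill works."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```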
hi