Open · valer1435 opened 1 day ago
How is tensor parallelism implemented in the code? @RepoPilotAssistant
Tensor parallelism in the code is implemented by distributing the model's tensors across multiple devices or GPUs to parallelize computation. This is achieved using functions like `inplace_all_reduce` and `outplace_all_reduce`, which handle the distribution and aggregation of tensors across a group. Additionally, methods such as `_register_group` and `_get_unique_name` manage the registration and naming of groups, ensuring that tensors are appropriately distributed and gathered based on their device type (CPU or GPU).
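A minimal sketch of the pattern those ops wrap, written against plain `torch.distributed` rather than vLLM's custom ops (the function and variable names below are illustrative, not vLLM's): each GPU computes a partial matmul result, and an all-reduce over the tensor-parallel group sums the partials into the full output.

```python
import torch
import torch.distributed as dist

def row_parallel_matmul(x_shard, weight_shard, tp_group=None):
    # Each rank holds a slice of the activations and the matching slice of
    # the weight along the reduction dimension, so the local matmul yields
    # only a partial sum of the true output.
    partial = x_shard @ weight_shard
    # Summing the partials across the tensor-parallel group gives every rank
    # the full output (cloning first mirrors the "outplace" flavor, which
    # leaves the partial result untouched).
    out = partial.clone()
    dist.all_reduce(out, op=dist.ReduceOp.SUM, group=tp_group)
    return out
```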
How do I implement a custom model with tensor parallelism? @RepoPilotAssistant
To implement a custom model with tensor parallelism in the vLLM framework, you need to follow these steps:

- Use `VocabParallelEmbedding` for the embedding layer and `ParallelLMHead` for the output LM head.
- For the linear layers, choose `ReplicatedLinear`, `RowParallelLinear`, `ColumnParallelLinear`, `MergedColumnParallelLinear`, or `QKVParallelLinear` based on your needs.
- Implement the `load_weights` method in your `*ForCausalLM` class to load weights from HuggingFace's checkpoint file and assign them to the corresponding layers. Handle the separated weight matrices for `MergedColumnParallelLinear` and `QKVParallelLinear` layers.
- Register your `*ForCausalLM` class in `_VLLM_MODELS` in `vllm/model_executor/models/registry.py`.

Alternatively, for out-of-tree model integration, you can skip steps 1 and 5 and directly register your model using the `ModelRegistry.register_model` method.
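As a hedged illustration of the layer choices above, here is a sketch of a decoder MLP block built from vLLM's parallel linear layers. The class name `MyMLP` and the sizes are invented for the example, and the exact import path may vary between vLLM versions.

```python
from torch import nn
from vllm.model_executor.layers.linear import (
    MergedColumnParallelLinear,
    RowParallelLinear,
)

class MyMLP(nn.Module):
    """Illustrative MLP block for a custom *ForCausalLM model."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # gate_proj and up_proj are fused into one column-parallel layer, so
        # load_weights must map both HuggingFace weight names onto this single
        # module (the "separated weight matrices" case mentioned above).
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2, bias=False)
        self.act = nn.SiLU()
        # The down projection reduces over the sharded dimension, so it is
        # row-parallel and all-reduces its output across the GPUs.
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        gate_up, _ = self.gate_up_proj(x)   # parallel layers return (out, bias)
        gate, up = gate_up.chunk(2, dim=-1)
        out, _ = self.down_proj(self.act(gate) * up)
        return out
```

For the out-of-tree path, registration is typically a one-liner along the lines of `ModelRegistry.register_model("MyModelForCausalLM", MyModelForCausalLM)` after `from vllm import ModelRegistry`.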
Please explain in detail how chunked prefill works @RepoPilotAssistant
Chunked prefill is an experimental feature in vLLM that allows large prefill requests to be divided into smaller chunks, which can then be batched together with decode requests. It can be enabled by specifying `--enable-chunked-prefill` on the command line or by setting `enable_chunked_prefill=True` in the LLM constructor. When enabled, the scheduler prioritizes decode requests, batching all pending decode requests before scheduling any prefill. If a prefill request cannot fit into the available token budget (`max_num_batched_tokens`), it is chunked. This approach improves inter-token latency (ITL) during generation and achieves better GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. Performance can be tuned by adjusting `max_num_batched_tokens`: the default value of 512 is optimized for ITL on A100 GPUs, and for better throughput it is recommended to set `max_num_batched_tokens` greater than 2048.
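A small usage sketch, assuming the offline `LLM` API; the model name and the 4096-token budget below are illustrative choices, not values taken from this thread beyond the "greater than 2048 for throughput" guideline.

```python
from vllm import LLM, SamplingParams

# Enable chunked prefill and raise the per-step token budget for throughput.
llm = LLM(
    model="facebook/opt-125m",       # illustrative model choice
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,     # > 2048, trading some ITL for throughput
)

outputs = llm.generate(
    ["Explain how chunked prefill works."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```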
hi