vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Implement custom kernels for top-k and top-p sampling #125

Open · WoosukKwon opened this issue 1 year ago

WoosukKwon commented 1 year ago

As mentioned in https://github.com/WoosukKwon/cacheflow/pull/81#issuecomment-1546980281, the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.
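
For context, here is a minimal sketch of the kind of PyTorch-level top-k/top-p filtering being discussed (an illustration, not vLLM's actual code); the full-vocabulary sort is where the memory overhead comes from:

import torch

def top_k_top_p_filter(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    # logits: [batch, vocab]. torch.sort materializes sorted values AND
    # indices for the entire vocabulary, on top of the softmax/cumsum buffers.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    # top-p: mask tokens whose preceding cumulative probability exceeds p.
    mask = cum_probs - probs > p
    # top-k: also mask everything past the k-th sorted token.
    mask[..., k:] = True
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
    # Scatter the filtered logits back to the original token order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)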

SuperCB commented 1 year ago

Are you referring to implementing a more efficient CUDA kernel for this function?

SuperCB commented 1 year ago

Or do you mean implementing a sampling operator that can simultaneously support multiple sampling parameters?

WoosukKwon commented 1 year ago

Hi @SuperCB, more specifically, this includes

  1. Replacing torch.sort and torch.topk with more (memory-)efficient CUDA kernels.
  2. A sampling kernel that can simultaneously support multiple sampling parameters, as you said (see the sketch below).
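
To make (2) concrete, here is a hedged sketch (a hypothetical function, not an existing vLLM API) of a batched filter in which every request in the batch carries its own k and p; a fused CUDA kernel would do this in a single pass without the Python-level loop over parameter groups:

import torch

def batched_top_k_top_p(logits: torch.Tensor,
                        ks: torch.Tensor,   # [batch] int64, per-request top-k
                        ps: torch.Tensor):  # [batch] float, per-request top-p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    ranks = torch.arange(logits.shape[-1], device=logits.device)
    # Per-row thresholds via broadcasting; a custom kernel would fuse these
    # steps and avoid materializing the sorted copies entirely.
    mask = (cum_probs - probs > ps.unsqueeze(-1)) | (ranks >= ks.unsqueeze(-1))
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
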
SuperCB commented 1 year ago

Is using NVIDIA's CUB sort a good choice?

SuperCB commented 1 year ago

In addition, I am currently working on writing a sort function that does not use CUB.

SuperCB commented 1 year ago

I benchmarked the sort function from NVIDIA's CUB library against PyTorch's sort function. The results below show that CUB's sort is slower than PyTorch's, but it consumes significantly less GPU memory. Given that sorting contributes relatively little to the overall inference latency of a large language model, have you considered replacing PyTorch's sort with CUB's sort as a memory optimization?

[benchmark screenshots: latency of PyTorch sort vs. CUB sort]

SuperCB commented 1 year ago

import torch
import vllm_sort  # custom extension wrapping CUB's DeviceSegmentedSort
from loguru import logger


class CudaTimer:
    """Times a GPU region using CUDA events."""

    def __init__(self, name):
        self.name = name
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        self.start_event.record()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.end_event.record()
        self.end_event.synchronize()  # wait until the timed kernels finish
        self.elapsed_time_ms = self.start_event.elapsed_time(self.end_event)
        logger.info(f"{self.name} cost is: {self.elapsed_time_ms / 1000:.8f}s")
        return False  # do not swallow exceptions raised inside the block


def main(size):
    logger.info(f"size is {size}")
    out = torch.rand((1, size), device="cuda:0")
    tensor = torch.rand((1, size), device="cuda:0")
    with CudaTimer("pytorch sort"):
        torch.sort(tensor)
    with CudaTimer("cub sort"):
        vllm_sort.vllm_sort(tensor, out)


if __name__ == "__main__":
    for i in range(1, 9):
        main(i * 10**6)

This is the code I used for testing. vllm_sort internally calls the DeviceSegmentedSort function provided by CUB; I implemented that wrapper myself.

SuperCB commented 1 year ago

@WoosukKwon

WoosukKwon commented 1 year ago

@SuperCB Thanks for your effort. Could you also measure the memory usage of the two implementations?
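
For reference, one simple way to capture peak GPU memory for each implementation (a sketch reusing the vllm_sort extension from the benchmark above; note that torch.cuda.max_memory_allocated only counts allocations made through PyTorch's caching allocator, so temporary storage the extension allocates with raw cudaMalloc would not appear here):

import torch
import vllm_sort  # the same custom CUB wrapper as in the benchmark above

def peak_memory_mib(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**2

tensor = torch.rand((1, 10**8), device="cuda:0")
out = torch.empty_like(tensor)
print("pytorch sort:", peak_memory_mib(torch.sort, tensor), "MiB")
print("cub sort:", peak_memory_mib(vllm_sort.vllm_sort, tensor, out), "MiB")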

calvin1978 commented 1 year ago

TRT-LLM has these kernels, why not just use them?

SuperCB commented 1 year ago

> TRT-LLM has these kernels, why not just use them?

I couldn't find the open-source code for the project 'TRT-LLM'.

SuperCB commented 1 year ago

> @SuperCB Thanks for your effort. Could you also measure the memory usage of the two implementations?

The following are the memory usage percentages for the two algorithms when sorting 10^8 numbers.

[memory usage screenshots: PyTorch sort vs. CUB sort]

calvin1978 commented 1 year ago

FasterTransformer has these kernels:

https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kernels/sampling_topk_kernels.cu

calvin1978 commented 1 year ago

> > TRT-LLM has these kernels, why not just use them?
>
> I couldn't find the open-source code for the project 'TRT-LLM'.

I mean "TensorRT-LLM: A TensorRT toolbox for Large Language Models"; it hasn't been released as GA by NVIDIA yet.

SuperCB commented 1 year ago

> FasterTransformer has these kernels:
>
> https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kernels/sampling_topk_kernels.cu

Yes, I am in the process of migrating this operator from FasterTransformer, but FasterTransformer does not have a 'sort' operator.

SuperCB commented 1 year ago

I have implemented a new kernel on this branch, but I'm currently not quite sure how to integrate this operation into vLLM. I hope @WoosukKwon can provide me with some suggestions. In that pull request, I have completed the two functionalities described above.

SuperCB commented 1 year ago

Sorry for bothering you once again. I have implemented a top-k operator that supports multiple parameters. Using this operator will require some changes to the vLLM code. Would you like to see these modifications? If they are necessary for the development of vLLM, I am more than happy to help implement them. @WoosukKwon
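
To illustrate the kind of change involved, a hypothetical sketch (the variable names and batched_top_k_top_p are assumptions from the earlier sketch, not vLLM's actual internals): instead of grouping requests by identical parameters, the sampler could gather per-request parameters into tensors and make one fused call for the whole batch.

import torch
from vllm import SamplingParams  # holds per-request top_k and top_p

# Hypothetical batch of per-request parameters (values made up for the demo).
params = [SamplingParams(top_k=50, top_p=0.9), SamplingParams(top_k=20, top_p=0.95)]
logits = torch.rand((len(params), 32000), device="cuda:0")

ks = torch.tensor([p.top_k for p in params], device=logits.device)
ps = torch.tensor([p.top_p for p in params], device=logits.device)
# One fused call for the whole batch, reusing the batched_top_k_top_p sketch above.
filtered_logits = batched_top_k_top_p(logits, ks, ps)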

hmellor commented 8 months ago

Has this work been completed?

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!