WoosukKwon opened this issue 1 year ago
Are you referring to implementing a more efficient CUDA kernel for this function, or to implementing a sampling operator that can support multiple sampling parameters simultaneously?
Hi @SuperCB, more specifically, this means replacing `torch.sort` and `torch.topk` with more (memory-)efficient CUDA kernels.

In addition, I am currently working on writing a sort function that does not use cub.
I benchmarked the sort function from NVIDIA's cub library against PyTorch's sort function. The results show that cub's sort is slower than PyTorch's, but it consumes significantly less GPU memory. Given that sorting contributes relatively little to the overall inference latency of a large language model, have you considered replacing PyTorch's sort with cub's sort to reduce memory usage?
```python
import torch
import vllm_sort  # custom extension wrapping cub's segmented sort
from loguru import logger


class CudaTimer:
    """Context manager that times a GPU operation with CUDA events."""

    def __init__(self, name):
        self.name = name
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        self.start_event.record()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> bool:
        self.end_event.record()
        self.end_event.synchronize()  # wait for the timed kernels to finish
        self.elapsed_time_ms = self.start_event.elapsed_time(self.end_event)
        logger.info(f"{self.name} cost is: {self.elapsed_time_ms / 1000:.8f}s")
        return False  # do not suppress exceptions raised inside the block


def main(size):
    logger.info(f"size is {size}")
    out = torch.rand((1, size), device="cuda:0")  # output buffer for vllm_sort
    tensor = torch.rand((1, size), device="cuda:0")
    with CudaTimer("pytorch sort"):
        torch.sort(tensor)
    with CudaTimer("cub sort"):
        vllm_sort.vllm_sort(tensor, out)


if __name__ == "__main__":
    for i in range(1, 9):
        main(i * 10**6)
```
This is the code I used for testing. `vllm_sort` internally calls the `DeviceSegmentedSort` function provided by cub; I implemented this wrapper logic myself.
@WoosukKwon
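For readers unfamiliar with cub's segmented sort: it sorts each segment of a buffer independently, which for a 2-D tensor with one segment per row is equivalent to a row-wise `torch.sort`. A minimal PyTorch sketch of those semantics (my own reference, not the actual `vllm_sort` implementation), usable for cross-checking the custom kernel:

```python
import torch

def segmented_sort_reference(values: torch.Tensor) -> torch.Tensor:
    """Reference semantics of cub::DeviceSegmentedSort when each row of
    a 2-D tensor is treated as one segment: rows sort independently."""
    sorted_values, _ = torch.sort(values, dim=-1)
    return sorted_values

# Hypothetical cross-check against the custom extension:
#   tensor = torch.rand((1, 10**6), device="cuda:0")
#   out = torch.empty_like(tensor)
#   vllm_sort.vllm_sort(tensor, out)
#   torch.testing.assert_close(out, segmented_sort_reference(tensor))
```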
@SuperCB Thanks for your effort. Could you also measure the memory usage of the two implementations?
TRT-LLM has these kernels, why not just ........
I couldn't find the open-source code for the project 'TRT-LLM'.
> @SuperCB Thanks for your effort. Could you also measure the memory usage of the two implementations?

The following are the memory usage percentages of the two algorithms when sorting 10^8 numbers.
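One way to take such a measurement with PyTorch's built-in allocator statistics (a sketch; note that `torch.cuda.max_memory_allocated()` only tracks PyTorch's caching allocator, so memory that a custom extension `cudaMalloc`s directly is approximated separately via device-level free memory):

```python
import torch

def measure_memory(fn, *args):
    """Run fn(*args) and report GPU memory usage in bytes."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    free_before, _ = torch.cuda.mem_get_info()
    fn(*args)
    torch.cuda.synchronize()
    free_after, _ = torch.cuda.mem_get_info()
    peak_torch_bytes = torch.cuda.max_memory_allocated()
    held_device_bytes = free_before - free_after  # net device memory still held
    return peak_torch_bytes, held_device_bytes

# Example: measure_memory(torch.sort, torch.rand((1, 10**8), device="cuda:0"))
```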
FasterTransformer has these kernels:
> I couldn't find the open-source code for the project 'TRT-LLM'.

I mean "TensorRT-LLM: A TensorRT toolbox for Large Language Models"; it hasn't been released as GA by NVIDIA yet.
> FasterTransformer has these kernels:

Yes, I am in the process of migrating this operator from FasterTransformer, but FasterTransformer does not have a `sort` operator.
I have implemented a new kernel on this branch, but I'm not quite sure how to integrate this operation into vLLM. I hope @WoosukKwon can give me some suggestions. In this PR, I have completed two functionalities.
Sorry to bother you again. I have implemented a top-k operator that supports multiple sampling parameters. Using it will require some changes to the vLLM code, and I would like to know whether you want to see these modifications. If they are needed for vLLM's development, I am more than happy to help implement them. @WoosukKwon
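To make concrete what "supports multiple sampling parameters" means here, a pure-PyTorch sketch of per-sequence top-k masking (reference semantics only; the operator name and interface in the actual branch may differ):

```python
import torch

def per_sequence_top_k(logits: torch.Tensor, ks: torch.Tensor) -> torch.Tensor:
    """Keep only each row's own top-k logits, with a different k per row.

    logits: [batch, vocab]; ks: [batch] integer tensor, 1 <= k <= vocab.
    This sort-based formulation is what a fused kernel would replace.
    """
    sorted_logits, _ = torch.sort(logits, dim=-1, descending=True)
    # Value of the k-th largest logit in each row (ties may keep extras).
    kth_values = sorted_logits.gather(-1, (ks - 1).unsqueeze(-1))
    return logits.masked_fill(logits < kth_values, float("-inf"))

# Example: a batch of 2 sequences with k=2 and k=5 respectively.
# logits = torch.randn(2, 32000, device="cuda:0")
# ks = torch.tensor([2, 5], device="cuda:0")
# filtered = per_sequence_top_k(logits, ks)
```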
Has this work been completed?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
As mentioned in https://github.com/WoosukKwon/cacheflow/pull/81#issuecomment-1546980281, the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.
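For reference, the standard PyTorch formulation of top-p filtering looks roughly like the sketch below (a paraphrase of the common pattern, not the exact code from that PR). Each step materializes a fresh `[batch, vocab]` tensor (sorted copy, softmax, cumulative sum, mask), which is exactly the memory overhead a fused custom kernel would avoid:

```python
import torch

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Sort-based top-p (nucleus) filtering. Every intermediate below is
    a full [batch, vocab] tensor, which makes this memory-inefficient."""
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    # Drop a token once the probability mass *before* it already exceeds p;
    # the most likely token is always kept.
    drop = (cum_probs - probs) > p
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    # Scatter the filtered logits back to the original vocab order.
    return torch.empty_like(logits).scatter_(-1, sorted_idx, sorted_logits)
```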