wangshuai09 opened 1 month ago
we can do it step by step.
- `is_cpu` -> `current_platform.is_cpu`
- `is_xpu` -> `current_platform.is_xpu`
- `is_openvino` -> `current_platform.is_openvino`
- `is_neuron` -> `current_platform.is_neuron`
this can be the first step, and should be easy to do.
the rest might need some case-by-case discussion.
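for illustration, the pattern at a call site would look roughly like this (just a sketch; the exact helper names may differ slightly):

```python
from vllm.platforms import current_platform

# before: per-backend helpers imported from vllm.utils, e.g.
#     from vllm.utils import is_cpu, is_xpu
#     if is_cpu(): ...
# after: the same question is asked of the single platform object
if current_platform.is_cpu():
    pass  # CPU-specific handling
elif current_platform.is_xpu():
    pass  # XPU-specific handling
```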
JFYI, the refactoring of the neuron backend check was done in #9374.
Although I think this device-agnostic framework is wonderful, it is actually quite challenging to find a balance between high-level abstraction and low-level performance.
can you elaborate on that?
Hey, I think the idea is very interesting and the problem surely must've been tackled many times across many projects.
Personally I think the unified interface that needs to be provided here is still a bit too granular, i.e., the worker needs to call into too many accelerator-specific functions to carry out its logic.
Bringing one example to the table: ort https://onnxruntime.ai/docs/execution-providers/ has the concept of an "ExecutionProvider", but the interface is simple enough to group common operations into higher-level framework-specific abstractions, so you don't have to implement dozens of functions. TFLite had delegates, but I think that example isn't as good.
Some pain points off the top of my head: execution on CPU will likely implement only a small subset of all the ops, so the executor/worker/interface logic has to have good defaults. Calling into a closed-source accelerator library may likewise not cover all the functions (not applicable here, but e.g. CoreML), which is the same point.
Each backend in ort implements its own `ExecutionProvider`, which is based on `IExecutionProvider`. I think this is similar to `xxxPlatform` and `Platform` in vLLM. For CPU, it is also easy to remove or reimplement some ops based on `Platform`.
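To make the "good defaults" point concrete, here is a rough sketch (hypothetical method names, not vLLM's actual API): the base `Platform` carries a sensible default, and a backend such as CPU only overrides the handful of ops that behave differently instead of implementing dozens of functions.

```python
import torch


class Platform:
    @classmethod
    def empty_cache(cls) -> None:
        # Good default: a no-op, so backends that have nothing to free
        # (or that don't implement this op) still work out of the box.
        pass


class CudaPlatform(Platform):
    @classmethod
    def empty_cache(cls) -> None:
        # Only the backends that actually need the op override it.
        torch.cuda.empty_cache()


class CpuPlatform(Platform):
    # Inherits the no-op default; nothing to implement.
    pass
```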
Of course, a full discussion is necessary. Thanks for your discussion and help. We have finished the first step, Backend Type Check, and are ready to work on Backend Related Func. This second step aims to remove backend-related functions that need `if...else...` logic to handle different hardware. `Platform` will provide an interface with the same function name, and each `xxxPlatform` can implement its own version. Can you give me some advice for the second step? Thanks.
@youkaichao could you give some advice on this?
You can try to find code like this: very long if-else branching logic based on `current_platform`. It can be unified by `current_platform.get_default_atten_backend` or something like that.
We should start by sorting the candidates by the number of if-else branches. If there are more than 3 branches, it means at least 3 backends support the feature, and we can move it inside `platforms`. If not, we can just keep them as they are for now.
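As a rough sketch of the kind of branch this collapses (the unified method name is only a placeholder, and the backend names are just for illustration):

```python
from vllm.platforms import current_platform

# Before: long if-else chains like this at the call site.
if current_platform.is_rocm():
    backend_name = "ROCM_FLASH"
elif current_platform.is_tpu():
    backend_name = "PALLAS"
elif current_platform.is_cpu():
    backend_name = "TORCH_SDPA"
else:
    backend_name = "FLASH_ATTN"

# After: each xxxPlatform answers for itself (placeholder method name).
backend_name = current_platform.get_default_attn_backend()
```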
Thanks! I got what you mean, I'll do this work step by step.
@youkaichao I have listed the remaining methods involving multiple backend branches below and will implement them one by one in follow-up PRs (a rough sketch for the first row is shown after the table). If you have any suggestions, please let me know.
| code path | func | func-refactor | related backends | other info |
|---|---|---|---|---|
| `vllm/config.py` | `ModelConfig._verify_quantization` | `current_platform` | rocm/tpu/neuron | |
| ~~`vllm/config.py`~~ | ~~`DeviceConfig.__init__`~~ | ~~`current_platform.device_config_init`~~ | ~~cuda_like/neuron/hpu/openvino/tpu/cpu/xpu~~ | |
| `vllm/utils.py` | `is_pin_memory_available` | `current_platform.is_pin_memory_available` | xpu/neuron/hpu/cpu/openvino | TODO: how to deal with `in_wsl`? Just leave it here? |
| `vllm/model_executor/custom_op.py` | `CustomOp.dispatch_forward` | `current_platform.custom_forward` | when enabled: rocm/cpu/hpu/tpu/xpu/cuda-for-default | |
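For the first row, the direction I have in mind is roughly the following sketch (hypothetical attribute and method names, just to show how the rocm/tpu/neuron special cases could move out of `ModelConfig._verify_quantization`):

```python
class Platform:
    # Empty list means "no platform-specific restriction".
    supported_quantization: list = []

    @classmethod
    def verify_quantization(cls, quant: str) -> None:
        if cls.supported_quantization and quant not in cls.supported_quantization:
            raise ValueError(
                f"{quant} quantization is currently not supported on this platform.")


class TpuPlatform(Platform):
    # Illustrative: restrict TPU to the methods it actually supports.
    supported_quantization = ["tpu_int8"]
```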
Please don't directly change `DeviceConfig.__init__`; instead, have `current_platform.device_type` be a string, and read `current_platform.device_type` inside `DeviceConfig.__init__`.
Let's do it step by step; the others need further discussion.
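Roughly something like this (a sketch only; the exact layout is to be decided):

```python
class Platform:
    device_type: str = ""


class CudaPlatform(Platform):
    device_type: str = "cuda"


class XPUPlatform(Platform):
    device_type: str = "xpu"


# In vLLM, current_platform is resolved once at import time based on the
# detected hardware; a CUDA machine is assumed here just for the example.
current_platform = CudaPlatform()


class DeviceConfig:
    def __init__(self, device: str = "auto") -> None:
        # No per-backend if/elif here: the platform already knows its type.
        self.device_type = (current_platform.device_type
                            if device == "auto" else device)
```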
I added https://github.com/vllm-project/vllm/pull/10402 as a first step, to absorb some config checking and updating code into `platforms/`. @wangshuai09 @MengqingCao, if you are interested, you are welcome to do the same thing for the xpu executor, openvino executor, etc.
Sure, I'll start this work with the xpu executor :-)
@youkaichao Good day! The next step I want to take is refactoring `CustomOp.dispatch_forward`. Considering that there are many child classes of `CustomOp`, and they override different forward methods, the first step is to extract the forward-dispatch logic into `Platform`. Here is a tiny example showing the changes I want to make:
`CustomOp.dispatch_forward`:

```diff
 def dispatch_forward(self):
     # NOTE(woosuk): Here we assume that vLLM was built for only one
     # specific backend. Currently, we do not support dynamic dispatching.
     compilation_config = get_current_vllm_config().compilation_config
     enabled = self.enabled()
     if enabled:
         compilation_config.enabled_custom_ops.update([self.__class__.name])
+        return current_platform.get_customop_forward(self)
     else:
         compilation_config.disabled_custom_ops.update(
             [self.__class__.name])
-
-    if not enabled:
         return self.forward_native
```

`xxxPlatform` (e.g., `CudaPlatform`):

```diff
 class CudaPlatform(Platform):
     ...
+    def get_customop_forward(cls, current_op: CustomOp):
+        return current_op.forward_cuda
```
Could you give me some suggestions on this? Thanks a lot!
@MengqingCao `CustomOp.dispatch_forward` is more complicated; let's keep it unchanged for now.
@youkaichao What do you think about the refactoring of `is_pin_memory_available`? I noticed that WSL is only checked in `is_pin_memory_available`, so I would prefer to leave that check where it is and add `is_pin_memory_available` to each platform.
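Concretely, what I have in mind looks roughly like this (a sketch only; the WSL check stays in `vllm/utils.py`, and only the backend branches move into the platforms):

```python
from vllm.platforms import current_platform

# vllm/utils.py (sketch): only the WSL special case stays here;
# in_wsl() is the existing helper in this module.
def is_pin_memory_available() -> bool:
    if in_wsl():
        # Pinning memory in WSL is not supported.
        return False
    return current_platform.is_pin_memory_available()


# vllm/platforms/interface.py (sketch): pinned memory works by default ...
class Platform:
    @classmethod
    def is_pin_memory_available(cls) -> bool:
        return True


# ... and backends without pinned host memory opt out.
class NeuronPlatform(Platform):
    @classmethod
    def is_pin_memory_available(cls) -> bool:
        return False
```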
Motivation.
vLLM has already been adapted to many hardware devices, such as GPU, TPU, and XPU. However, adapting these backends requires implementing separate `Worker`/`Executor`/`Model Runner` frameworks for each of them, which leads to code redundancy and maintenance difficulties. In fact, these hardware framework codes can be abstracted at the device layer, forming a unified framework. This way, only one set of code would need to be maintained, and different backends would only need to implement the device-layer interfaces and any device-specific logic where necessary. I also found that some new features are only added to the GPU-related code. These features are often applicable to other hardware as well, but it is difficult for other backends to notice and follow these updates.

Proposed Change.
This RFC is intended to establish a unified framework. There may be difficulties in integrating each hardware framework into a common framework, but it makes sense to work towards this direction. The diagram below represents a proposed solution:
Taking `Executor` as an example: for third-party hardware devices based on the pytorch ecosystem, the basic torch interfaces have already been well adapted, so after abstracting away the device-related hard coding such as `torch.cuda` and `torch.xpu`, the `GPU Executor` could be used as the `Common Executor` for all third-party devices.

Following https://github.com/vllm-project/vllm/pull/6080, different hardware backends can put their own device-specific code in `NewBackendPlatform`, so that the framework can be device-agnostic through `current_platform`. For example, `torch.cuda.synchronize` could become `current_platform.synchronize`.
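A call site would change roughly as follows (`current_platform.synchronize` is part of the proposal, not an existing API):

```python
# Today (hard-coded for CUDA):
#     torch.cuda.synchronize()

# Proposed (device-agnostic; each xxxPlatform implements it for its device,
# e.g. torch.xpu.synchronize() for XPU, or a no-op on CPU):
from vllm.platforms import current_platform

current_platform.synchronize()
```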
Feedback Period.
Realizing this idea will touch many files, so the following steps are currently planned to achieve the above goal:
- `is_cpu` -> `current_platform.is_cpu`
- `is_xpu` -> `current_platform.is_xpu`
- `is_openvino` -> `current_platform.is_openvino`
- `is_neuron` -> `current_platform.is_neuron`
- `is_hip` -> `current_platform.is_rocm`
- `seed_everything` -> `current_platform.seed_everything`
- `is_pin_memory_available` -> `current_platform.is_pin_memory_available`
- `DeviceMemoryProfiler` -> `current_platform.memory_profiler`
- `wrap_device` -> `current_platform.wrap_device`
- `torch.xxx.get_device_name` -> `current_platform.get_device`
- `torch.xxx.Event` -> `current_platform.Event`
- `torch.xxx.synchronize` -> `current_platform.synchronize`
- `torch.xxx.Stream` -> `current_platform.Stream`
- `torch.xxx.stream` -> `current_platform.stream`
- `torch.xxx.empty_cache` -> `current_platform.empty_cache`
- `torch.xxx.device_count` -> `current_platform.device_count`
- `torch.xxx.memory_allocated` -> `current_platform.memory_allocated`
- `torch.xxx.set_device` -> `current_platform.set_device`
- `torch.xxx.current_device` -> `current_platform.current_device`
- `torch.xxx.get_device_capability` -> `current_platform.get_device_capability`
- `gpu(neuron,openvino,tpu,xpu,..)_executor` -> `common_backend_executor`
- `gpu(neuron,openvino,tpu,xpu,..)_worker` -> `common_backend_worker`
- `gpu(neuron,openvino,tpu,xpu,..)_model_runner` -> `common_backend_model_runner`
There may be omissions or difficulties in the actual implementation; this list will keep being updated.
CC List.
@youkaichao @WoosukKwon
Any Other Things.
No response