Open SanjuCSudhakaran opened 3 days ago
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the `fastcheck` CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your `fastcheck` build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can add the `ready` label to the PR. 🚀
Can we do it like `CustomOp`, where `PunicaWrapper` remains unchanged and we decide which ops to call based on the hardware? I think this approach would be more extensible.
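Something along these lines is what I have in mind (a minimal sketch only; every name below is a placeholder rather than the actual vLLM/Punica API, and the op bodies just do the reference matmul so the snippet stays runnable):

```python
import torch


def _add_lora_default(y: torch.Tensor, x: torch.Tensor,
                      lora_a: torch.Tensor, lora_b: torch.Tensor,
                      scale: float = 1.0) -> None:
    # Placeholder for the existing GPU path (e.g. the bgmv/sgmv kernels).
    y += scale * ((x @ lora_a) @ lora_b)


def _add_lora_hpu(y: torch.Tensor, x: torch.Tensor,
                  lora_a: torch.Tensor, lora_b: torch.Tensor,
                  scale: float = 1.0) -> None:
    # Placeholder for an HPU-specific kernel.
    y += scale * ((x @ lora_a) @ lora_b)


class PunicaWrapperSketch:
    """One wrapper class; the hardware decides which op implementation runs."""

    def __init__(self, device: torch.device):
        # Select the op set once, based on the device type.
        self._add_lora = (_add_lora_hpu if device.type == "hpu"
                          else _add_lora_default)

    def add_lora(self, y, x, lora_a, lora_b, scale=1.0):
        self._add_lora(y, x, lora_a, lora_b, scale)
```

This way new hardware only plugs in its own op implementations, and the wrapper itself stays unchanged.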
`GaudiPunicaWrapper` is a class inherited from `PunicaWrapper` which does the following (see the sketch below):
- Overrides `add_lora`, `add_lora_packed_nslice` and `add_lora_logits` with HPU-specific implementations.
- Uses `add_lora_embedding` instead of `add_expand` to handle the LoRA B embedding computation.
I'd like to get a little more clarity on `CustomOp`.
I see that classes like `RotaryEmbedding` use `CustomOp` to automatically handle the forward pass based on the backend device. Here `RotaryEmbedding` inherits from `torch.nn.Module`, and `CustomOp` handles which forward pass to call.
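The pattern I'm referring to looks roughly like this (a simplified sketch, not vLLM's actual `CustomOp` code; the backend checks are reduced to plain torch queries):

```python
import torch
import torch.nn as nn


class CustomOpSketch(nn.Module):
    """nn.Module whose forward() routes to a backend-specific method."""

    def forward(self, *args, **kwargs):
        if torch.cuda.is_available():
            return self.forward_cuda(*args, **kwargs)
        if hasattr(torch, "hpu") and torch.hpu.is_available():
            return self.forward_hpu(*args, **kwargs)
        return self.forward_native(*args, **kwargs)

    def forward_native(self, *args, **kwargs):
        raise NotImplementedError

    def forward_cuda(self, *args, **kwargs):
        # Backends fall back to the native implementation unless overridden.
        return self.forward_native(*args, **kwargs)

    def forward_hpu(self, *args, **kwargs):
        return self.forward_native(*args, **kwargs)


class RotaryEmbeddingSketch(CustomOpSketch):
    """Toy stand-in: a real op would put device-specific kernels in
    forward_cuda / forward_hpu and the pure-PyTorch math in forward_native."""

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        return x  # placeholder math
```

The important point is that this dispatch hangs off an `nn.Module` subclass and its `forward()`.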
But here neither `PunicaWrapper` nor the ops in it are `torch.nn.Module` subclasses. So to use `CustomOp`, the ops in `PunicaWrapper` would have to be rewritten as `torch.nn.Module`s, which requires a lot of refactoring.
@jeejeelee Please correct me if I am missing something here.
@SanjuCSudhakaran Thanks for your feedback. I'm mainly considering that other hardware may need LoRA support in the future. If we add it the way this PR currently does, it could lead to a lot of redundant conditionals. IMHO, the main difference between different hardware should be in the implementation of the 6 LoRA ops. BTW, I actually have ideas and plans for extending LoRA support to other hardware. If you're interested, we can discuss this on Slack.
This PR enables LoRA support on Intel Gaudi by adding HPU-specific kernels in `GaudiPunicaWrapper` to handle LoRA computations more efficiently on the hardware.