Open wenyuzzz opened 4 months ago
The model to consider.
Thanks to the vLLM team for their efforts. I am currently preparing to optimize the inference performance of WeMM; the model link is provided below.
https://huggingface.co/feipengma/WeMM-Chat-2k-CN
The closest model vllm already supports.
WeMM is based on internlm2.
What's your difficulty of supporting the model you want?
The overall framework starts with modeling_wemm.py, which passes the data to modeling_internlm2.py.
However, modeling_internlm2.py replaces the basic linear layers with PLoRA and adds a mask. The code is available in the WeMM-Chat-2k-CN repo linked above. The PLoRA class is declared as follows:

```python
class PLoRA(nn.Module):
    ...
```
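For reference, my understanding is that such a partial-LoRA layer behaves roughly like the sketch below. This is only my own simplified reconstruction: the attribute names (`Plora_A`, `Plora_B`, `im_mask`) and the hyperparameters are placeholders borrowed from the InternLM-XComposer2-style implementation, so please check the actual WeMM code for the exact details.

```python
import torch.nn as nn


class PLoRA(nn.Module):
    """Simplified sketch, not the actual WeMM implementation: a linear layer
    whose low-rank (LoRA) branch is applied only to the positions selected
    by im_mask (e.g. the image tokens)."""

    def __init__(self, in_features, out_features, bias=False,
                 lora_r=8, lora_alpha=16, lora_dropout=0.05):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=bias)
        self.lora_scaling = lora_alpha / lora_r
        self.lora_dropout = nn.Dropout(p=lora_dropout)
        self.Plora_A = nn.Linear(in_features, lora_r, bias=False)
        self.Plora_B = nn.Linear(lora_r, out_features, bias=False)

    def forward(self, x, im_mask=None):
        # Plain linear projection for every token.
        res = self.base(x)
        if im_mask is not None and im_mask.any():
            # Low-rank update added only at the masked (image) positions.
            part = self.lora_dropout(x[im_mask])
            part = self.Plora_B(self.Plora_A(part)) * self.lora_scaling
            res[im_mask] = res[im_mask] + part
        return res
```

The key point is that forward() takes an extra im_mask argument in addition to the hidden states, which is what makes the integration non-trivial.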
In the subsequent code, PLoRA has replaced Linear:
```python
class InternLM2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.w1 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        ...


class InternLM2Attention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""
    ...
```
If Linear is directly replaced with PLoRA in this code, do the attention and MLP implementations further down also need to be modified?
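My current guess is that they do: once the projections become PLoRA, the image-token mask has to be threaded through every forward() call. Reusing the PLoRA sketch above, the MLP would then look roughly like this (the w1/w2/w3 names and the SiLU gating follow the stock InternLM2 MLP; this is only my assumption, not the actual WeMM code):

```python
class InternLM2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        # nn.Linear swapped for PLoRA on all three projections.
        self.w1 = PLoRA(self.hidden_size, self.intermediate_size, bias=False)
        self.w3 = PLoRA(self.hidden_size, self.intermediate_size, bias=False)
        self.w2 = PLoRA(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x, im_mask=None):
        # im_mask now has to be passed into every projection.
        gate = self.act_fn(self.w1(x, im_mask))
        up = self.w3(x, im_mask)
        return self.w2(gate * up, im_mask)
```

The attention class would need the same treatment for its projections (wqkv/wo, if I read the code correctly), and the decoder layers and the model forward would have to pass im_mask down as well.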
Or is there another way to make this modification? Looking forward to your reply.