microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI

About running speed #23

Open NieShenRuc opened 1 year ago

NieShenRuc commented 1 year ago

Thanks for your excellent work! I have noticed that torchscale executes the projections mapping x to q, k, and v serially, on lines 84–86 of torchscale/component/multihead_attention.py. Won't three separate linear layers be slower than computing them in parallel with a single fused projection, e.g. self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)?
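
For reference, here is a minimal sketch of the fused projection I mean. This is not torchscale's implementation; the class and variable names are illustrative, and it only shows how one nn.Linear of size 3 * embed_dim can replace three separate q/k/v projections:

```python
import torch
import torch.nn as nn

class FusedQKVSketch(nn.Module):
    """Illustrative only: fuses the q, k, v projections into one matmul."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One weight of shape (3 * embed_dim, embed_dim) instead of
        # three separate (embed_dim, embed_dim) projections.
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, embed_dim)
        bsz, seq_len, _ = x.shape
        qkv = self.qkv_proj(x)          # (batch, seq_len, 3 * embed_dim)
        q, k, v = qkv.chunk(3, dim=-1)  # three (batch, seq_len, embed_dim) tensors

        def split_heads(t):
            # -> (batch, num_heads, seq_len, head_dim)
            return t.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        return split_heads(q), split_heads(k), split_heads(v)


if __name__ == "__main__":
    attn = FusedQKVSketch(embed_dim=512, num_heads=8)
    q, k, v = attn(torch.randn(2, 16, 512))
    print(q.shape)  # torch.Size([2, 8, 16, 64])
```

The single fused layer performs the same total arithmetic as three separate projections, but launches one GEMM kernel instead of three, which can reduce overhead on the GPU.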