Open maxxk opened 1 year ago
I'm looking into replacing CUDA 4-bit matrix multiplication with pure PyTorch from upstream GPTQ-for-LLaMa to enable finetuning on CPU in my local version.
I think the following code can be moved into `AutogradMatmul4bit`: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/ef255907e664cf727907954a7f19d50a00db6066/quant.py#L280-L287
Something like this:
```python
import torch


class AutogradMatmul4bit(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, qweight, scales, qzeros, wf, g_idx):
        ctx.save_for_backward(qweight, scales, qzeros, wf)
        ctx.g_idx = g_idx

        # Unpack eight 4-bit weights from each int32 of qweight (along dim 0).
        weight = torch.bitwise_right_shift(
            torch.unsqueeze(qweight, 1).expand(-1, 32 // 4, -1),
            wf.unsqueeze(-1),
        ).to(torch.int8)
        torch.bitwise_and(weight, (2 ** 4) - 1, out=weight)
        weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])

        # Unpack the zero points the same way (along dim 1).
        zeros = torch.bitwise_right_shift(
            torch.unsqueeze(qzeros, 2).expand(-1, -1, 32 // 4),
            wf.unsqueeze(0),
        ).to(torch.int8)
        torch.bitwise_and(zeros, (2 ** 4) - 1, out=zeros)
        zeros = zeros + 1
        zeros = zeros.reshape(zeros.shape[0], zeros.shape[1] * zeros.shape[2])

        # Dequantize per group (g_idx maps each input row to its group), then matmul.
        weights = scales[g_idx] * (weight - zeros[g_idx])
        output = torch.matmul(x, weights.to(x.dtype))
        return output
```
and the same for `backward`, with the weights transposed. Is it worth trying, or am I missing something that would prevent it from working?
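For reference, here is a minimal sketch of how the forward/backward pair could look in pure PyTorch. It is not the upstream implementation: the names `dequantize_4bit` and `Matmul4bitSketch` are made up for illustration, and it assumes the packing layout implied by the unpacking code (weights packed along dim 0, zero points along dim 1, `wf` holding the shifts 0, 4, ..., 28, and `g_idx` mapping each input row to its group). Only the gradient with respect to `x` is computed; the quantized tensors are treated as frozen.

```python
import torch

BITS = 4
PER_INT32 = 32 // BITS  # eight 4-bit values per int32


def dequantize_4bit(qweight, scales, qzeros, wf, g_idx):
    """Unpack 4-bit weights into a dense (in_features, out_features) tensor.

    Hypothetical helper; the packing layout is an assumption (see above).
    """
    # (in/8, out) -> (in/8, 8, out): shift each int32 by 0, 4, ..., 28 bits.
    weight = torch.bitwise_right_shift(
        qweight.unsqueeze(1).expand(-1, PER_INT32, -1), wf.unsqueeze(-1)
    ).to(torch.int8)
    weight = torch.bitwise_and(weight, 2 ** BITS - 1)
    weight = weight.reshape(-1, weight.shape[2])

    # (groups, out/8) -> (groups, out/8, 8): same trick for the zero points.
    zeros = torch.bitwise_right_shift(
        qzeros.unsqueeze(2).expand(-1, -1, PER_INT32), wf.unsqueeze(0)
    ).to(torch.int8)
    zeros = torch.bitwise_and(zeros, 2 ** BITS - 1) + 1
    zeros = zeros.reshape(zeros.shape[0], -1)

    idx = g_idx.long()  # index tensors are safest as int64
    return scales[idx] * (weight - zeros[idx])


class Matmul4bitSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, qweight, scales, qzeros, wf, g_idx):
        ctx.save_for_backward(qweight, scales, qzeros, wf, g_idx)
        weights = dequantize_4bit(qweight, scales, qzeros, wf, g_idx)
        return torch.matmul(x, weights.to(x.dtype))

    @staticmethod
    def backward(ctx, grad_output):
        qweight, scales, qzeros, wf, g_idx = ctx.saved_tensors
        # Dequantize again instead of keeping the dense weights alive.
        weights = dequantize_4bit(qweight, scales, qzeros, wf, g_idx)
        # d(x @ W)/dx = grad_output @ W^T; the quantized inputs get no gradient.
        grad_x = torch.matmul(grad_output, weights.to(grad_output.dtype).t())
        return grad_x, None, None, None, None, None
```

Called as `Matmul4bitSketch.apply(x, qweight, scales, qzeros, wf, g_idx)`, this dequantizes the full weight matrix on every forward and every backward call. That trades memory for compute, which is probably where most of the CPU time would go.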
I think it would be very, very slow... but anyway, would add it for CPU support.