johnsmith0031 / alpaca_lora_4bit


CPU finetuning #43

Open maxxk opened 1 year ago

maxxk commented 1 year ago

I'm looking into replacing the CUDA 4-bit matrix multiplication with the pure-PyTorch implementation from upstream GPTQ-for-LLaMa, to enable finetuning on CPU in my local version.

I think the following code from upstream GPTQ-for-LLaMa (https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/ef255907e664cf727907954a7f19d50a00db6066/quant.py#L280-L287) can be moved into AutogradMatmul4bit like this:

import torch


class AutogradMatmul4bit(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, qweight, scales, qzeros, wf, g_idx):
        ctx.save_for_backward(qweight, scales, qzeros, wf)
        ctx.g_idx = g_idx

        # Unpack the 4-bit weights (8 values packed per int32) into int8 values in [0, 15]
        weight = torch.bitwise_right_shift(torch.unsqueeze(qweight, 1).expand(-1, 32 // 4, -1), wf.unsqueeze(-1)).to(torch.int8)
        torch.bitwise_and(weight, (2 ** 4) - 1, out=weight)

        # Unpack the per-group zero points the same way
        zeros = torch.bitwise_right_shift(torch.unsqueeze(qzeros, 2).expand(-1, -1, 32 // 4), wf.unsqueeze(0)).to(torch.int8)
        torch.bitwise_and(zeros, (2 ** 4) - 1, out=zeros)

        weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])

        zeros = zeros + 1
        zeros = zeros.reshape(zeros.shape[0], zeros.shape[1] * zeros.shape[2])

        # Dequantize group-wise: g_idx maps each input row to its quantization group
        weights = scales[g_idx.long()] * (weight - zeros[g_idx.long()])

        output = torch.matmul(x, weights.to(x.dtype))
        return output

and the same for backward, with a transposed matmul. Is it worth trying, or is there something I'm missing that will prevent it from working?
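
For reference, a rough, untested sketch of the backward I have in mind (the unpacking from forward is factored out into a helper I'm calling dequantize_4bit here, and only the gradient w.r.t. x is returned, since the quantized weights stay frozen):

def dequantize_4bit(qweight, scales, qzeros, wf, g_idx):
    # Same unpacking as in forward, factored out so backward can reuse it
    weight = torch.bitwise_right_shift(torch.unsqueeze(qweight, 1).expand(-1, 32 // 4, -1), wf.unsqueeze(-1)).to(torch.int8)
    torch.bitwise_and(weight, (2 ** 4) - 1, out=weight)
    zeros = torch.bitwise_right_shift(torch.unsqueeze(qzeros, 2).expand(-1, -1, 32 // 4), wf.unsqueeze(0)).to(torch.int8)
    torch.bitwise_and(zeros, (2 ** 4) - 1, out=zeros)
    weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])
    zeros = zeros + 1
    zeros = zeros.reshape(zeros.shape[0], zeros.shape[1] * zeros.shape[2])
    return scales[g_idx.long()] * (weight - zeros[g_idx.long()])

# ... and inside AutogradMatmul4bit:
    @staticmethod
    def backward(ctx, grad_output):
        qweight, scales, qzeros, wf = ctx.saved_tensors
        weights = dequantize_4bit(qweight, scales, qzeros, wf, ctx.g_idx)

        grad_input = None
        if ctx.needs_input_grad[0]:
            # Transpose of the forward matmul: dL/dx = dL/dy @ W^T
            grad_input = torch.matmul(grad_output, weights.to(grad_output.dtype).t())
        # qweight / scales / qzeros / wf / g_idx are frozen, so no gradients for them
        return grad_input, None, None, None, None, None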

johnsmith0031 commented 1 year ago

I think it would be very, very slow... but anyway, I would add it for CPU support
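
Something roughly like this for picking the path, maybe (class names and signatures here are just placeholders, not the actual code in the repo):

def matmul4bit_with_backend(x, qweight, scales, qzeros, wf, g_idx):
    # Keep the fast CUDA kernel when the weights live on GPU,
    # fall back to the (much slower) pure-PyTorch path from this thread on CPU.
    if qweight.is_cuda:
        return AutogradMatmul4bitCuda.apply(x, qweight, scales, qzeros, g_idx)  # existing kernel path (name/signature assumed)
    return AutogradMatmul4bit.apply(x, qweight, scales, qzeros, wf, g_idx)  # pure-PyTorch path above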