Closed mikekgfb closed 9 months ago
cc @mingfeima
@mikekgfb cool! We are just about to do something similar on the CPU device. We will add native support for int4 kernels on CPU, so that int4 quant will run smoothly on the CPU device; other dtypes (bf16, fp16) may just rely on MKL.
I'll remove it until we find out whether it buys us performance (I've seen additional improvements for CPU SDPA land from the Intel team since I did my experiments).
We'll also want to check what code is generated for the int8 data type; the performance drop is beyond what appears rational (5-6 QPS vs 2-ish QPS with int8).
Hi @mikekgfb, may I know your plan for landing this PR? We want to propose a follow-up PR for Intel GPU based on it. Thanks!
Waiting on a review, which is required to merge. I've addressed @Chillee's feedback. If he's not available, who else can review? @cpuhrsch @jisaacso?
Cool, this is merged :) I will wrap up the code and upstream the CPU backend optimization kernels to PyTorch soon.
Extend existing device variable to support code gen for other targets.
This PR adds a new command-line argument to generate.py to select a device: --device {'cpu', 'cuda'} (we have the option to add devices such as MPS in the future).
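A minimal sketch of how such a flag could be wired up with argparse (the `build_parser` helper and argument defaults here are illustrative, not necessarily what the PR does):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of generate.py's CLI with the new --device flag."""
    parser = argparse.ArgumentParser(description="Run text generation")
    parser.add_argument(
        "--device",
        type=str,
        default="cuda",
        choices=["cpu", "cuda"],  # targets such as "mps" could be appended later
        help="Device on which to run generation",
    )
    return parser

# Usage example: select the CPU backend explicitly.
args = build_parser().parse_args(["--device", "cpu"])
print(args.device)  # cpu
```

Using `choices` keeps validation in argparse itself, so an unsupported target fails fast with a usage message instead of surfacing later as a runtime device error.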