[WIP] Use DTensor-based tensor parallel

pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

BSD 3-Clause "New" or "Revised" License


Open kwen2501 opened 2 weeks ago

kwen2501 commented 2 weeks ago

Stack from ghstack (oldest at bottom):

Status:

Switched to DTensor-based TP in the regular (non-quantized) tensor path.
The result is correct, but there is a perf gap: it seems to perform extra collectives at the beginning (investigating).
TODO: switch the quantized path to DTensor as well.
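
For readers new to tensor parallel, here is a minimal single-process sketch of the sharding arithmetic that TP relies on. It is not the code in this PR: it assumes the usual pairing of a column-sharded first linear layer with a row-sharded second one (the pattern that `ColwiseParallel` and `RowwiseParallel` in `torch.distributed.tensor.parallel` express), and it simulates the ranks with plain Python lists. `WORLD_SIZE`, the helper functions, and the toy weights are all hypothetical.

```python
# Single-process simulation of tensor parallel for a two-layer MLP.
# "Ranks" are just list entries; no torch or process group is involved.

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise ReLU; safe to apply per shard because it is elementwise."""
    return [[max(0.0, v) for v in row] for row in m]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks (column-wise sharding)."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks (row-wise sharding)."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]

def all_reduce_sum(partials):
    """Elementwise sum of per-rank partial outputs (stands in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

WORLD_SIZE = 2  # hypothetical number of ranks

x = [[1.0, -2.0], [0.5, 3.0]]                           # (batch=2, dim=2)
w1 = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]     # (dim=2, hidden=4)
w2 = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]  # (hidden=4, dim=2)

# Unsharded reference: y = relu(x @ w1) @ w2
reference = matmul(relu(matmul(x, w1)), w2)

# Each rank holds a column block of w1 and the matching row block of w2.
w1_shards = split_cols(w1, WORLD_SIZE)
w2_shards = split_rows(w2, WORLD_SIZE)

# Each rank computes a partial output from its shards alone ...
partials = [
    matmul(relu(matmul(x, w1_shards[r])), w2_shards[r])
    for r in range(WORLD_SIZE)
]

# ... and a single sum across ranks recovers the full output.
tp_output = all_reduce_sum(partials)

print(tp_output == reference)  # True
```

In this pairing only one reduction is needed per MLP block, which is why any additional collectives show up as a perf gap.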