Open lethean1 opened 1 month ago
CS Drafting is built on the Hugging Face Transformers library, which handles inference parallelism. You can try specifying a `device_map` when loading the model before passing it to the `CountedCSDraftingDecoderModelKVCache` class.
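For readers following the suggestion above, here is a minimal sketch of loading a model with a `device_map` and inspecting where each module ended up. It assumes a decoder-only checkpoint loaded through `AutoModelForCausalLM` and that the `accelerate` package is installed. The helper names `load_sharded` and `modules_by_device` are hypothetical, and the constructor arguments of `CountedCSDraftingDecoderModelKVCache` are not shown in this thread, so the wrapping step is left as a comment.

```python
from collections import defaultdict


def load_sharded(model_name, device_map="auto"):
    """Load a causal LM with its modules placed across the visible devices.

    Assumption: the checkpoint is decoder-only and `accelerate` is installed.
    """
    # Imported lazily so modules_by_device() works without transformers.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)


def modules_by_device(hf_device_map):
    """Group module names by the device each one was placed on."""
    grouped = defaultdict(list)
    for module_name, device in hf_device_map.items():
        grouped[device].append(module_name)
    return dict(grouped)


# Usage sketch (not executed here; the model name is a placeholder):
#   model = load_sharded("<your-model-name>")
#   print(modules_by_device(model.hf_device_map))
#   # `model` would then be passed to CountedCSDraftingDecoderModelKVCache;
#   # check the CS Drafting source for the exact constructor arguments.
```

The printed mapping lists whole modules per device: with `device_map`, each layer lives on exactly one device and the forward pass moves from device to device, rather than splitting individual weight matrices across devices.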
It seems that specifying `device_map` only supports pipeline parallelism?
I want to use tensor parallelism with CS Drafting, but I cannot find a config option to enable it. Can you give me an example?