randaller / llama-chat

Chat with Meta's LLaMA models at home made easy
GNU General Public License v3.0
833 stars 118 forks

GPU vs CPU which one is best? #33

Open masterchop opened 1 year ago

masterchop commented 1 year ago

I have a question, but first, thank you for sharing this amazing project; it's great for starters. I have been using the chat bot, but when I run the code with CUDA on the 7B model it is super slow, I mean really, really bad. When I use the CPU instead, it works way better, almost in real time. The automapping feature is also very, very slow. PC: Intel Core i7 6th gen, 8 cores; Memory: 32GB; Video card: Nvidia 2070 8GB

Could someone explain to me why the GPU performs worse than the CPU and memory? Is it better to just get more memory to be able to run the 13B model, or should I get one of those Nvidia cards like a Tesla with 24GB? My testing tells me it's better to get memory instead of a very expensive GPU.

Thanks
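
For the "more memory vs. bigger GPU" question, a rough sizing rule is that a dense model's weight memory is parameter count times bytes per weight (activations and the KV cache add more on top). A minimal back-of-envelope sketch, with assumed parameter counts and precisions, not exact figures for these checkpoints:

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GiB, ignoring activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1024**3

# A ~7B-parameter model at common precisions:
print(round(weight_memory_gib(7e9, 16), 1))  # fp16 -> 13.0 GiB
print(round(weight_memory_gib(7e9, 8), 1))   # int8 -> 6.5 GiB
print(round(weight_memory_gib(7e9, 4), 1))   # int4 -> 3.3 GiB
```

By this estimate, a 7B model only fits entirely in an 8GB card when quantized; at fp16 it has to live in system RAM, which is consistent with the slow GPU behavior reported here if weights are being shuttled back and forth.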

nafets33 commented 10 months ago

this doesn't make sense to me, as I have seen the opposite: CPU is super slow and only GPU can truly speed things up. llama.cpp, however, is the faster CPU version, but my query times talking to my own embedded data are still 1-3 mins per query

buckleybrian commented 8 months ago

I have a question on memory (and I also observe a very slow response using a GPU). On my Dell Precision 7780 workstation laptop with the 7B model, I see the Nvidia GPU memory go up to about 3GB out of the available 6GB. I assume this is due to 4-bit quantization? BUT the CPU memory shoots up to almost 95% of my 32GB, so it appears it is shipping to both GPU and CPU at the same time? Looking at the code, I see these lines in model.py:

    for layer in tqdm(self.layers, desc="flayers", leave=True):
        if use_gpu:
            move_parameters_to_gpu(layer)
        h = layer(h, start_pos, freqs_cis, mask)
        if use_gpu:
            move_parameters_to_cpu(layer)

Why would it move parameters back to the CPU inside that `if use_gpu` branch?
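
That loop looks like a layer-offloading pattern: each layer's weights are copied to the GPU just before its forward pass and evicted right after, so peak GPU residency stays at roughly one layer instead of the whole model, at the cost of a host-to-device transfer per layer per token. A minimal pure-Python sketch of that idea (dummy `Layer` objects standing in for real modules, sizes assumed):

```python
class Layer:
    """Stand-in for a transformer layer; tracks which device holds its weights."""
    def __init__(self, size_gb: float):
        self.size_gb = size_gb
        self.device = "cpu"

def run_with_offload(layers):
    """Run layers one at a time on the 'gpu', returning peak GPU residency in GB."""
    peak = 0.0
    for layer in layers:
        layer.device = "gpu"   # like move_parameters_to_gpu(layer)
        resident = sum(l.size_gb for l in layers if l.device == "gpu")
        peak = max(peak, resident)
        # ... h = layer(h, start_pos, freqs_cis, mask) would run here ...
        layer.device = "cpu"   # like move_parameters_to_cpu(layer)
    return peak

layers = [Layer(0.4) for _ in range(32)]  # ~13 GB of weights in total
print(run_with_offload(layers))           # -> 0.4: only one layer resident at a time
```

If that is what model.py is doing, it would explain both observations above: only a few GB ever sit on the GPU, the full model stays in system RAM, and the per-layer transfers can make the GPU path slower than pure CPU on cards with limited VRAM or PCIe bandwidth.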