Thanks - the current GGML implementation "uses" the GPU during the prompt decode step but doesn't then use it for model forward passes, which, as you've observed, is not actually that helpful. I'm planning on trying to get the code from llama that does GPU offloading of actual layers working ASAP.
The new release provides full GPU offload support! Try this version and set `-e GPU_LAYERS=100` to attempt to load all layers into VRAM. I will warn you that we can currently only use one GPU, so turbopilot won't fully utilise both of your devices, but it should fill up the VRAM on the 3090.
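For reference, something along these lines should work. The image tag, model filename, and port below are illustrative guesses, not values from this thread, so check the README for your exact setup:

```bash
# Minimal sketch of running the CUDA build with full layer offload.
# The image tag, model path, and port are assumptions for illustration.
docker run --gpus all --rm -it \
  -v ./models:/models \
  -e MODEL="/models/your-model-q4_0.bin" \
  -e GPU_LAYERS=100 \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:latest-cuda11
```

Setting `GPU_LAYERS` higher than the model's actual layer count is fine; it just means every layer gets offloaded.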
thank you ^^
It does not use any GPU memory and is extremely slow, only using the CPU. The 3090's 24 GB of VRAM is enough for the entire model to be loaded into it.