ravenscroftj / turbopilot

Turbopilot is an open source large-language-model based code completion engine that runs locally on CPU
BSD 3-Clause "New" or "Revised" License

docker turbopilot:v0.1.0-cuda12 not using gpu #49

Closed archenovalis closed 1 year ago

archenovalis commented 1 year ago
docker run --gpus=all --rm -it \
  -v /home/ubuntu/LLM-Models:/models \
  -e MODEL_TYPE=wizardcoder \
  -e MODEL="/models/coding-WizardCoder-15B-1.0.GGMLv3.q8_0/WizardCoder-15B-1.0.ggmlv3.q8_0.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:v0.1.0-cuda12

[2023-08-14 00:09:17.888] [info] Initializing Starcoder/Wizardcoder type model for 'wizardcoder' model type
[2023-08-14 00:09:17.888] [info] Attempt to load model from wizardcoder
load_model: loading model from '/models/coding-WizardCoder-15B-1.0.GGMLv3.q8_0/WizardCoder-15B-1.0.ggmlv3.q8_0.bin'
load_model: n_vocab = 49153
load_model: n_ctx   = 8192
load_model: n_embd  = 6144
load_model: n_head  = 48
load_model: n_layer = 40
load_model: ftype   = 2007
load_model: qntvr   = 2
load_model: ggml ctx size = 34536.48 MB
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
load_model: memory size = 15360.00 MB, n_mem = 327680
load_model: model size  = 19176.25 MB
[2023-08-14 00:09:39.728] [info] Loaded model in 21839.60ms
(2023-08-14 00:09:40) [INFO    ] Crow/1.0 server is running at http://0.0.0.0:18080 using 32 threads
(2023-08-14 00:09:40) [INFO    ] Call `app.loglevel(crow::LogLevel::Warning)` to hide Info level logs.

It doesn't use any GPU memory and is extremely slow, only using the CPU. The 3090 with 24 GB of VRAM is enough for the entire model to be loaded onto it.
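
A quick way to confirm this, assuming the NVIDIA drivers are installed on the host, is to watch GPU memory usage while a completion request is in flight:

watch -n 1 nvidia-smi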

ravenscroftj commented 1 year ago

Thanks - the current GGML implementation "uses" the GPU during the prompt decode step, but doesn't then use it for the model's forward passes, which, as you've observed, is not actually that helpful. I'm planning to try to get the code from llama that does GPU offloading of actual layers working ASAP.

ravenscroftj commented 1 year ago

The new release provides full GPU offload support! Try this version and set -e GPU_LAYERS=100 to attempt to load all layers into VRAM. I will warn you that currently we can only use 1 GPU, so turbopilot won't fully utilise both of your devices, but it should fill up the VRAM on the 3090.
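
For reference, a sketch of how your original command might look with the new flag added - the image tag below is a placeholder, since it depends on which release you pull:

docker run --gpus=all --rm -it \
  -v /home/ubuntu/LLM-Models:/models \
  -e MODEL_TYPE=wizardcoder \
  -e MODEL="/models/coding-WizardCoder-15B-1.0.GGMLv3.q8_0/WizardCoder-15B-1.0.ggmlv3.q8_0.bin" \
  -e GPU_LAYERS=100 \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:<new-release-tag>-cuda12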

archenovalis commented 1 year ago

thank you ^^