liltom-eth / llama2-webui

Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
MIT License

Very slow generation #83

Open jaslatendresse opened 11 months ago

jaslatendresse commented 11 months ago

I am running this on a Mac M1 with 16GB RAM, using app.py for simple text generation. Running llama.cpp from the terminal is much faster, but when I use the backend through app.py it is very slow. Any ideas?

arnaudberenbaum commented 11 months ago

Hello! I was in the same situation and found the solution:

  1. First, check whether your Python env is configured for arm64 and not x86: `python -c "import platform; print(platform.platform())"` should return something like: `macOS-14.2.1-arm64-arm-64bit`

  2. If it's not, you need to create a new env (I'm using Conda): `CONDA_SUBDIR=osx-arm64 conda create -n your_env python=the_version_you_want`

  3. Clone the GitHub repo and install the package llama2-wrapper: `python -m pip install llama2-wrapper`

  4. Then reinstall the llama-cpp-python package for arm64 with Metal enabled:
     `python -m pip uninstall llama-cpp-python -y`
     `CMAKE_ARGS="-DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache-dir`
     `python -m pip install 'llama-cpp-python[server]'`

  5. When it's done, you need to modify the file "~/llama2-webui/llama2_wrapper/model.py":

    • in the function `create_llama2_model` (line 118), you need to add the param `n_gpu_layers=-1`:
      @classmethod
      def create_llama2_model(
          cls, model_path, backend_type, max_tokens, load_in_8bit, verbose
      ):
          if backend_type is BackendType.LLAMA_CPP:
              from llama_cpp import Llama

              model = Llama(
                  model_path=model_path,
                  n_ctx=max_tokens,
                  n_batch=max_tokens,
                  verbose=verbose,
                  n_gpu_layers=-1,  # I added this line to force the model to run on the ARM GPU (Metal)
              )
          # ... rest of the function unchanged
  6. Profit! Generation should now be fast (on a MacBook Pro M1 Pro with 16 GB of memory, it went from 1 token every 2 seconds to about 10 tokens per second!). A quick standalone check of the Metal offload is sketched below.
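
For reference, here is a minimal sketch to verify the Metal offload outside the web UI, assuming llama-cpp-python was reinstalled as in step 4. The model path is a placeholder, so point it at whatever GGUF/GGML file you actually downloaded. With `verbose=True`, the load-time log should mention Metal and show layers being offloaded to the GPU.

    # check_metal.py -- quick sanity check (model path is hypothetical, adjust to your file)
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder: use your own model path
        n_ctx=4000,
        n_gpu_layers=-1,  # offload all layers to the Apple Silicon GPU via Metal
        verbose=True,     # prints Metal init and layer offload info at load time
    )

    output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(output["choices"][0]["text"])

If the log shows no Metal lines and generation is still slow, the env is most likely still x86 (step 1).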

Hope it helped! :)