rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0
6.06k stars · 351 forks

Support ggml metal backend #299

Closed: pixelspark closed this 1 year ago

pixelspark commented 1 year ago

Support for Metal GPU acceleration on macOS (and, I assume, iOS) was just merged into llama.cpp master: https://github.com/ggerganov/llama.cpp/pull/1642

It would be great if this could also be used from llm. I assume all it needs is something similar to https://github.com/rustformers/llm/pull/282/files, plus perhaps a flag set at runtime (llama.cpp's ./main enables Metal when the -ngl 1 parameter is passed).

darxkies commented 1 year ago

I could update the PR to support Metal. Could you test it together with the other backends (cuBLAS/CLBlast)?

pixelspark commented 1 year ago

Definitely! I am currently traveling, but I will have access to my M1 Max machine again tomorrow.

darxkies commented 1 year ago

Unfortunately, ggerganov/ggml does not support Metal yet, so it cannot be enabled in rustformers/llm yet. Nevertheless, it would be great if you could test the PR's cuBLAS/CLBlast functionality.

pixelspark commented 1 year ago

Hm, the Metal bits should be upstreamed to ggerganov/ggml soon, right? (As far as I understand, the implementation is wholly contained in two files.)

I quickly tested CLBlast with llama.cpp itself on my work laptop, but its AMD Radeon GPU appears too weak for it to work. I will attempt a test with cuBLAS later on a beefier machine with NVIDIA hardware.

darxkies commented 1 year ago

Soon is very relative. Some files, ggml-opencl and ggml-cuda for example, haven't been touched in weeks.

darxkies commented 1 year ago

I've updated the BLAS PR to support Metal too. You can use it like so:

cargo run --release --features metal mpt infer --model-path mpt-7b-chat-q5_1.bin -p "Once upon a time"

Could you please test it out?

pixelspark commented 1 year ago

Nice 👍🏻 I am seeing the following errors at build time:

error: environment variable `CUDA_PATH` not defined at compile time
   --> crates/ggml/sys/build.rs:126:39
    |
126 |         let targets_include = concat!(env!("CUDA_PATH"), r"\include");
    |                                       ^^^^^^^^^^^^^^^^^
    |
    = help: use `std::env::var("CUDA_PATH")` to read the variable at run time
    = note: this error originates in the macro `env` (in Nightly builds, run with -Z macro-backtrace for more info)

error: environment variable `CUDA_PATH` not defined at compile time
   --> crates/ggml/sys/build.rs:127:35
    |
127 |         let targets_lib = concat!(env!("CUDA_PATH"), r"\lib\x64");
    |                                   ^^^^^^^^^^^^^^^^^
    |
    = help: use `std::env::var("CUDA_PATH")` to read the variable at run time
    = note: this error originates in the macro `env` (in Nightly builds, run with -Z macro-backtrace for more info)

error: environment variable `CUDA_PATH` not defined at compile time
   --> crates/ggml/sys/build.rs:171:39
    |
171 |         let targets_include = concat!(env!("CUDA_PATH"), "/targets/x86_64-linux/include");
    |                                       ^^^^^^^^^^^^^^^^^
    |
    = help: use `std::env::var("CUDA_PATH")` to read the variable at run time
    = note: this error originates in the macro `env` (in Nightly builds, run with -Z macro-backtrace for more info)

error: environment variable `CUDA_PATH` not defined at compile time
   --> crates/ggml/sys/build.rs:172:35
    |
172 |         let targets_lib = concat!(env!("CUDA_PATH"), "/targets/x86_64-linux/lib");
    |                                   ^^^^^^^^^^^^^^^^^
    |
    = help: use `std::env::var("CUDA_PATH")` to read the variable at run time
    = note: this error originates in the macro `env` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `ggml-sys` (build script) due to 4 previous errors
warning: build failed, waiting for other jobs to finish...

The build finishes when I prepend CUDA_PATH=x to the command. However, when I then run the suggested command for testing, it still uses the CPU. You may need to invoke GGML in a specific way to run on Metal; see ggml-metal.m. Effectively, you should do what llama.cpp's main executable does when you pass -ngl 1. That parameter sets the number of layers to offload to the GPU; for Metal, a value of 1 or higher apparently runs everything on the GPU.
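
For the CUDA_PATH part, the compiler hint points at the likely fix: read the variable in the build script at run time via std::env::var instead of at compile time via env!. A minimal sketch of what I mean (the graceful skip when the variable is unset is my assumption, not necessarily what the PR should do):

// crates/ggml/sys/build.rs (sketch)
fn cuda_paths() -> Option<(String, String)> {
    // Read CUDA_PATH when the build script runs, not when it is compiled.
    let cuda_path = std::env::var("CUDA_PATH").ok()?;
    // Linux layout, matching the paths in the errors above.
    let include = format!("{}/targets/x86_64-linux/include", cuda_path);
    let lib = format!("{}/targets/x86_64-linux/lib", cuda_path);
    Some((include, lib))
}

Returning None when CUDA_PATH is unset would let non-CUDA builds (such as this Metal one) proceed without defining the variable.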

darxkies commented 1 year ago

The offloading part is not there yet. All the code does so far is expose the API required to leverage that functionality in the near future.

Nevertheless, you should be able to see an increase in VRAM usage if Metal is really enabled. It should also process lengthy prompts much faster.

Could you check that? And could you also check if CLBlast exhibits the same behavior as Metal?

pixelspark commented 1 year ago

Running the suggested command does not appear to use the GPU (I see around 800% CPU usage, which is expected for a CPU run, and the GPU monitor shows very little activity).

Note that the generated binary also doesn't appear to link to Metal in any way:

[screenshot: linked libraries of the generated binary]

Also the generated binary does not contain the string 'metal'. It appears nothing really changes when --features metal is passed? (I am on 022a075608c5b90c54946ad01a204a63c54657cf).

pixelspark commented 1 year ago

As for CLBlast: the same missing-CUDA_PATH issue at first, and then with CUDA_PATH=x cargo build --verbose --features clblast --release (I ran brew install clblast beforehand):

The following warnings were emitted during compilation:

warning: llama-cpp/ggml-opencl.cpp:10:10: fatal error: 'clblast.h' file not found
warning: #include <clblast.h>
warning:          ^~~~~~~~~~~
warning: 1 error generated.

error: failed to run custom build command for `ggml-sys v0.2.0-dev (/Users/tommy/Repos/llm/crates/ggml/sys)`

Caused by:
  process didn't exit successfully: `/Users/tommy/Repos/llm/target/release/build/ggml-sys-2d4c323b8b047b52/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=llama-cpp
  OPT_LEVEL = Some("3")
  TARGET = Some("aarch64-apple-darwin")
  HOST = Some("aarch64-apple-darwin")
  cargo:rerun-if-env-changed=CC_aarch64-apple-darwin
  CC_aarch64-apple-darwin = None
  cargo:rerun-if-env-changed=CC_aarch64_apple_darwin
  CC_aarch64_apple_darwin = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CFLAGS_aarch64-apple-darwin
  CFLAGS_aarch64-apple-darwin = None
  cargo:rerun-if-env-changed=CFLAGS_aarch64_apple_darwin
  CFLAGS_aarch64_apple_darwin = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("false")
  CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
  cargo:rustc-link-lib=clblast
  cargo:rustc-link-lib=OpenCL
  cargo:rustc-link-lib=framework=Accelerate
  running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "llama-cpp" "-DGGML_USE_CLBLAST" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/Users/tommy/Repos/llm/target/release/build/ggml-sys-ae54095a50ed1651/out/llama-cpp/ggml.o" "-c" "llama-cpp/ggml.c"
  running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "llama-cpp" "-DGGML_USE_CLBLAST" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/Users/tommy/Repos/llm/target/release/build/ggml-sys-ae54095a50ed1651/out/llama-cpp/ggml-opencl.o" "-c" "llama-cpp/ggml-opencl.cpp"
  cargo:warning=llama-cpp/ggml-opencl.cpp:10:10: fatal error: 'clblast.h' file not found
  cargo:warning=#include <clblast.h>
  cargo:warning=         ^~~~~~~~~~~
  cargo:warning=1 error generated.
  exit status: 1
  exit status: 0

  --- stderr

  error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "llama-cpp" "-DGGML_USE_CLBLAST" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/Users/tommy/Repos/llm/target/release/build/ggml-sys-ae54095a50ed1651/out/llama-cpp/ggml-opencl.o" "-c" "llama-cpp/ggml-opencl.cpp" with args "cc" did not execute successfully (status code exit status: 1).

darxkies commented 1 year ago

The CUDA_PATH error should be gone with the last commit.

It seems you don't have clblast installed.

Can you check if ggml-metal.o in target was generated?

pixelspark commented 1 year ago

> The CUDA_PATH error should be gone with the last commit.

Yes 👍🏻

> It seems you don't have clblast installed.

Well, it seems it cannot find it for some reason... this might just be my machine.

> Can you check if ggml-metal.o in target was generated?

Unfortunately:

rm -rf target/release
cargo build --release --features=metal
find ./target/release | grep metal

Does not find anything 😢

darxkies commented 1 year ago

I disabled a couple of things in the last commit that might have caused issues. Can you please try again?

pixelspark commented 1 year ago

That builds a ggml-metal.o:

tommy@tymax-2 llm % find ./target/release | grep metal    
./target/release/build/ggml-sys-1cf394792f1e4b44/out/llama-cpp/ggml-metal.o

Also the binary now links Metal:

tommy@tymax-2 llm % otool -L target/release/llm
target/release/llm:
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1500.65.0)
    /System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 60420.101.2)
    /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1971.0.0)
    /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 1971.0.0)
    /System/Library/Frameworks/Metal.framework/Versions/A/Metal (compatibility version 1.0.0, current version 306.5.16)
    /System/Library/Frameworks/MetalKit.framework/Versions/A/MetalKit (compatibility version 1.0.0, current version 157.0.0)
    /System/Library/Frameworks/MetalPerformanceShaders.framework/Versions/A/MetalPerformanceShaders (compatibility version 1.0.0, current version 126.3.5)
    /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)

All inference is still happening on the CPU though.

darxkies commented 1 year ago

I am still trying to figure out why the Metal code was not compiled earlier. I made some changes to restore the functionality I had previously removed. Can you please let me know if ggml-metal still compiles successfully?

And I now know why you noticed no improvements at all: when ggml_init is called by llm, cuBLAS and CLBlast are initialized if enabled. That is not the case for Metal. Yet.

The first step is to enable Metal in ggml so that it can be used by llm.

pixelspark commented 1 year ago

Still builds fine!

So what would be needed now to add Metal support? I suspect most calls will be similar; we just need something like if cfg!(feature = "metal") && use_metal { .. } else { .. } around these (ggml_init -> ggml_metal_init, ggml_graph_compute -> ggml_metal_graph_compute)?
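
Roughly something like this, I imagine (purely a sketch: use_metal, metal_ptr, and the wrapper types here are hypothetical names, though ggml_metal_graph_compute does exist in llama.cpp's ggml-metal.h):

// Hypothetical dispatch inside llm's ggml wrapper; names are illustrative.
pub fn graph_compute(ctx: &mut Context, graph: &mut ComputationGraph, use_metal: bool) {
    #[cfg(feature = "metal")]
    {
        if use_metal {
            // Offload the whole graph to the GPU via the Metal backend.
            unsafe { ggml_sys::ggml_metal_graph_compute(ctx.metal_ptr(), graph.as_mut_ptr()) };
            return;
        }
    }
    // Default CPU path.
    unsafe { ggml_sys::ggml_graph_compute(ctx.as_ptr(), graph.as_mut_ptr()) };
}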

darxkies commented 1 year ago

Nice! Thank you.

I only took a brief look at llama.cpp, and while I am not very familiar with either llama.cpp or llm, I would say the context in llm needs to be expanded to hold the structure returned by ggml_metal_init, and some of the ggml_X APIs need to be replaced with their Metal counterparts, as you mentioned earlier.
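
Something along these lines, I would guess (every name here is illustrative, not the actual llm API):

// Sketch: the ggml Context grows an optional Metal context created by ggml_metal_init.
pub struct Context {
    ptr: std::ptr::NonNull<ggml_sys::ggml_context>,
    #[cfg(feature = "metal")]
    metal: Option<MetalContext>,
}

#[cfg(feature = "metal")]
pub struct MetalContext {
    ptr: std::ptr::NonNull<ggml_sys::ggml_metal_context>,
}

#[cfg(feature = "metal")]
impl Drop for MetalContext {
    fn drop(&mut self) {
        // Pair ggml_metal_init with ggml_metal_free, like llama.cpp does.
        unsafe { ggml_sys::ggml_metal_free(self.ptr.as_ptr()) };
    }
}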

pixelspark commented 1 year ago

OK, I might have a go at that if I can find the time. This is probably not so easy for you to do without access to a machine to test on. Having the build sorted out is a great first step, thanks!

radu-matei commented 1 year ago

FYI, the latest commit from the PR stops linking the Metal library properly. Resetting to 8666654c0ff641badbdb9cdfe1c11462abb0d171 links it again (but, as @pixelspark mentioned, inference still happens on the CPU).

darxkies commented 1 year ago

@pixelspark Can you confirm that it does not work anymore?

pixelspark commented 1 year ago

Will test tonight. (Not sure if it is related, but the latest master branch does not build on Linux with GCC < 8 either. That, however, seems to be an issue in GGML which is already being fixed: https://github.com/ggerganov/llama.cpp/issues/1279)

darxkies commented 1 year ago

Which distribution is it?

pixelspark commented 1 year ago

> Which distribution is it?

Just some old Ubuntu. I should be able to work around this using Docker.

> @pixelspark Can you confirm that it does not work anymore?

I checked out 0d8810058f51e1f9a6e575e0976d5fd00799f124, and that builds and works just fine on macOS...

pixelspark commented 1 year ago

@darxkies Wait a minute, I am actually getting this (after adding some basic code to construct a Metal context):

  = note: Undefined symbols for architecture arm64:
            "_ggml_metal_free", referenced from:
                _$LT$ggml..metal..MetalContext$u20$as$u20$core..ops..drop..Drop$GT$::drop::hcbede1209645a6ad in libggml-f2423c67974c335d.rlib(ggml-f2423c67974c335d.ggml.301d3698-cgu.8.rcgu.o)
            "_ggml_metal_init", referenced from:
                ggml::context::Context::init::h91b873fed7f42fb9 in libggml-f2423c67974c335d.rlib(ggml-f2423c67974c335d.ggml.301d3698-cgu.0.rcgu.o)
          ld: symbol(s) not found for architecture arm64
          clang: error: linker command failed with exit code 1 (use -v to see invocation)

My commit on top of yours: https://github.com/pixelspark/llm/tree/pedal-to-the-metal

Edit: the fix is trivial, see https://github.com/pixelspark/llm/commit/b647e5c16da2381e561516d56e81785cf4bb2d23
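
For anyone following along, the essence of the fix is making sure ggml-metal.m actually gets compiled and its symbols linked when the metal feature is enabled. A rough sketch of the idea in build.rs (not the literal commit; note that build scripts detect enabled features via CARGO_FEATURE_* environment variables, not cfg!):

// build.rs sketch: compile ggml-metal.m and link the Metal frameworks.
fn maybe_build_metal() {
    // Cargo exposes enabled features to build scripts as env vars.
    if std::env::var("CARGO_FEATURE_METAL").is_err() {
        return;
    }
    cc::Build::new()
        .file("llama-cpp/ggml-metal.m")
        .include("llama-cpp")
        .define("GGML_USE_METAL", None)
        .compile("ggml-metal");
    // Frameworks matching the otool -L output earlier in the thread.
    println!("cargo:rustc-link-lib=framework=Metal");
    println!("cargo:rustc-link-lib=framework=MetalKit");
    println!("cargo:rustc-link-lib=framework=MetalPerformanceShaders");
}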

darxkies commented 1 year ago

@LLukas22 Does it make sense to include that fix in my PR?

LLukas22 commented 1 year ago

@darxkies Yeah, include it; we probably want to merge the build stuff first and then add the actual implementations after that.

darxkies commented 1 year ago

@pixelspark Thank you for the fix. I added it to my PR.

philpax commented 1 year ago

Would you say this is done with your PR, @pixelspark?

pixelspark commented 1 year ago

Yes, obviously :-)

Though we still need to keep tracking ggml, as GPU support in general and the Metal implementation in particular are still in flux. @LLukas22 keeps an eye on this, I presume.