srush / llama2.rs

A fast llama2 decoder in pure Rust.
MIT License

Build script #14

Closed (rachtsingh closed 10 months ago)

rachtsingh commented 10 months ago

This one is maybe just an idea - Cargo doesn't seem to support key/value features, but that's the best way to let crates that depend on this choose the model size. So here I set Cargo features (e.g. "7B") and then the build.rs converts that to a k/v pair (model_size = "7B"). I'm not sure the conversion is necessary but I didn't want to mess too much with your existing setup.
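A minimal sketch of the feature-to-cfg conversion described above, not the PR's actual build.rs. It relies on Cargo exposing each enabled feature to build scripts as a `CARGO_FEATURE_<NAME>` environment variable. The feature list and the helper name `model_size_from` are assumptions made for illustration.

```rust
// build.rs (sketch): turn a Cargo feature such as "7B" into a key/value cfg.
use std::env;

/// Hypothetical list of model-size features the crate might declare in Cargo.toml.
const MODEL_SIZES: [&str; 3] = ["7B", "13B", "70B"];

/// Returns the first model size whose feature variable is reported as set.
/// `is_set` receives names of the form `CARGO_FEATURE_<NAME>`.
fn model_size_from<F>(is_set: F) -> Option<&'static str>
where
    F: Fn(&str) -> bool,
{
    MODEL_SIZES
        .iter()
        .copied()
        .find(|size| is_set(&format!("CARGO_FEATURE_{}", size.to_uppercase())))
}

fn main() {
    if let Some(size) = model_size_from(|key| env::var_os(key).is_some()) {
        // Lets the crate write `#[cfg(model_size = "7B")]` and similar.
        println!("cargo:rustc-cfg=model_size=\"{size}\"");
    }
}
```

If several size features are enabled at once, this sketch silently picks the first one in the list; a real script might prefer to fail the build instead.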

Separately, this sets AVX512 as an enable-able feature that errors if the host machine doesn't support AVX512. My home computer doesn't have it, so I haven't checked it.

srush commented 10 months ago

Oh this is neat. Let me find some time to learn about build.rs before I merge.

I'm not actually sure AVX512 adds anything though. I was reading about it and it seems like Rust prefers 256-bit SIMD.

srush commented 10 months ago

Okay, I think we should merge this with the following changes:

rachtsingh commented 10 months ago

  1. Sounds good, will do in a second.
  2. I think if you run cargo build --release, it'll use the build script by default (it should also be picked up by rust-analyzer when running in VS Code).
  3. Yeah, I think that looks like:

     cargo run --release -F 13B,quantized

     with the above script?
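For the crate side, here is a hedged sketch of how code could pick compile-time constants from a `model_size` cfg of the kind the build script sets. The module and constant names (`config`, `DIM`, `N_LAYERS`) are hypothetical and not taken from llama2.rs. The values are the published Llama 2 hidden sizes and layer counts. When no size is selected, the sketch falls back to 7B.

```rust
// Sketch: select model constants at compile time from a `model_size` cfg.
#[cfg(model_size = "70B")]
mod config {
    pub const DIM: usize = 8192;
    pub const N_LAYERS: usize = 80;
}

#[cfg(model_size = "13B")]
mod config {
    pub const DIM: usize = 5120;
    pub const N_LAYERS: usize = 40;
}

// Fallback when neither larger size is selected (assumed default: 7B).
#[cfg(not(any(model_size = "13B", model_size = "70B")))]
mod config {
    pub const DIM: usize = 4096;
    pub const N_LAYERS: usize = 32;
}

fn main() {
    println!("dim = {}, layers = {}", config::DIM, config::N_LAYERS);
}
```

Because the constants are fixed at compile time, buffers sized by them can be plain arrays, which is the usual reason for selecting the model size at build time.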

Let me push a commit that fixes 1 and also lets this build on a Mac M1 (it's very slow there, but will compile just fine). I'm not sure about the interface overall - maybe this would all be better using environment variables? i.e.

GROUP_SIZE=32 BITS=4 MODEL_SIZE=70B cargo build/run --release
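A rough sketch of what the environment-variable interface could look like in a build script, assuming the three variable names from the command above. The default values and the helper `directives` are illustrative assumptions, not something the PR defines.

```rust
// build.rs (sketch): read model settings from environment variables.
use std::env;

/// Variable names with hypothetical defaults, used when a variable is unset.
const SETTINGS: [(&str, &str); 3] = [("MODEL_SIZE", "7B"), ("BITS", "4"), ("GROUP_SIZE", "32")];

/// Builds the two Cargo directives emitted for one setting.
fn directives(name: &str, value: &str) -> [String; 2] {
    [
        // Re-run the build script whenever the variable changes.
        format!("cargo:rerun-if-env-changed={name}"),
        // Expose the value to the crate, readable with `env!("...")`.
        format!("cargo:rustc-env={name}={value}"),
    ]
}

fn main() {
    for (name, default) in SETTINGS {
        let value = env::var(name).unwrap_or_else(|_| default.to_string());
        for line in directives(name, &value) {
            println!("{line}");
        }
    }
}
```

Values passed this way reach the crate as strings, so the crate would still need to turn them into numbers or cfg flags itself.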

srush commented 10 months ago

Did you add any new targets for the M1? It looks like we need Neon somehow?

rachtsingh commented 10 months ago

So I think Neon gets added by default if you set target-cpu=native. According to this Stack Overflow post, you can see that via rustc --print=cfg -C target-cpu=native, which on my Mac mini has target_feature="neon".

Actually, I think AVX/AVX2/FMA are also turned on by default, since that shows up when I run the above command on my desktop (x86_64). Do you know if enabling those manually changes anything?
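One way to see the same information from inside a program is the `cfg!` macro, which reports the target features the compiler was allowed to use for this build. The sketch below only checks the four features mentioned in this thread; the function name is made up for illustration.

```rust
/// Lists which of a few SIMD target features were enabled at compile time.
fn enabled_simd_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    if cfg!(target_feature = "neon") {
        found.push("neon");
    }
    if cfg!(target_feature = "avx") {
        found.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        found.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        found.push("fma");
    }
    found
}

fn main() {
    println!("compiled with: {:?}", enabled_simd_features());
}
```

Building it once normally and once with RUSTFLAGS="-C target-cpu=native" should show whether the flag changes the list on a given machine.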

In the most recent commit I removed the AVX512 stuff since it seems like there isn't really a use case right now, and the code was getting messy (the x86 support check doesn't work on M1, so you have to hide that check behind a cfg check...).
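For reference, a sketch of the kind of cfg-guarded check being described, not the code that was removed. `std::is_x86_feature_detected!` only compiles for x86 and x86_64 targets, so a second definition is needed for everything else. The feature name `avx512` (seen by the build script as `CARGO_FEATURE_AVX512`) is an assumption.

```rust
// build.rs (sketch): fail the build if AVX-512 is requested but unavailable.
use std::env;

/// Runtime check for AVX-512 Foundation support on x86 hosts.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn host_supports_avx512() -> bool {
    std::is_x86_feature_detected!("avx512f")
}

/// The detection macro does not compile for other architectures
/// (e.g. aarch64 on an M1), so report "unsupported" there.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn host_supports_avx512() -> bool {
    false
}

fn main() {
    // Hypothetical feature name; Cargo sets this variable when it is enabled.
    let requested = env::var_os("CARGO_FEATURE_AVX512").is_some();
    if requested && !host_supports_avx512() {
        panic!("the avx512 feature was enabled, but this host does not support AVX-512");
    }
}
```

Since a build script runs on the machine doing the build, this tests the host rather than the eventual target, which matches the behaviour described earlier in the thread.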

I think to get this to actually work well on the M1 you need to link to Apple's Accelerate framework and their implementation of BLAS. I don't think anyone is really doing this, but it seems like a fun weekend hack. There might be an easier solution that's being missed because the NEON codegen is bad; I haven't really looked at it.
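As a starting point for that experiment, the linking half might look roughly like the build-script sketch below. It only emits the link directive for macOS targets; the helper name `link_directive` is hypothetical, and nothing here has been tried against llama2.rs.

```rust
// build.rs (sketch): link Apple's Accelerate framework on macOS targets.
use std::env;

/// Returns the Cargo link directive for the given target OS, if any.
fn link_directive(target_os: &str) -> Option<&'static str> {
    match target_os {
        "macos" => Some("cargo:rustc-link-lib=framework=Accelerate"),
        _ => None,
    }
}

fn main() {
    // Cargo sets CARGO_CFG_TARGET_OS for build scripts.
    let target_os = env::var("CARGO_CFG_TARGET_OS").unwrap_or_default();
    if let Some(line) = link_directive(&target_os) {
        println!("{line}");
    }
}
```

The crate would still have to declare the BLAS routines it wants (for example `cblas_sgemm`) in an `extern "C"` block and route its matrix multiplies through them, which is where most of the work would be.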