edfix opened 1 year ago
Thanks for reporting this. Would you be able to provide more details on how you got these values, ideally the full command line/code so that we can reproduce this on our side? That would make it much easier to investigate.
@LaurentMazare thanks for the quick reply! Here are the code and detailed commands. Small model = intfloat/e5-small-v2, code: benchmark-bert-small.py
maturin develop -r
python benchmark-bert-small.py
large model = BAAI/bge-large-zh-v1.5, code: benchmark-bert-large.py
python benchmark-bert-large.py
Pure Rust code: main.rs
cargo run --example bert --features accelerate --release -- --model-id BAAI/bge-large-zh-v1.5
Thanks for the repros. I didn't get a chance to look at it as I don't have proper internet connectivity, so I cannot check the bge-large model on my Mac; I'll do so when I'm back to having proper broadband. In the meantime, you can try running with the --tracing flag and then load the generated trace-...json file in the Chrome performance tab if you want to check where the time is being spent (otherwise I'll have a look at exactly this when properly back online).
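For reference, adding the flag to the same command used above should be something along the lines of:

cargo run --example bert --features accelerate --release -- --model-id BAAI/bge-large-zh-v1.5 --tracing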
@LaurentMazare I tried with the --tracing flag; the results are as follows:
It seems that the linear layer of BertIntermediate costs the most time, so I simulated the BertIntermediate linear layer with the different models. Small model = intfloat/e5-small-v2, bias=true, result:
cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=384 --out-features=1024 --bias
small model = intfloat/e5-small-v2, bias=false, result:
cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=384 --out-features=1024
large model = BAAI/bge-large-zh-v1.5, bias=true, result:
cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=1024 --out-features=4096 --bias
large model = BAAI/bge-large-zh-v1.5, bias=false, result:
cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=1024 --out-features=4096
For comparison with PyTorch, code: benchmark-linear.py, result:
python benchmark-linear.py
The above results show that adding the bias makes a large difference in the candle linear layer's timing.
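For context, a minimal sketch of such a linear-layer timing loop (assuming the candle-core/candle-nn APIs; this is not the exact benchmark-linear example referenced above, and the shapes mirror the large-model case) could look like:

use candle_core::{DType, Device, Result, Tensor};
use candle_nn::{Linear, Module};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let (num_tokens, in_features, out_features) = (512, 1024, 4096);
    // Weight/bias shaped like the BertIntermediate projection of the large model.
    let weight = Tensor::randn(0f32, 1f32, (out_features, in_features), &dev)?;
    let bias = Tensor::zeros((out_features,), DType::F32, &dev)?;
    let layer = Linear::new(weight, Some(bias));
    let xs = Tensor::randn(0f32, 1f32, (num_tokens, in_features), &dev)?;
    let start = std::time::Instant::now();
    for _ in 0..100 {
        let _ys = layer.forward(&xs)?;
    }
    println!("avg linear forward: {:?}", start.elapsed() / 100);
    Ok(())
}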
Thanks, that's a very interesting analysis and it's great to have some easy way to reproduce the slowness. I would certainly not have expected adding the bias to make such a difference. When using the accelerate backend, adding the bias should use the vDSP_vadd function over the hood, I would have hoped for this function to be well optimized but maybe it's not the case - in particular it may well be that this function uses a single core and we could add some multi-threading to it but Apple's documentation doesn't say much.
Edit: it indeed seems that vDSP_vadd is performing pretty poorly. On my MacBook Pro M2, when using --in-features=1024 --out-features=4096, I get ~3ms without bias, ~11ms with bias using the current vDSP_vadd version, and ~5ms after removing it and falling back to non-vectorized, single-threaded addition. I'll take a stab at adding neon-based addition and hopefully that will reduce the overhead further.
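For reference, the call being discussed is vDSP_vadd from Apple's Accelerate framework. A hedged sketch of a direct binding (macOS only, signature per Apple's vDSP documentation; not candle's actual accelerate backend code) looks like this:

use std::os::raw::{c_long, c_ulong};

#[allow(non_snake_case)]
#[link(name = "Accelerate", kind = "framework")]
extern "C" {
    // C[n] = A[n * stride_a] + B[n * stride_b] for n in 0..len.
    fn vDSP_vadd(
        a: *const f32, stride_a: c_long,
        b: *const f32, stride_b: c_long,
        c: *mut f32, stride_c: c_long,
        len: c_ulong,
    );
}

// Adds a fully materialized, contiguous bias slice into the matmul output in place.
fn add_bias(out: &mut [f32], bias: &[f32]) {
    assert_eq!(out.len(), bias.len());
    let lhs = out.to_vec();
    unsafe {
        vDSP_vadd(lhs.as_ptr(), 1, bias.as_ptr(), 1, out.as_mut_ptr(), 1, out.len() as c_ulong);
    }
}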
Ah, actually I dug a bit more into this and it turns out that the benchmark is probably not representative of the actual computation in bert: using zeros as the bias results in some inefficiency because the zero element is broadcast, so the vectorized op cannot apply properly. If instead you tweak your benchmark code, adding a .contiguous() to the bias so that the zeros are not broadcast anymore, the overhead of the bias becomes very small, as expected.
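// Materialize the zero bias as a full contiguous buffer so the vectorized add
// runs over a real array instead of a single broadcast zero.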
let bias = Some(Tensor::zeros((out_feature,), DType::F32, &Device::Cpu)?.contiguous()?);
I think the broadcast zeros have bitten us quite a few times in the past, so I will revert this to being a full array of zeros by default. It will be less efficient memory-wise but also a lot less error-prone.
Thanks for your correction! After further investigation, I think it is BERT's gelu_erf that makes it run slower. I replaced it with the tanh gelu and it runs faster, but is still slower than the Hugging Face transformer:
cargo run --example bert --features accelerate --release -- --model-id BAAI/bge-large-zh-v1.5
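For clarity, the two activation variants being compared are roughly the following (a minimal sketch, not candle's actual kernels; erff here comes from the libm crate):

fn gelu_erf(x: f32) -> f32 {
    // Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
    0.5 * x * (1.0 + libm::erff(x / std::f32::consts::SQRT_2))
}

fn gelu_tanh(x: f32) -> f32 {
    // Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}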
Then I benchmarked erf between candle and PyTorch. Candle Rust code: code
cargo run --example benchmark-gelu --features accelerate --release -- --num-tokens 512 --out-features=4096
PyTorch code: code
python benchmark-gelu.py
It shows that candle's erf implementation is much slower than PyTorch's.
Ah, good point, candle's erf is very inefficient: the approximation is computed in a very brute-force way, with no SIMD or multi-threading (and no accelerate or mkl kernels, if those were to exist). At least on the x86 side there are some AVX instructions that we could use to speed things up; I'm not sure whether something similar exists on M1/M2 though.
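As an illustration of the multi-threading point (a hedged sketch, not candle's actual kernel; it assumes the rayon and libm crates), even a simple parallel elementwise loop would spread the erf cost across cores:

use rayon::prelude::*;

// Apply erf elementwise in place, splitting the buffer into chunks that rayon
// schedules across the available cores.
fn erf_inplace(xs: &mut [f32]) {
    xs.par_chunks_mut(4096).for_each(|chunk| {
        for x in chunk.iter_mut() {
            *x = libm::erff(*x);
        }
    });
}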
I benchmarked computing sentence embeddings with the e5.py script. Small model = intfloat/e5-small-v2, result:
Larger model = BAAI/bge-large-zh-v1.5, result:
The pure Rust BERT example is still slower than the Hugging Face transformers implementation with the BAAI/bge-large-zh-v1.5 model.