huggingface / candle

Minimalist ML framework for Rust

BERT example is slower than huggingface transformer with larger model on M1 MacBook Pro #1062

Open edfix opened 1 year ago

edfix commented 1 year ago

I benchmarked computing sentence embeddings with the e5.py script. Small model = intfloat/e5-small-v2, result:

[screenshot: timing results]

Larger model = BAAI/bge-large-zh-v1.5, result:

[screenshot: timing results]

The pure Rust BERT example is still slower than the Hugging Face transformers version with the BAAI/bge-large-zh-v1.5 model.

LaurentMazare commented 1 year ago

Thanks for reporting this, would you be able to provide more details on how you got these values? Ideally the full command line/code so that we can reproduce this on our side? This would make it much easier to investigate.

edfix commented 1 year ago

@LaurentMazare thanks for the quick reply! Here are the code and the exact commands.

Small model = intfloat/e5-small-v2, code: benchmark-bert-small.py

maturin develop -r
python benchmark-bert-small.py

[screenshot: timing results]

Large model = BAAI/bge-large-zh-v1.5, code: benchmark-bert-large.py

python benchmark-bert-large.py

[screenshot: timing results]

Pure Rust code: main.rs

cargo run --example bert --features accelerate --release -- --model-id BAAI/bge-large-zh-v1.5

[screenshot: timing results]
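For context, the candle side of this measurement boils down to timing repeated forward passes of the BERT model. A minimal sketch (not the linked main.rs: model loading via hf-hub and VarBuilder is omitted, and the two-argument forward signature is assumed from the example as it was at the time):

```rust
use candle_core::{DType, Device, Result, Tensor};
use candle_transformers::models::bert::BertModel;
use std::time::Instant;

/// Time the forward pass of an already-loaded BertModel on a dummy batch.
/// The 512-token all-zeros input is just for illustration.
fn time_forward(model: &BertModel, device: &Device) -> Result<()> {
    let input_ids = Tensor::zeros((1, 512), DType::U32, device)?;
    let token_type_ids = input_ids.zeros_like()?;
    // Warmup runs so one-off allocation costs don't skew the numbers.
    for _ in 0..3 {
        model.forward(&input_ids, &token_type_ids)?;
    }
    let iters = 10u32;
    let start = Instant::now();
    for _ in 0..iters {
        model.forward(&input_ids, &token_type_ids)?;
    }
    println!("avg forward: {:?}", start.elapsed() / iters);
    Ok(())
}
```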
LaurentMazare commented 1 year ago

Thanks for the repros. I didn't get a chance to look at it yet as I don't have proper internet connectivity, so I cannot check the bge-large model on my Mac - I'll do so when I'm back on proper broadband. In the meantime, you can try running with the --tracing flag and then load the generated trace-...json file in the Chrome performance tab if you want to check where the time is being spent (otherwise I'll have a look at exactly this when properly back online).
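For reference, the --tracing flag in the candle examples is typically wired up roughly like this (a sketch from memory; the exact code in the example may differ slightly):

```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    let tracing_enabled = std::env::args().any(|a| a == "--tracing");
    // The guard must stay alive for the whole run; the trace-<timestamp>.json
    // file is flushed when it is dropped at the end of main.
    let _guard = if tracing_enabled {
        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
        tracing_subscriber::registry().with(chrome_layer).init();
        Some(guard)
    } else {
        None
    };
    // ... run the model as usual; spans recorded by candle show up in the trace.
}
```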

edfix commented 1 year ago

@LaurentMazare I tried with the --tracing flag; the result is as follows:

[screenshot: trace results]

It seems that the linear layer of BertIntermediate costs the most time, so I simulated the linear layer of BertIntermediate with the different models.

Small model = intfloat/e5-small-v2, bias=true, result:

cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=384 --out-features=1024 --bias

[screenshot: timing results]

Small model = intfloat/e5-small-v2, bias=false, result:

cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=384 --out-features=1024

[screenshot: timing results]

Large model = BAAI/bge-large-zh-v1.5, bias=true, result:

cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=1024 --out-features=4096 --bias

[screenshot: timing results]

Large model = BAAI/bge-large-zh-v1.5, bias=false, result:

cargo run --example benchmark-linear --features accelerate --release -- --num-tokens 512 --in-features=1024 --out-features=4096

[screenshot: timing results]
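The benchmark-linear example itself isn't shown here; a minimal candle sketch of the same experiment (a hypothetical stand-in, with the large-model sizes hard-coded, not the actual example code) could look like:

```rust
use candle_core::{DType, Device, Result, Tensor};
use candle_nn::{Linear, Module};
use std::time::Instant;

fn main() -> Result<()> {
    let device = Device::Cpu;
    let (num_tokens, in_features, out_features) = (512, 1024, 4096);
    let xs = Tensor::randn(0f32, 1f32, (num_tokens, in_features), &device)?;
    let ws = Tensor::randn(0f32, 1f32, (out_features, in_features), &device)?;
    for with_bias in [false, true] {
        let bias = if with_bias {
            Some(Tensor::zeros((out_features,), DType::F32, &device)?)
        } else {
            None
        };
        let linear = Linear::new(ws.clone(), bias);
        // Warmup, then average a handful of forward passes.
        for _ in 0..3 {
            linear.forward(&xs)?;
        }
        let iters = 20u32;
        let start = Instant::now();
        for _ in 0..iters {
            linear.forward(&xs)?;
        }
        println!("bias={with_bias}: {:?} per forward", start.elapsed() / iters);
    }
    Ok(())
}
```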

For comparison with PyTorch, code: benchmark-linear.py, result:

python benchmark-linear.py

[screenshot: timing results]

The above results show that:

  1. PyTorch's linear layer takes almost the same time with and without a bias.
  2. Candle's linear layer spends most of its time adding the bias.
LaurentMazare commented 1 year ago

Thanks, that's a very interesting analysis and it's great to have an easy way to reproduce the slowness. I would certainly not have expected adding the bias to make such a difference. When using the accelerate backend, adding the bias should use the vDSP_vadd function under the hood. I would have hoped for this function to be well optimized, but maybe that's not the case - in particular it may well be that this function uses a single core and we could add some multi-threading to it, but Apple's documentation doesn't say much.

edit: it indeed seems that vDSP_vadd is performing pretty poorly. On my MacBook Pro M2, when using --in-features=1024 --out-features=4096 I get ~3ms without bias, ~11ms with bias using the current vDSP_vadd version, and ~5ms when removing it and thus using non-vectorized, single-threaded addition. I'll take a stab at adding NEON-based addition and hopefully that will reduce the overhead further.
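For anyone curious, vDSP_vadd is just a strided element-wise float add. A minimal FFI sketch (assuming the standard Accelerate signature; this is not candle's actual binding):

```rust
// Sketch: bind and call Accelerate's vDSP_vadd from Rust (macOS only).
// vDSP_Stride/vDSP_Length are long/unsigned long, i.e. isize/usize on 64-bit macOS.
#[link(name = "Accelerate", kind = "framework")]
extern "C" {
    // C[i*ic] = A[i*ia] + B[i*ib] for i in 0..n
    fn vDSP_vadd(a: *const f32, ia: isize, b: *const f32, ib: isize, c: *mut f32, ic: isize, n: usize);
}

/// Element-wise add of two contiguous f32 slices via Accelerate.
fn vadd(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && b.len() == out.len());
    unsafe { vDSP_vadd(a.as_ptr(), 1, b.as_ptr(), 1, out.as_mut_ptr(), 1, a.len()) }
}

fn main() {
    let a = vec![1.0f32; 8];
    let b = vec![2.0f32; 8];
    let mut c = vec![0.0f32; 8];
    vadd(&a, &b, &mut c);
    assert!(c.iter().all(|&x| x == 3.0));
}
```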

LaurentMazare commented 1 year ago

Ah, actually I dug a bit more into this and it turns out that the benchmark is probably not representative of the actual computation in BERT: using zeros as the bias results in some inefficiency because the zero element is broadcast, so the vectorized op cannot apply properly. If instead you tweak your benchmark code, adding a .contiguous() to the bias so that the zeros are no longer broadcast, the overhead of the bias becomes very small, as expected.

 let bias = Some(Tensor::zeros((out_feature,), DType::F32, &Device::Cpu)?.contiguous()?);

I think broadcast zeros have bitten us quite a few times in the past, so I will revert this to being a full array of zeros by default. It will be less memory efficient but also a lot less error prone.
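To illustrate what the broadcast zeros look like versus the materialized version (a hypothetical illustration using candle tensor ops, not the code in question): a stride-0 view of a single zero has the right shape but is not contiguous, so the vectorized add path cannot be used on it directly.

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let out_features = 4096;
    // A single zero expanded to the bias shape: stride 0, no real storage
    // behind each element, not contiguous.
    let broadcast_bias =
        Tensor::zeros(1, DType::F32, &Device::Cpu)?.broadcast_as((out_features,))?;
    // .contiguous() materializes an actual array of out_features zeros.
    let dense_bias = broadcast_bias.contiguous()?;
    assert!(!broadcast_bias.is_contiguous());
    assert!(dense_bias.is_contiguous());
    Ok(())
}
```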

edfix commented 1 year ago

Thanks for your correction! After further investigation, I guess it is BERT's gelu_erf that makes it run slower. I replaced it with the tanh gelu approximation; it runs faster, but is still slower than the Hugging Face transformers version.

cargo run --example bert --features accelerate --release -- --model-id BAAI/bge-large-zh-v1.5

[screenshot: timing results]

Then I benchmarked erf in Candle versus PyTorch. Candle Rust code: code

cargo run --example benchmark-gelu --features accelerate --release -- --num-tokens 512 --out-features=4096

[screenshot: timing results]

PyTorch code: code

python benchmark-gelu.py

[screenshot: timing results]

It shows that Candle's erf implementation is much slower than PyTorch's.
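A rough way to reproduce the gap in candle directly is to time the exact erf-based GELU against the tanh approximation (a sketch assuming the Tensor::gelu_erf and Tensor::gelu ops; not the actual benchmark-gelu example):

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn main() -> Result<()> {
    let device = Device::Cpu;
    // Roughly the BertIntermediate activation size for the large model.
    let xs = Tensor::randn(0f32, 1f32, (512, 4096), &device)?;
    let iters = 50u32;

    // Exact GELU based on erf.
    for _ in 0..3 {
        xs.gelu_erf()?;
    }
    let start = Instant::now();
    for _ in 0..iters {
        xs.gelu_erf()?;
    }
    println!("gelu_erf: {:?} per call", start.elapsed() / iters);

    // Tanh-approximation GELU.
    for _ in 0..3 {
        xs.gelu()?;
    }
    let start = Instant::now();
    for _ in 0..iters {
        xs.gelu()?;
    }
    println!("gelu (tanh approx): {:?} per call", start.elapsed() / iters);
    Ok(())
}
```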

LaurentMazare commented 1 year ago

Ah, good point: candle's erf is very inefficient. The approximation is computed in a very brute-force way, with no SIMD or multi-threading (and no accelerate or mkl kernels, if those were to exist). At least on the x86 side there are some AVX instructions we could use to speed things up; I'm not sure whether something similar exists on M1/M2 though.