pbelcak / UltraFastBERT

The repository for the code of the UltraFastBERT paper

FFF-BERT seems to run slower than a vanilla BERT model #1

Closed · p-i- closed this issue 10 months ago

p-i- commented 10 months ago

Boyan and I performance-tested FFF-BERT (the HuggingFace model) against a vanilla BERT of similar size, and found that it runs roughly 15% slower on my M2 Mac.

https://gist.github.com/p-i-/355668983aaeee3f282977cdfb93017c
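For reference, the comparison boils down to timing forward passes of the two models. A minimal sketch of that kind of measurement (not the exact gist code; the FFF-BERT checkpoint name and the trust_remote_code flag are placeholders/assumptions):

import time
import torch
from transformers import AutoModel, AutoTokenizer

def mean_forward_time(model_name, n_iters=20, trust_remote_code=False):
    # Average wall-clock seconds per forward pass over a small fixed batch.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tokenizer(["a short test sentence"] * 32, return_tensors="pt", padding=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=trust_remote_code).eval()
    with torch.no_grad():
        model(**batch)                              # warm-up
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**batch)
        elapsed = time.perf_counter() - start
    return elapsed / n_iters

print("vanilla BERT:", mean_forward_time("bert-base-uncased"))
# The FFF-BERT checkpoint name below is a placeholder; the HuggingFace model
# ships custom modelling code, hence trust_remote_code=True.
# print("FFF-BERT:", mean_forward_time("<fff-bert-checkpoint>", trust_remote_code=True))
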

This seems surprising, as the benchmarks do indeed demonstrate a ~50x speedup for a single feed-forward layer:

#!/bin/bash
# Benchmark a naive feed-forward layer (ff_bmm) against a fast feed-forward
# layer (fff_bmm) at batch sizes 100, 10, and 1, with 10 iterations each, on CPU.

echo "🔸 Batch size 100"
echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 100  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 100  --n-iters 10  --device cpu

echo "🔸 Batch size 10"
echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 10  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 10  --n-iters 10  --device cpu

echo "🔸 Batch size 1"

echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 1  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 1  --n-iters 10  --device cpu
> . run.sh 
🔸 Batch size 100
naive FF (batch matmult)
eager: 1.3852830000000003
compile: 1.366022000000001
(eval) compiled: 1.3960490000000003 ± 0.03737091447636828
~~~~~~~~~~
FFF (batch matmult)
eager: 0.05451000000000006
compile: 0.018572000000000255
(eval) compiled: 0.01893820000000006 ± 0.0015569136006856079
~~~~~~~~~~
🔸 Batch size 10
naive FF (batch matmult)
eager: 0.141181
compile: 0.1446900000000002
(eval) compiled: 0.1389437 ± 0.0026585709714055
~~~~~~~~~~
FFF (batch matmult)
eager: 0.005520000000000191
compile: 0.001954999999999707
(eval) compiled: 0.002634200000000009 ± 0.0015433838667031883
~~~~~~~~~~
🔸 Batch size 1
naive FF (batch matmult)
eager: 0.01369599999999993
compile: 0.01478299999999999
(eval) compiled: 0.014860099999999932 ± 0.0014923330358871411
~~~~~~~~~~
FFF (batch matmult)
eager: 0.0005589999999999762
compile: 0.0005690000000000417
(eval) compiled: 0.0003634999999999167 ± 7.71248987033366e-05
~~~~~~~~~~

Speedups for batch sizes 100, 10, and 1:

In [1]: 1.3471607 / 0.019425799999999917, 0.14026139999999993 / 0.0023557000000000716, 0.014105299999999899 / 0.00033940000000001194
Out[1]: (69.34904611393127, 59.54128284586138, 41.5595167943412)

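For context on these layer-level numbers: an FFF of depth d evaluates only the d nodes on one root-to-leaf path per input, whereas a dense layer evaluates its entire hidden width. Below is a simplified per-sample sketch of that mechanism (toy dimensions and names; an illustration of the idea, not the repository's fff_bmm implementation):

import torch

def fff_forward(x, w_in, w_out, depth):
    # x: (input_width,); w_in: (n_nodes, input_width); w_out: (n_nodes, output_width).
    # Nodes form an implicit balanced binary tree: children of node i are 2i+1, 2i+2.
    y = torch.zeros(w_out.shape[1])
    node = 0
    for _ in range(depth):
        logit = x @ w_in[node]                    # one dot product per visited node
        y = y + torch.nn.functional.gelu(logit) * w_out[node]
        node = 2 * node + 1 + int(logit > 0)      # branch on the sign of the pre-activation
    return y

# Depth 8 touches 8 of the 2**8 - 1 = 255 nodes per input; a dense hidden layer
# of the same nominal width would evaluate all of them.
x = torch.randn(8000)
w_in = torch.randn(255, 8000) / 8000 ** 0.5
w_out = torch.randn(255, 8000) / 255 ** 0.5
print(fff_forward(x, w_in, w_out, depth=8).shape)   # torch.Size([8000])
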
pbelcak commented 10 months ago

Hi @p-i-,

As noted in the introduction section of the preprint, the model provided through HuggingFace only simulates the conditionality. If you look at the code, you will see that the FFF implementation backing the HuggingFace model computes all neurons and merely masks out the ones not used for a particular inference instance, rather than skipping their computation.

That is why you are not seeing any meaningful improvement :)
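For anyone else puzzled by this, here is roughly what "simulating the conditionality" means (a hypothetical sketch using the same toy parameterization as above, not the actual modelling code). The masked variant still performs the full dense matmul, so it does the same work as an ordinary feed-forward layer and cannot be faster:

import torch

def fff_masked_forward(x, w_in, w_out, depth):
    # "Simulated" conditionality: compute ALL node pre-activations densely,
    # then zero out every node that is not on the chosen root-to-leaf path.
    logits = x @ w_in.T                          # full (n_nodes,) matmul, no savings
    mask = torch.zeros_like(logits)
    node = 0
    for _ in range(depth):
        mask[node] = 1.0                         # keep only the nodes on the path
        node = 2 * node + 1 + int(logits[node] > 0)
    # Same FLOPs as a dense layer; the mask changes the result, not the cost,
    # which is why the HuggingFace model shows no wall-clock speedup.
    return (torch.nn.functional.gelu(logits) * mask) @ w_out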