pbelcak / UltraFastBERT

The repository for the code of the UltraFastBERT paper

FFF-BERT seems to run slower than a vanilla BERT model #1

Closed · p-i- closed this issue 10 months ago

p-i- commented 10 months ago

Boyan and I performance-tested FFF-BERT (the HuggingFace model) against a vanilla BERT of similar size, and found that it runs roughly 15% slower on my M2 Mac.

https://gist.github.com/p-i-/355668983aaeee3f282977cdfb93017c
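For reference, the comparison boils down to timing forward passes of the two models. A minimal sketch of that kind of measurement (not the exact gist code; the FFF-BERT checkpoint name and the trust_remote_code flag are placeholders/assumptions):

import time
import torch
from transformers import AutoModel, AutoTokenizer

def mean_forward_time(model_name, n_iters=20, trust_remote_code=False):
    # Average wall-clock seconds per forward pass over a small fixed batch.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tokenizer(["a short test sentence"] * 32, return_tensors="pt", padding=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=trust_remote_code).eval()
    with torch.no_grad():
        model(**batch)                              # warm-up
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**batch)
        elapsed = time.perf_counter() - start
    return elapsed / n_iters

print("vanilla BERT:", mean_forward_time("bert-base-uncased"))
# The FFF-BERT checkpoint name below is a placeholder; the HuggingFace model
# ships custom modelling code, hence trust_remote_code=True.
# print("FFF-BERT:", mean_forward_time("<fff-bert-checkpoint>", trust_remote_code=True))
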

This seems surprising, as the benchmarks do indeed demonstrate a ~50x speedup for a single feed-forward layer:

#!/bin/bash
# Benchmark a naive feed-forward layer (ff_bmm) against a fast feed-forward
# layer (fff_bmm) at batch sizes 100, 10, and 1, with 10 iterations each, on CPU.

echo "🔸 Batch size 100"
echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 100  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 100  --n-iters 10  --device cpu

echo "🔸 Batch size 10"
echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 10  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 10  --n-iters 10  --device cpu

echo "🔸 Batch size 1"

echo "naive FF (batch matmult)"
python main.py  --model ff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 1  --n-iters 10  --device cpu

echo "FFF (batch matmult)"
python main.py  --model fff_bmm  --input-width 8000  --hidden-width 4000  --output-width 8000  --depth 8  --batch-size 1  --n-iters 10  --device cpu
> . run.sh 
🔸 Batch size 100
naive FF (batch matmult)
eager: 1.3852830000000003
compile: 1.366022000000001
(eval) compiled: 1.3960490000000003 ± 0.03737091447636828
~~~~~~~~~~
FFF (batch matmult)
eager: 0.05451000000000006
compile: 0.018572000000000255
(eval) compiled: 0.01893820000000006 ± 0.0015569136006856079
~~~~~~~~~~
🔸 Batch size 10
naive FF (batch matmult)
eager: 0.141181
compile: 0.1446900000000002
(eval) compiled: 0.1389437 ± 0.0026585709714055
~~~~~~~~~~
FFF (batch matmult)
eager: 0.005520000000000191
compile: 0.001954999999999707
(eval) compiled: 0.002634200000000009 ± 0.0015433838667031883
~~~~~~~~~~
🔸 Batch size 1
naive FF (batch matmult)
eager: 0.01369599999999993
compile: 0.01478299999999999
(eval) compiled: 0.014860099999999932 ± 0.0014923330358871411
~~~~~~~~~~
FFF (batch matmult)
eager: 0.0005589999999999762
compile: 0.0005690000000000417
(eval) compiled: 0.0003634999999999167 ± 7.71248987033366e-05
~~~~~~~~~~

Speedups for batch sizes 100, 10, and 1:

In [1]: 1.3471607 / 0.019425799999999917, 0.14026139999999993 / 0.0023557000000000716, 0.014105299999999899 / 0.00033940000000001194
Out[1]: (69.34904611393127, 59.54128284586138, 41.5595167943412)

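For context on these layer-level numbers: an FFF of depth d evaluates only the d nodes on one root-to-leaf path per input, whereas a dense layer evaluates its entire hidden width. Below is a simplified per-sample sketch of that mechanism (toy dimensions and names; an illustration of the idea, not the repository's fff_bmm implementation):

import torch

def fff_forward(x, w_in, w_out, depth):
    # x: (input_width,); w_in: (n_nodes, input_width); w_out: (n_nodes, output_width).
    # Nodes form an implicit balanced binary tree: children of node i are 2i+1, 2i+2.
    y = torch.zeros(w_out.shape[1])
    node = 0
    for _ in range(depth):
        logit = x @ w_in[node]                    # one dot product per visited node
        y = y + torch.nn.functional.gelu(logit) * w_out[node]
        node = 2 * node + 1 + int(logit > 0)      # branch on the sign of the pre-activation
    return y

# Depth 8 touches 8 of the 2**8 - 1 = 255 nodes per input; a dense hidden layer
# of the same nominal width would evaluate all of them.
x = torch.randn(8000)
w_in = torch.randn(255, 8000) / 8000 ** 0.5
w_out = torch.randn(255, 8000) / 255 ** 0.5
print(fff_forward(x, w_in, w_out, depth=8).shape)   # torch.Size([8000])
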
pbelcak commented 10 months ago

Hi @p-i-,

As noted in the introduction section of the preprint, the model provided through HuggingFace only simulates the conditionality. If you look at the code, you will see that the FFF implementation backing the HuggingFace model computes all neurons and merely masks out the ones not used for a particular inference instance, rather than skipping their computation.

That is why you are not seeing any meaningful improvement :)
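For anyone else puzzled by this, here is roughly what "simulating the conditionality" means (a hypothetical sketch using the same toy parameterization as above, not the actual modelling code). The masked variant still performs the full dense matmul, so it does the same work as an ordinary feed-forward layer and cannot be faster:

import torch

def fff_masked_forward(x, w_in, w_out, depth):
    # "Simulated" conditionality: compute ALL node pre-activations densely,
    # then zero out every node that is not on the chosen root-to-leaf path.
    logits = x @ w_in.T                          # full (n_nodes,) matmul, no savings
    mask = torch.zeros_like(logits)
    node = 0
    for _ in range(depth):
        mask[node] = 1.0                         # keep only the nodes on the path
        node = 2 * node + 1 + int(logits[node] > 0)
    # Same FLOPs as a dense layer; the mask changes the result, not the cost,
    # which is why the HuggingFace model shows no wall-clock speedup.
    return (torch.nn.functional.gelu(logits) * mask) @ w_out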