Closed by p-i- 10 months ago
Hi @p-i-,
As noted in the introduction of the preprint, the model provided through HuggingFace only simulates the conditional execution. In the code itself, you will find that the FFF implementation used by the HuggingFace model merely masks out the neurons that are not used for a given inference instance, so no computation is actually skipped.
That is why you are not seeing any meaningful improvement :)
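To make the distinction concrete, here is a minimal pure-Python sketch of the two strategies on a toy binary tree of neurons. It is an illustration under stated assumptions, not the code from the repository or the HuggingFace model: the names `make_tree`, `fff_masked`, and `fff_conditional`, the GELU activation, the random weights, and the "go right if the logit is positive" routing rule are all hypothetical choices made for this example.

```python
import math
import random


def gelu(z):
    # Exact GELU via the error function (activation choice is an assumption).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def make_tree(depth, dim, seed=0):
    # A full binary tree with 2**depth - 1 neurons; each neuron has one
    # input weight vector and one output weight vector (hypothetical layout).
    rng = random.Random(seed)
    n_nodes = 2 ** depth - 1
    w_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
    w_out = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
    return w_in, w_out


def fff_masked(x, w_in, w_out, depth):
    # Simulated conditionality: evaluate EVERY neuron, then zero out
    # the ones that are not on the chosen root-to-leaf path.
    logits = [dot(x, w) for w in w_in]
    mask = [0.0] * len(w_in)
    node = 0
    for _ in range(depth):
        mask[node] = 1.0
        node = 2 * node + 1 + (1 if logits[node] > 0 else 0)
    y = [0.0] * len(x)
    for i, logit in enumerate(logits):
        a = gelu(logit) * mask[i]
        for j in range(len(y)):
            y[j] += a * w_out[i][j]
    return y, len(w_in)  # output, number of neurons evaluated


def fff_conditional(x, w_in, w_out, depth):
    # True conditional execution: evaluate only the `depth` neurons
    # that lie on the chosen path.
    y = [0.0] * len(x)
    node = 0
    for _ in range(depth):
        logit = dot(x, w_in[node])
        a = gelu(logit)
        for j in range(len(y)):
            y[j] += a * w_out[node][j]
        node = 2 * node + 1 + (1 if logit > 0 else 0)
    return y, depth  # output, number of neurons evaluated
```

Both functions return the same output for the same input, but the masked variant evaluates all `2**depth - 1` neurons while the conditional variant evaluates only `depth` of them. A masking implementation therefore cannot be faster than an equally sized dense layer, whatever the single-layer benchmarks show.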
Boyan and I benchmarked FFF-BERT (the HuggingFace release) against a vanilla BERT of similar size and found that it runs roughly 15% slower on my M2 Mac.
https://gist.github.com/p-i-/355668983aaeee3f282977cdfb93017c
This seems surprising, as the benchmarks do indeed demonstrate a ~50x speedup for a single feed-forward layer:
Speedups for batch sizes 100, 10, and 1: