Closed pcuenca closed 10 months ago
Do you have a sense of what range of values for `k` and `arr.count` we want to be optimizing for? If not, I can try running swift-chat with a few models and see what values are typical.
Hi, I've created a little benchmark repo to compare different implementations: https://github.com/jkrukowski/topk. Looks like `BNNS.NearestNeighbors` is the slowest option here (I might be using it wrong, though).
@jkrukowski Very cool, thanks for that! 🙌

Actually, I was confused when I read the release notes: the implementation I linked in this issue is a nearest-neighbor selection, not top-k. The top-k functions in `BNNS` are described here.

From your analysis, it looks like the current implementation is faster for small numbers, but degrades for larger values of `k`. The usual scenario for these models is a vocabulary of a few tens of thousands of tokens. For example, Llama 2 uses 32,000 tokens while Falcon uses about 65,000.

For typical values of `k`, I'd say most of the time it's a small number (10 or less), and I think it'd be rare for it to be larger than 100. If you have time, could you maybe run your benchmark with values within those ranges? I'd also suggest removing the `softmax` operation from your benchmark, as it would be constant among all the implementations.
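For reference, calling the BNNS top-k entry point over a logits vector looks roughly like the sketch below. This is untested and the descriptor setup, availability, and parameter names should be checked against the BNNS documentation; `topK` is a hypothetical helper, and the shapes are assumptions.

```swift
import Accelerate

// Sketch (untested): select the k largest logits and their indices
// using BNNS.applyTopK. Requires a recent Apple OS; see the BNNS docs
// for the exact availability annotations.
func topK(_ logits: [Float], k: Int) throws -> (values: [Float], indices: [UInt32]) {
    var input = logits
    var values = [Float](repeating: 0, count: k)
    var indices = [UInt32](repeating: 0, count: k)

    try input.withUnsafeMutableBufferPointer { inputPtr in
        try values.withUnsafeMutableBufferPointer { valuesPtr in
            try indices.withUnsafeMutableBufferPointer { indicesPtr in
                // Wrap the buffers in NDArray descriptors (1-D vectors).
                guard
                    let inputDesc = BNNSNDArrayDescriptor(data: inputPtr, shape: .vector(inputPtr.count)),
                    let valuesDesc = BNNSNDArrayDescriptor(data: valuesPtr, shape: .vector(k)),
                    let indicesDesc = BNNSNDArrayDescriptor(data: indicesPtr, shape: .vector(k))
                else { fatalError("descriptor creation failed") }

                try BNNS.applyTopK(
                    k: k,
                    input: inputDesc,
                    bestValues: valuesDesc,
                    bestIndices: indicesDesc,
                    axis: 0,
                    batchSize: 1,
                    filterParameters: nil)
            }
        }
    }
    return (values, indices)
}
```

With a vocabulary of ~32,000 logits and `k` around 10, this is the shape of workload discussed above.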
Adjusted, please take a look: https://github.com/jkrukowski/topk. I've added an additional XCTest performance benchmark. The Google Benchmark results are not very conclusive, but according to XCTest, `BNNS.applyTopK` seems to be the best performer.
@pcuenca Given the above, do you think the implementation present in https://github.com/jkrukowski/topk/blob/f2c164e5c360a6a055b785db8830b78c38936e12/Sources/TopK/Accelerate.swift#L7 would be OK? If so, I could create a PR.
@jkrukowski Sounds great!
I ran it on my system (M1 Max) and observed similar results, but wondered why Google Benchmark shows such different results. I moved the array creation inside the measurement loop to invalidate any potential caching that might be in place. The differences are smaller (as they are dominated by the overhead of array creation), but the Accelerate implementation is still faster, so it looks safe to go ahead!
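Sketched below is the kind of XCTest measurement loop described above, with the array creation inside the `measure` block so cached inputs can't skew the timing. `topK` is a hypothetical name for the function under test, and the sizes mirror the typical values discussed earlier.

```swift
import XCTest

final class TopKBenchmark: XCTestCase {
    func testTopKPerformance() {
        measure {
            // Create the input inside the measured block to defeat caching.
            // ~32k logits matches a Llama 2-sized vocabulary; k = 10 is a
            // typical sampling value.
            let logits = (0..<32_000).map { _ in Float.random(in: -10...10) }
            _ = try? topK(logits, k: 10)  // hypothetical function under test
        }
    }
}
```

Note that with allocation inside the loop, the measured time includes array creation overhead, which is why the differences between implementations shrink.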
I'd just recommend adding a few top-k tests to the repo, if possible, to ensure nothing changes before and after – and because it'd be awesome to have them! 😇
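A correctness test along these lines could pin the behavior against a naive sort-based reference (names here are hypothetical; `topK` stands for the implementation under test):

```swift
import XCTest

final class TopKTests: XCTestCase {
    // Naive reference: sort (value, index) pairs descending and take the first k.
    private func referenceTopK(_ values: [Float], k: Int) -> [Int] {
        Array(values.enumerated()
            .sorted { $0.element > $1.element }
            .prefix(k)
            .map { $0.offset })
    }

    func testTopKMatchesNaiveSort() throws {
        let logits = (0..<32_000).map { _ in Float.random(in: -10...10) }
        let expected = referenceTopK(logits, k: 10)
        let result = try topK(logits, k: 10)  // hypothetical function under test
        XCTAssertEqual(result.indices.map(Int.init), expected)
    }
}
```

Ties could make index order ambiguous between implementations, so comparing the selected values (rather than indices) may be the more robust assertion if that comes up.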
Thanks a lot for working on this, it's great to bring performance closer to the greedy case!
@pcuenca added here https://github.com/huggingface/swift-transformers/pull/21
Reference: https://developer.apple.com/documentation/accelerate/bnns#4164142
We should use it when permitted by the underlying OS, and compare its performance with the current implementation: https://github.com/huggingface/swift-transformers/commit/d2917b2b7f687592886aeb3d36eec7e2694e6b17
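Gating on the underlying OS could look like the sketch below. The OS versions shown are assumptions to be checked against the availability annotations in the BNNS documentation, and both helpers are hypothetical names.

```swift
import Accelerate

// Sketch: use the BNNS-based top-k when the OS supports it, and fall
// back to the current implementation otherwise. Versions are assumptions;
// verify against the BNNS availability annotations.
func selectTop(k: Int, from logits: [Float]) -> (values: [Float], indices: [Int]) {
    if #available(macOS 14.0, iOS 17.0, *) {
        return bnnsTopK(logits, k: k)     // BNNS.applyTopK-based path (hypothetical helper)
    } else {
        return currentTopK(logits, k: k)  // existing implementation (hypothetical helper)
    }
}
```

This keeps the behavior identical on older systems while letting newer ones benefit from the faster path.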