Closed pcuenca closed 10 months ago
Do you have a sense of what range of values for `k` and `arr.count` we want to be optimizing for? If not, I can try running swift-chat with a few models and see what values are typical.
Hi, I've created a little benchmark repo to compare different implementations: https://github.com/jkrukowski/topk. Looks like `BNNS.NearestNeighbors` is the slowest option here (I might be using it wrong, though).
@jkrukowski Very cool, thanks for that! 🙌

Actually, I was confused when I read the release notes: the implementation I linked in this issue is a nearest-neighbor selection, not top-k. The top-k functions in `BNNS` are described here.

From your analysis, it looks like the current implementation is faster for small numbers, but degrades for larger values of `k`. The usual scenario for these models is a vocabulary of a few tens of thousands of tokens. For example, Llama 2 uses 32,000 tokens while Falcon uses about 65,000.

For typical values of `k`, I'd say most of the time it's a small number (10 or less), and I think it'd be rare for it to be larger than 100. If you have time, could you maybe run your benchmark with values within those ranges? I'd also suggest removing the `softmax` operation from your benchmark, as it would be constant among all the implementations.
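For reference, calling the BNNS top-k entry point over a logits vector looks roughly like the sketch below. This is untested and the descriptor setup, availability, and parameter names should be checked against the BNNS documentation; `topK` is a hypothetical helper, and the shapes are assumptions.

```swift
import Accelerate

// Sketch (untested): select the k largest logits and their indices
// using BNNS.applyTopK. Requires a recent Apple OS; see the BNNS docs
// for the exact availability annotations.
func topK(_ logits: [Float], k: Int) throws -> (values: [Float], indices: [UInt32]) {
    var input = logits
    var values = [Float](repeating: 0, count: k)
    var indices = [UInt32](repeating: 0, count: k)

    try input.withUnsafeMutableBufferPointer { inputPtr in
        try values.withUnsafeMutableBufferPointer { valuesPtr in
            try indices.withUnsafeMutableBufferPointer { indicesPtr in
                // Wrap the buffers in NDArray descriptors (1-D vectors).
                guard
                    let inputDesc = BNNSNDArrayDescriptor(data: inputPtr, shape: .vector(inputPtr.count)),
                    let valuesDesc = BNNSNDArrayDescriptor(data: valuesPtr, shape: .vector(k)),
                    let indicesDesc = BNNSNDArrayDescriptor(data: indicesPtr, shape: .vector(k))
                else { fatalError("descriptor creation failed") }

                try BNNS.applyTopK(
                    k: k,
                    input: inputDesc,
                    bestValues: valuesDesc,
                    bestIndices: indicesDesc,
                    axis: 0,
                    batchSize: 1,
                    filterParameters: nil)
            }
        }
    }
    return (values, indices)
}
```

With a vocabulary of ~32,000 logits and `k` around 10, this is the shape of workload discussed above.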
Adjusted, please take a look: https://github.com/jkrukowski/topk. I've added an additional XCTest performance benchmark. The Google Benchmark results are not very conclusive, but according to XCTest, `BNNS.applyTopK` seems to be the best performer.
@pcuenca Given the above, do you think the implementation present in https://github.com/jkrukowski/topk/blob/f2c164e5c360a6a055b785db8830b78c38936e12/Sources/TopK/Accelerate.swift#L7 would be OK? If so, I could create a PR.
@jkrukowski Sounds great!
I ran it on my system (M1 Max) and observed similar results, but wondered why Google Benchmark shows such different results. I moved the array creation inside the measurement loop to invalidate any potential caching that might be in place. The differences are smaller (as they are dominated by the overhead of array creation), but the Accelerate implementation is still faster, so it looks safe to go ahead!
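Sketched below is the kind of XCTest measurement loop described above, with the array creation inside the `measure` block so cached inputs can't skew the timing. `topK` is a hypothetical name for the function under test, and the sizes mirror the typical values discussed earlier.

```swift
import XCTest

final class TopKBenchmark: XCTestCase {
    func testTopKPerformance() {
        measure {
            // Create the input inside the measured block to defeat caching.
            // ~32k logits matches a Llama 2-sized vocabulary; k = 10 is a
            // typical sampling value.
            let logits = (0..<32_000).map { _ in Float.random(in: -10...10) }
            _ = try? topK(logits, k: 10)  // hypothetical function under test
        }
    }
}
```

Note that with allocation inside the loop, the measured time includes array creation overhead, which is why the differences between implementations shrink.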
I'd just recommend adding a few top-k tests to the repo, if possible, to ensure nothing changes before and after – and because it'd be awesome to have them! 😇
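A correctness test along these lines could pin the behavior against a naive sort-based reference (names here are hypothetical; `topK` stands for the implementation under test):

```swift
import XCTest

final class TopKTests: XCTestCase {
    // Naive reference: sort (value, index) pairs descending and take the first k.
    private func referenceTopK(_ values: [Float], k: Int) -> [Int] {
        Array(values.enumerated()
            .sorted { $0.element > $1.element }
            .prefix(k)
            .map { $0.offset })
    }

    func testTopKMatchesNaiveSort() throws {
        let logits = (0..<32_000).map { _ in Float.random(in: -10...10) }
        let expected = referenceTopK(logits, k: 10)
        let result = try topK(logits, k: 10)  // hypothetical function under test
        XCTAssertEqual(result.indices.map(Int.init), expected)
    }
}
```

Ties could make index order ambiguous between implementations, so comparing the selected values (rather than indices) may be the more robust assertion if that comes up.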
Thanks a lot for working on this, it's great to bring performance closer to the greedy case!
@pcuenca added here https://github.com/huggingface/swift-transformers/pull/21
Reference: https://developer.apple.com/documentation/accelerate/bnns#4164142
We should use it when permitted by the underlying OS, and compare its performance with the current implementation: https://github.com/huggingface/swift-transformers/commit/d2917b2b7f687592886aeb3d36eec7e2694e6b17
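Gating on the underlying OS could look like the sketch below. The OS versions shown are assumptions to be checked against the availability annotations in the BNNS documentation, and both helpers are hypothetical names.

```swift
import Accelerate

// Sketch: use the BNNS-based top-k when the OS supports it, and fall
// back to the current implementation otherwise. Versions are assumptions;
// verify against the BNNS availability annotations.
func selectTop(k: Int, from logits: [Float]) -> (values: [Float], indices: [Int]) {
    if #available(macOS 14.0, iOS 17.0, *) {
        return bnnsTopK(logits, k: k)     // BNNS.applyTopK-based path (hypothetical helper)
    } else {
        return currentTopK(logits, k: k)  // existing implementation (hypothetical helper)
    }
}
```

This keeps the behavior identical on older systems while letting newer ones benefit from the faster path.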