Closed ghost closed 10 months ago
// b, hw, t, h, k (batch, spatial tokens of x, context tokens of c, heads, per-head dim) are defined elsewhere.
let x = Input()
let c = Input()
let tokeys = Dense(count: k * h, noBias: true)
let toqueries = Dense(count: k * h, noBias: true)
let tovalues = Dense(count: k * h, noBias: true)
var queries = toqueries(x).reshaped([b, hw, h, k]).identity().identity()
var keys = tokeys(c).reshaped([b, t, h, k]).identity()
var values = tovalues(c).reshaped([b, t, h, k])
let scaledDotProductAttention = ScaledDotProductAttention(
scale: 1.0 / Float(k).squareRoot(), multiHeadOutputProjectionFused: true)
var out = scaledDotProductAttention(queries, keys, values)
/* Alternatively:
var out = scaledDotProductAttention(queries, keys, values).reshaped([b, hw, h * k])
let unifyheads = Dense(count: k * h)
out = unifyheads(out)
*/
return Model([x, c], [out])
To use 6-bit weights in memory, you need to add the .jit
option when loading a model:
graph.openStore("some path") {
$0.read("unet", model: unet, codec: [.q6p, .q8p, .jit, .ezm7])
}
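For intuition only (an illustrative sketch, not the actual .q6p/.ezm7 codec, which is more sophisticated): 6-bit palettized quantization stores each weight as an index into a small palette, so each weight costs roughly 6 bits instead of 16. The `palettize`/`dequantize` helper names below are made up for the demonstration.

```python
# Illustrative sketch of palette quantization (hypothetical helpers; NOT the
# real .q6p codec): each weight is replaced by an index into a 64-entry
# palette, i.e. ~6 bits instead of 16 per weight.
def palettize(weights, bits=6):
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    # Evenly spaced palette; map each weight to its nearest entry's index.
    palette = [lo + i * step for i in range(levels)]
    indices = [round((w - lo) / step) for w in weights]
    return palette, indices

def dequantize(palette, indices):
    return [palette[i] for i in indices]

weights = [0.03, -0.12, 0.5, -0.5, 0.27]
palette, idx = palettize(weights)
restored = dequantize(palette, idx)
# Each restored weight is within half a quantization step of the original.
```

As the reply above suggests, the .jit option is what keeps the quantized representation resident in memory instead of decompressing everything to 16-bit at load time.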
Thanks a lot, you are the best!
let keys = tokeys(x).reshaped([b, hw, h, k]).transposed(1, 2)
let queries = ((1.0 / Float(k).squareRoot()) * toqueries(x)).reshaped([b, hw, h, k])
.transposed(1, 2)
let values = tovalues(x).reshaped([b, hw, h, k]).transposed(1, 2)
var dot = Matmul(transposeB: (2, 3))(queries, keys)
dot = dot.reshaped([b * h * hw, hw])
dot = dot.softmax()
dot = dot.reshaped([b, h, hw, hw])
var out = dot * values
out = out.reshaped([b, h, hw, k]).transposed(1, 2).reshaped([b, hw, h * k])
let unifyheads = Dense(count: k * h)
out = unifyheads(out)
vs
let keys = tokeys(x).reshaped([b, hw, h, k]).identity()
var queries = toqueries(x).reshaped([b, hw, h, k]).identity().identity()
var values = tovalues(x).reshaped([b, hw, h, k])
let scaledDotProductAttention = ScaledDotProductAttention(
scale: 1.0 / Float(k).squareRoot(), multiHeadOutputProjectionFused: true)
var out = scaledDotProductAttention(queries, keys, values).reshaped([b, hw, h * k])
let unifyheads = Dense(count: k * h)
out = unifyheads(out)
This seems to change the number of weight layers in the model. How?
The multiHeadOutputProjectionFused option fuses the unifyheads projection into the SDP op, hence there is no need for the separate unifyheads Dense layer afterwards.
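For what it's worth, the algebra that makes this kind of fusion possible: concatenating the per-head outputs and then applying the unifyheads weight W is the same as splitting W row-wise into per-head blocks and summing the per-head projections, which is one way a kernel can fold the projection into the attention op. A tiny pure-Python check with made-up numbers (h = 2 heads, per-head dim k = 2, output dim 3):

```python
# Toy demonstration: projecting concatenated heads == summing per-head projections.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

head1 = [[1.0, 2.0]]              # [1, k] output of head 1
head2 = [[3.0, 4.0]]              # [1, k] output of head 2
concat = [[1.0, 2.0, 3.0, 4.0]]   # [1, h * k], heads concatenated

W = [[0.1, 0.2, 0.3],             # unifyheads weight, [h * k, out]
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]
W1, W2 = W[:2], W[2:]             # row-wise per-head blocks of W

separate = matmul(concat, W)                        # unfused: concat, then project
fused = add(matmul(head1, W1), matmul(head2, W2))   # per-head projections, summed
# separate == fused (up to float rounding)
```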
let keys = tokeys(x).reshaped([b, hw, h, k]).transposed(1, 2)
let queries = ((1.0 / Float(k).squareRoot()) * toqueries(x)).reshaped([b, hw, h, k])
.transposed(1, 2)
let values = tovalues(x).reshaped([b, hw, h, k]).transposed(1, 2)
var dot = Matmul(transposeB: (2, 3))(queries, keys)
dot = dot.reshaped([b * h * hw, hw])
dot = dot.softmax()
dot = dot.reshaped([b, h, hw, hw])
var out = dot * values
out = out.reshaped([b, h, hw, k]).transposed(1, 2).reshaped([b, hw, h * k])
let unifyheads = Dense(count: k * h)
out = unifyheads(out)
vs
let keys = tokeys(x).reshaped([b, hw, h, k]).identity()
var queries = toqueries(x).reshaped([b, hw, h, k]).identity().identity()
var values = tovalues(x).reshaped([b, hw, h, k])
let scaledDotProductAttention = ScaledDotProductAttention(
scale: 1.0 / Float(k).squareRoot(), multiHeadOutputProjectionFused: true)
var out = scaledDotProductAttention(queries, keys, values).reshaped([b, hw, h * k])
These two still are not equivalent in terms of output.
SDP only works for .NHWC shape.
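As an aside, the attention math in the two paths does match; the remaining mismatch is tensor layout/format, not the computation. A minimal single-head pure-Python reference (hypothetical helper names) showing that pre-scaling the queries by 1/sqrt(k) is equivalent to scaling the logits, which is what the `scale:` parameter does:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Single-head scaled dot-product attention; Q: [n, k], K: [t, k], V: [t, d].
def attention(Q, K, V, scale, scale_queries_first):
    out = []
    for q in Q:
        if scale_queries_first:
            q = [x * scale for x in q]            # scale Q before the dot product
        logits = [sum(a * b for a, b in zip(q, krow)) for krow in K]
        if not scale_queries_first:
            logits = [l * scale for l in logits]  # or scale the logits instead
        w = softmax(logits)
        out.append([sum(wi * vrow[d] for wi, vrow in zip(w, V))
                    for d in range(len(V[0]))])
    return out

Q = [[1.0, 2.0], [0.5, -1.0]]
K = [[0.3, 0.7], [-0.2, 0.1], [1.1, 0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scale = 1.0 / math.sqrt(2.0)
a = attention(Q, K, V, scale, scale_queries_first=True)
b = attention(Q, K, V, scale, scale_queries_first=False)
# a and b agree (up to float rounding)
```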
I have tensors of shape (batch, time, heads, emb_dim); what's the shape SDP expects?
It expects tensor.format = .NHWC
any way to cast the tensor inside model creation?
I had changed the Unet.swift and made the convs format: .OIHW
any way to cast the tensor inside model creation?
.reshaped(.NHWC(?, ?, ?, ?))
That works, thanks
1) How to use Metal Flash Attention with the UNet model? 2) Also, is there any way I could only load the 6-bit weights in memory rather than 16-bit?