Closed by ghost 1 year ago
It is flattened because Embed / IndexSelect only accept tensors without a batch dimension (which doesn't matter given the nature of the op: the lookup is per-token). Thus, it ends up with the shape [b*L, C]. The batch is split back out in later ops: https://github.com/liuliu/swift-diffusion/blob/main/src/CLIPTextModel.swift#L25
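A minimal numpy sketch of this flatten → lookup → reshape pattern (names and sizes here are illustrative, not from the repo): because an embedding lookup is independent per token, folding the batch dimension into the sequence dimension before the lookup and splitting it back out afterwards gives the same result as a batched lookup.

```python
import numpy as np

# Illustrative sizes: batch b, sequence length L, embedding dim C.
b, L, C = 2, 4, 8
vocab = 16
rng = np.random.default_rng(0)
table = rng.random((vocab, C))            # embedding table, shape [vocab, C]
tokens = rng.integers(0, vocab, (b, L))   # token ids, shape [b, L]

# Flatten to [b*L] before the lookup; the op has no notion of batch.
flat = tokens.reshape(b * L)
embedded = table[flat]                    # shape [b*L, C]

# Later ops split the batch dimension back out.
restored = embedded.reshape(b, L, C)

# Identical to looking up without flattening.
assert np.array_equal(restored, table[tokens])
```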
Thanks
Looks like you are creating an embedding for max length L, but passing a sequence of length 2*L. Is batching not actually supported? Does max length mean anything, since it seems to be violated?