Open lwtgithublwt opened 5 months ago
What exactly does the [1, 512] feature produced by the CLIP encoder represent, and how is it turned into a feature map with channel, height, and width dimensions?