Closed by mwaseemrandhawa 4 months ago
Hi, thanks a lot for your interest in our work!
I interpreted your question in the following ways and am sharing my response to each of them. If you have further questions, feel free to ask!
melbins x # of temporal frames)?
cls token during training, but instead using all the patches and applying average or max pooling over their transformations after the final AuM Block to obtain a temporal representation that aids such a classification / localization task.
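
To make the pooling idea above a bit more concrete, here is a minimal sketch in plain Python. It is not code from the repository: the function name `pool_over_frequency`, the arguments `n_freq` and `n_time`, and the row-major patch ordering (frequency index varying slowest) are all assumptions made for illustration. It averages (or takes the max of) the patch outputs that share the same temporal position, giving one vector per temporal step that a frame-level classifier head could consume.

```python
from typing import List

Vector = List[float]


def pool_over_frequency(tokens: List[Vector], n_freq: int, n_time: int,
                        mode: str = "mean") -> List[Vector]:
    """Pool patch outputs across the frequency axis, one vector per time step.

    ASSUMPTION (hypothetical layout): `tokens` holds n_freq * n_time patch
    outputs in row-major order, i.e. index = f * n_time + t, and contains
    no cls token.
    """
    if len(tokens) != n_freq * n_time:
        raise ValueError("expected n_freq * n_time patch tokens")
    dim = len(tokens[0])
    frames: List[Vector] = []
    for t in range(n_time):
        # All patches that cover the same temporal position t.
        column = [tokens[f * n_time + t] for f in range(n_freq)]
        if mode == "mean":
            pooled = [sum(vec[d] for vec in column) / n_freq for d in range(dim)]
        elif mode == "max":
            pooled = [max(vec[d] for vec in column) for d in range(dim)]
        else:
            raise ValueError("mode must be 'mean' or 'max'")
        frames.append(pooled)
    return frames


# Toy input: 2 frequency patches x 3 temporal patches, embedding size 2.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
              [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
mean_frames = pool_over_frequency(toy_tokens, n_freq=2, n_time=3, mode="mean")
max_frames = pool_over_frequency(toy_tokens, n_freq=2, n_time=3, mode="max")
```

Each entry of `mean_frames` or `max_frames` corresponds to one temporal patch position, so the temporal resolution of the predictions would be bounded by the patch size along the time axis. In an actual model the same reduction would normally be done with tensor operations over the frequency dimension rather than Python loops.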
Can we use this model for frame-level classification?