Frame level classification?

Hi, thanks a lot for your interest in our work!

I interpreted your question in the following two ways and am sharing my response to each. If you have further questions, feel free to ask!

Does AuM training / inference support frame-shaped patches (e.g. of size mel bins x number of temporal frames)?
- Yes, it does. You can achieve this by adjusting the patch size and stride parameters so that each patch spans all mel bins and the desired number of temporal frames.
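
To illustrate how the patch size and stride determine the token layout, here is a minimal sketch. It assumes an unpadded, strided patch embedding (as in ViT/AST-style models); `patch_grid` and the 128 x 1024 input shape are hypothetical and not part of the AuM codebase.

```python
def patch_grid(n_mels, n_frames, patch_size, stride):
    """Return (patches along frequency, patches along time).

    Assumes an unpadded, strided patch embedding; patch_size and stride
    are (frequency, time) pairs.
    """
    (ph, pw), (sh, sw) = patch_size, stride
    if ph > n_mels or pw > n_frames:
        raise ValueError("patch is larger than the spectrogram")
    n_freq = (n_mels - ph) // sh + 1
    n_time = (n_frames - pw) // sw + 1
    return n_freq, n_time


# Hypothetical input: 128 mel bins x 1024 temporal frames.
print(patch_grid(128, 1024, (16, 16), (16, 16)))  # square patches -> (8, 64)
print(patch_grid(128, 1024, (128, 4), (128, 4)))  # frame-shaped   -> (1, 256)
```

With a patch height equal to the number of mel bins, there is a single patch along frequency, so the token sequence becomes a purely temporal sequence.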
Could AuM be trained to support temporal event localization or temporal segment classification?
- Yes, but you would need to modify the loss computation and the architecture. One way to do this is to drop the cls token during training and instead use all the patch tokens, applying average or max pooling over their outputs after the final AuM block to obtain a temporal representation for the classification / localization task.
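
As a rough sketch of one possible reading of the pooling idea, the snippet below pools the patch tokens over the frequency axis so that one vector remains per time step. It is written in plain Python for clarity. The helper `pool_frames`, the frequency-major token ordering, and the toy values are assumptions for illustration, not the AuM implementation.

```python
def pool_frames(tokens, n_freq, n_time, mode="mean"):
    """Pool patch tokens over frequency, giving one vector per time step.

    tokens: list of n_freq * n_time vectors (lists of floats), without a
            cls token, assumed ordered frequency-major: index = f * n_time + t.
    """
    if len(tokens) != n_freq * n_time:
        raise ValueError("token count does not match the patch grid")
    if mode == "mean":
        pool = lambda xs: sum(xs) / len(xs)
    elif mode == "max":
        pool = max
    else:
        raise ValueError("mode must be 'mean' or 'max'")
    frames = []
    for t in range(n_time):
        # All patches that cover time step t, one per frequency position.
        column = [tokens[f * n_time + t] for f in range(n_freq)]
        # Pool each embedding dimension across the frequency positions.
        frames.append([pool(values) for values in zip(*column)])
    return frames


# Toy example: 2 frequency patches x 3 time patches, embedding dim 2.
tokens = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
          [3.0, 4.0], [4.0, 4.0], [5.0, 4.0]]
print(pool_frames(tokens, 2, 3))         # [[2.0, 2.0], [3.0, 2.0], [4.0, 2.0]]
print(pool_frames(tokens, 2, 3, "max"))  # [[3.0, 4.0], [4.0, 4.0], [5.0, 4.0]]
```

Each pooled vector could then be passed to a per-time-step classifier, with the loss computed against frame-level or segment-level labels. If the patches already span all mel bins (`n_freq = 1`), the pooling step leaves the tokens unchanged.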

kaistmm / Audio-Mamba-AuM

Frame level classification? #2