Hi @abhigoku10, thank you for your interest in our work!
It is okay to do visualization on CoaT using CAM / GradCAM. However, if your aim is to visualize the attention map in CoaT, it is a bit more difficult: our attention mechanism never materializes an explicit attention map, since it computes the product of K and V first, so there is no attention map to extract directly. That said, you can mimic standard self-attention and manually compute the product of Q and K to generate one.
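For reference, a minimal sketch of that workaround. It assumes you capture the `q` and `k` tensors (shape `[B, num_heads, N, C_head]`) with a forward hook inside the factorized attention module, right after its qkv projection; the hook point and shapes are assumptions on my side, not something CoaT exposes directly:

```python
import torch
import torch.nn.functional as F

def attention_map_from_qk(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Recover an explicit attention map by mimicking standard self-attention.

    q, k: [B, num_heads, N, C_head] tensors, e.g. captured with a forward
    hook inside the factorized attention module (hypothetical hook point;
    CoaT itself never builds this N x N map).
    Returns a [B, num_heads, N, N] softmax attention map.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale  # pairwise token similarities
    return F.softmax(attn, dim=-1)
```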
This is because CoaT and CoaT-Lite use different channel settings. We try to align the parameter counts of CoaT and CoaT-Lite for a roughly head-to-head comparison, but some gap remains: in the Tiny and Mini models, CoaT has slightly fewer parameters, while in the Small models, CoaT-Lite has fewer.
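If you want to check the gap yourself, a quick sketch, assuming the timm-style factory functions registered in `src/models/coat.py`:

```python
from src.models.coat import coat_tiny, coat_lite_tiny

# Count learnable parameters of each variant to compare the gap.
for factory in (coat_tiny, coat_lite_tiny):
    n_params = sum(p.numel() for p in factory().parameters())
    print(f"{factory.__name__}: {n_params / 1e6:.2f}M parameters")
```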
I would suggest reducing the channels in CoaT-Lite Tiny first. You can try a series of ratios t (e.g., t = 0.3, 0.5, 0.7, 0.9): multiply all channel widths by t and train a model for each ratio (perhaps on a subset of ImageNet if computational resources are limited), as sketched below. Then plot the validation accuracy of these models and analyze the accuracy drop w.r.t. the parameter reduction. You may also try other ways to reduce the parameters (e.g., removing blocks, or reducing channels only in certain blocks) and compare the resulting curves to find the best configuration.
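A sketch of that sweep, assuming a width-scaled builder like the `coat_lite_tiny_scaled` example further down in this thread, plus a hypothetical `train_and_evaluate` helper that trains the model and returns validation accuracy:

```python
results = []
for t in (0.3, 0.5, 0.7, 0.9):
    model = coat_lite_tiny_scaled(t=t)   # width-scaled variant, see below
    n_params = sum(p.numel() for p in model.parameters())
    # Hypothetical helper: train on your ImageNet (subset), return top-1 acc.
    val_acc = train_and_evaluate(model)
    results.append((t, n_params, val_acc))

# Inspect validation accuracy vs. parameter count to see the trade-off curve.
for t, n_params, val_acc in results:
    print(f"t={t}: {n_params / 1e6:.2f}M params, top-1 = {val_acc:.2f}")
```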
@xwjabc thanks for the response.
You may try to modify the value of `embed_dims` in https://github.com/mlpc-ucsd/CoaT/blob/main/src/models/coat.py#L609.
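For example, a minimal sketch of a width-scaled variant. It assumes the `coat_lite_tiny` settings in the repo (`embed_dims=[64, 128, 256, 320]`, `num_heads=8`) and the `CoaT` constructor signature in `src/models/coat.py`; double-check both against your checkout:

```python
from src.models.coat import CoaT

def coat_lite_tiny_scaled(t: float = 0.5, **kwargs) -> CoaT:
    base_dims = [64, 128, 256, 320]  # coat_lite_tiny embed_dims (verify at #L609)
    num_heads = 8
    # Round each scaled width to a multiple of num_heads so the
    # per-head channel split stays integral.
    embed_dims = [max(num_heads, num_heads * round(d * t / num_heads))
                  for d in base_dims]
    return CoaT(patch_size=4, embed_dims=embed_dims,
                serial_depths=[2, 2, 2, 2], parallel_depth=0,
                num_heads=num_heads, mlp_ratios=[8, 8, 4, 4], **kwargs)
```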
@yix081 @xwjabc thanks for your work, it has helped me a lot, but I had a few queries.