leexinhao closed this issue 1 year ago
I notice that in your paper a bigger embed dim doesn't necessarily work better, but yours (16) is still very small compared to other works (AdaptFormer uses 64, AIM uses 256). As I understand it, a larger dimension only increases the training cost, not the inference cost, thanks to structural re-parameterization, so trying a larger embed dim might lead to better performance without any loss of efficiency.
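For context on the re-parameterization point, here is a minimal sketch (not necessarily the paper's exact formulation; the adapter placement and all names are assumptions) of why a purely linear adapter can be merged into the frozen weight after training, so the embed dim only affects training cost:

```python
import torch
import torch.nn as nn

# Hypothetical setup: a linear adapter (down-proj -> up-proj, no nonlinearity)
# added as a residual branch after a frozen Linear layer. Because the adapter
# is purely linear, it can be merged into the frozen weight after training,
# so inference cost does not depend on the adapter dim.

d_model, d_adapter = 768, 64  # d_adapter only matters at training time

frozen = nn.Linear(d_model, d_model)               # pretrained, frozen
down = nn.Linear(d_model, d_adapter, bias=False)   # trainable
up = nn.Linear(d_adapter, d_model, bias=False)     # trainable

def forward_train(x):
    # training-time path: frozen layer + residual linear adapter
    y = frozen(x)
    return y + up(down(y))

@torch.no_grad()
def merge():
    # (I + W_up W_down) @ W_frozen -> a single Linear of the original shape
    mix = torch.eye(d_model) + up.weight @ down.weight   # [d_model, d_model]
    merged = nn.Linear(d_model, d_model)
    merged.weight.copy_(mix @ frozen.weight)
    merged.bias.copy_(mix @ frozen.bias)
    return merged

x = torch.randn(2, d_model)
print(torch.allclose(forward_train(x), merge()(x), atol=1e-5))  # True
```

After merging, the model runs a single Linear of the original shape, so picking 64 or 128 instead of 16 changes only the number of trainable parameters, not the inference FLOPs.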
Dimension size seems to depend on the task and dataset. On VTAB-1K, larger dimensions (>8) degrade performance. On video classification, we use a larger dimension (16) and achieve better results than with smaller ones (2 and 8).
Have you ever tried a larger dim like 64 or 128?
No, but I think a dim of 64 or 128 would perform worse than 8 on VTAB-1K, and probably better on video classification.
Thanks for your reply and nice work!