Open CacatuaAlan opened 1 month ago

I have a 4-stage network, and since each stage has a different number of tokens, I want to set a different d_state per stage, e.g., [256, 128, 64, 32]. However, I noticed that training has slowed down significantly. Is this a normal phenomenon? Does changing d_state in this way make sense?

Hi! I also find that this change noticeably affects training speed. This paper gives a theoretical discussion of it that you may want to refer to: https://arxiv.org/abs/2407.07279. But what really concerns me is that, from this perspective, although the per-step training cost may grow, the model should converge faster. In my opinion, that would mean fewer epochs but better test accuracy? In the end, though, I could not get significantly better performance....
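A rough back-of-the-envelope sketch of why this slows training, assuming the per-token SSM recurrence cost scales roughly with d_model * d_state. The token counts and d_model values below are hypothetical (typical of a hierarchical 4-stage backbone), not taken from the issue:

```python
def ssm_stage_cost(num_tokens, d_model, d_state):
    """Rough proxy for a stage's SSM recurrence cost (not exact FLOPs)."""
    return num_tokens * d_model * d_state

tokens_per_stage  = [3136, 784, 196, 49]   # assumed: spatial resolution halves per stage
d_model_per_stage = [96, 192, 384, 768]    # assumed: typical hierarchical widths
varied_d_state    = [256, 128, 64, 32]     # the per-stage setting from the question
uniform_d_state   = [64, 64, 64, 64]       # assumed fixed-d_state baseline

varied = sum(ssm_stage_cost(t, d, s)
             for t, d, s in zip(tokens_per_stage, d_model_per_stage, varied_d_state))
uniform = sum(ssm_stage_cost(t, d, s)
              for t, d, s in zip(tokens_per_stage, d_model_per_stage, uniform_d_state))

# Early stages dominate: they have the most tokens AND the largest d_state,
# so the varied setting is several times costlier than the uniform baseline.
print(f"varied/uniform cost ratio: {varied / uniform:.2f}")
```

Under these assumptions the early, token-heavy stages get the largest d_state, so the total cost grows by roughly 2-3x versus a uniform d_state, which would explain a noticeable slowdown even before any kernel/implementation overhead.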