redna11 opened this issue 3 years ago
After running the JAX code and using the architecture details from the research paper, I obtain the following parameter counts for each task:

ListOps: 19.9M
Text: 3.5M
Retrieval: 1.087M
Image: 380K
Pathfinder: 315K

Could you kindly confirm that these are correct, so that a fair comparison can be made with alternative models?
Thanks!
Hi, could you tell me how you counted the parameters of the model on the CIFAR10 task? After running it, my model has just 50k parameters. Thanks!
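For reference, one simple way to count parameters in JAX, assuming a standard Flax-style parameter pytree (e.g. the output of `model.init`), is to sum the sizes of all leaf arrays. A minimal sketch, not necessarily the exact counting used in the LRA code:

```python
import jax

def count_params(params):
    # Sum the sizes of every leaf array in the parameter pytree.
    return sum(p.size for p in jax.tree_util.tree_leaves(params))

# Usage (model, rng, and dummy_input are placeholders for your own setup):
# params = model.init(rng, dummy_input)
# print(count_params(params))
```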
The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file.
This constraint is not satisfied in the current code. Using the default parameters for the Image/CIFAR10 task, I found:

Transformer # params: 52,266
Performer # params: 248,458

The Performer model thus doesn't satisfy the 10% constraint; it has nearly five times (~4.75x) the parameters of the Transformer model. I suspect this is due to wrong hyperparameters:

Transformer: emb_dim=32, mlp_dim=64, num_heads=1, qkv_dim=32
Performer: emb_dim=128, mlp_dim=128, num_heads=8, qkv_dim=64

Everything is larger, which obviously leads to more parameters. Again, I apologize to the authors if this is due to any misconception on my part.
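As a rough sanity check, here is a back-of-the-envelope per-block estimate from those hyperparameters (weights only; embeddings, biases, and layer norms are ignored, so it won't match the exact totals above):

```python
def block_params(emb_dim, mlp_dim, qkv_dim):
    # Q, K, V projections plus the attention output projection.
    attn = 3 * emb_dim * qkv_dim + qkv_dim * emb_dim
    # Two dense layers of the feed-forward block.
    ffn = emb_dim * mlp_dim + mlp_dim * emb_dim
    return attn + ffn

print(block_params(32, 64, 32))    # Transformer config: 8192 per block
print(block_params(128, 128, 64))  # Performer config: 65536 per block
```

Per block the larger config is 8x the size; the measured overall ratio of ~4.75x is smaller because the embeddings and other terms left out here scale differently.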
Hi @redna11,
can you tell me which MLP dim you used when calculating the size of the text classification model? It seems it was 512, while I see 1024 in the LRA config. The information in the paper is also misleading:
All xformer models are parameterized by the same number of layers, heads and hidden dimensions, namely 8 heads, 512 hidden dimensions and d = 2048 for positional FFN layers.
Basically, the hyperparameters in the paper seem to be doubled relative to the config.
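For what it's worth, the MLP dim alone shifts the count substantially. A quick weights-only check with the paper's hidden size of 512 (biases ignored):

```python
def ffn_params(hidden, mlp_dim):
    # Two dense projections per FFN block: hidden -> mlp_dim -> hidden.
    return 2 * hidden * mlp_dim

for mlp_dim in (512, 1024, 2048):
    print(mlp_dim, ffn_params(512, mlp_dim))
# 512  ->  524288 weights per layer
# 1024 -> 1048576 weights per layer
# 2048 -> 2097152 weights per layer
```

So whether one assumes mlp_dim of 512, the config's 1024, or the paper's 2048 changes the per-layer FFN count by 2x at each step, which easily explains discrepancies of this size.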
Hello,
You mention: "The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file."
Have you published those baseline parameter counts for each task?
Thanks