redna11 opened this issue 3 years ago
After running the JAX code and using the architecture details from the research paper, I obtain the following parameter counts for each task:

ListOps: 19.9M
Text: 3.5M
Retrieval: 1.087M
Image: 380K
Pathfinder: 315K

Could you kindly confirm that these are correct, so that a fair comparison can be made with alternative models?
Thanks!
Hi, could you tell me how you counted the parameters of the model on the CIFAR10 task? After running it, my model has just 50k parameters. Thanks!
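For reference, one simple way to count parameters in JAX, assuming a standard Flax-style parameter pytree (e.g. the output of `model.init`), is to sum the sizes of all leaf arrays. A minimal sketch, not necessarily the exact counting used in the LRA code:

```python
import jax

def count_params(params):
    # Sum the sizes of every leaf array in the parameter pytree.
    return sum(p.size for p in jax.tree_util.tree_leaves(params))

# Usage (model, rng, and dummy_input are placeholders for your own setup):
# params = model.init(rng, dummy_input)
# print(count_params(params))
```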
The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file.
This constraint is not satisfied in the current code. Using the default parameters for the Image/CIFAR10 task, I found:

Transformer # params: 52,266
Performer # params: 248,458

The Performer model thus doesn't satisfy the 10% constraint; it has nearly five times (~4.75x) the parameters of the Transformer model. I suspect this is due to wrong hyperparameters:

Transformer: emb_dim=32, mlp_dim=64, num_heads=1, qkv_dim=32
Performer: emb_dim=128, mlp_dim=128, num_heads=8, qkv_dim=64

Everything is larger, which obviously leads to more parameters. Again, I apologize to the authors if this is due to any misconception on my part.
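As a rough sanity check, here is a back-of-the-envelope per-block estimate from those hyperparameters (weights only; embeddings, biases, and layer norms are ignored, so it won't match the exact totals above):

```python
def block_params(emb_dim, mlp_dim, qkv_dim):
    # Q, K, V projections plus the attention output projection.
    attn = 3 * emb_dim * qkv_dim + qkv_dim * emb_dim
    # Two dense layers of the feed-forward block.
    ffn = emb_dim * mlp_dim + mlp_dim * emb_dim
    return attn + ffn

print(block_params(32, 64, 32))    # Transformer config: 8192 per block
print(block_params(128, 128, 64))  # Performer config: 65536 per block
```

Per block the larger config is 8x the size; the measured overall ratio of ~4.75x is smaller because the embeddings and other terms left out here scale differently.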
Hi @redna11,
can you tell me which MLP dim you used when calculating the size of the text classification model? It seems it was 512, while I see 1024 in the LRA config. The information in the paper is also misleading:
All xformer models are parameterized by the same number of layers, heads and hidden dimensions, namely 8 heads, 512 hidden dimensions and d = 2048 for positional FFN layers.
Basically, the hyperparameters in the paper seem to be doubled relative to the config.
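For what it's worth, the MLP dim alone shifts the count substantially. A quick weights-only check with the paper's hidden size of 512 (biases ignored):

```python
def ffn_params(hidden, mlp_dim):
    # Two dense projections per FFN block: hidden -> mlp_dim -> hidden.
    return 2 * hidden * mlp_dim

for mlp_dim in (512, 1024, 2048):
    print(mlp_dim, ffn_params(512, mlp_dim))
# 512  ->  524288 weights per layer
# 1024 -> 1048576 weights per layer
# 2048 -> 2097152 weights per layer
```

So whether one assumes mlp_dim of 512, the config's 1024, or the paper's 2048 changes the per-layer FFN count by 2x at each step, which easily explains discrepancies of this size.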
Hello,
You mention: "The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file."
Have you published those baseline parameter counts for each task?
Thanks