google-research / long-range-arena

Long Range Arena for Benchmarking Efficient Transformers

ListOps performance #15

Open dido1998 opened 3 years ago

dido1998 commented 3 years ago

On running the ListOps task as-is from the repo, I got validation performance similar to that reported in the paper, but the test performance in results.json is very low:

{"accuracy": 0.17500001192092896, "loss": 3.032956123352051, "perplexity": 20.758506774902344}

I saw that the code saves the model from the last checkpoint rather than the model with the best validation performance. Could you detail the evaluation setup used in the paper, i.e., do you evaluate the model from the last checkpoint or from the best validation checkpoint?
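For reference, here is a rough sketch of what I mean by tracking the best validation checkpoint, assuming Flax's `checkpoints` API; `train_step`, `evaluate`, and the other arguments are placeholders, not the repo's actual function names:

```python
from flax.training import checkpoints


def train_with_best_ckpt(optimizer, train_iter, eval_ds, model_dir,
                         num_train_steps, eval_frequency,
                         train_step, evaluate):
  """Keeps the latest checkpoint plus a separate best-validation one."""
  best_acc = 0.0
  for step in range(num_train_steps):
    optimizer = train_step(optimizer, next(train_iter))
    if step % eval_frequency == 0:
      val_acc = evaluate(optimizer, eval_ds)
      # What the code does today: save the latest step.
      checkpoints.save_checkpoint(model_dir, optimizer, step)
      # What I am asking about: also keep the best-validation model.
      if val_acc > best_acc:
        best_acc = val_acc
        checkpoints.save_checkpoint(
            model_dir + '/best', optimizer, step, overwrite=True)
  return optimizer
```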

Thank you very much! :-)

sihyun-yu commented 3 years ago

Have you solved this problem? I have a similar issue.

dido1998 commented 3 years ago

Hi @sihyun-yu, I was not able to solve it.

apuaaChen commented 3 years ago

I hit a similar issue with transformer_base. The evaluation accuracy curve is a little odd: the highest accuracy reaches 0.3359 at step 2.5k, then drops below 0.2. I used the default configurations directly.

[attached image: evaluation accuracy vs. training step]

jinfengr commented 3 years ago

It seems the issue is fixed with the latest code push. Please add a comment if it still comes up.

renebidart commented 3 years ago

I found that either lowering the learning rate or increasing the batch size helped on this task. I suspect their hyperparameters assume a large effective batch size, since they train on TPUs.
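If it helps, here is a minimal sketch of such an override, assuming the repo's ml_collections-style configs (the module path and field names follow the ListOps base config but may differ across versions):

```python
# Hypothetical override config for GPU training; field names follow the
# ListOps base config but are not guaranteed across repo versions.
from lra_benchmarks.listops.configs import base_listops_config as base


def get_config():
  config = base.get_config()
  config.learning_rate = 0.0005  # lower LR to compensate for a smaller effective batch
  config.batch_size = 64         # or push this higher if GPU memory allows
  return config
```

If the trainer uses ml_collections config flags (the stock train.py appears to), fields can also be overridden on the command line, e.g. `--config.learning_rate=0.0005`.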

BalloutAI commented 2 years ago

I am still getting the same problem: my validation accuracy during training is high on ListOps, but when I run the test_only option I get very low accuracy!

BalloutAI commented 2 years ago

The problem is that the data is shuffled every time the code is run, so the token-to-id mapping changes between runs; the test script then encodes the inputs with a different vocabulary than the one the model was trained with, giving essentially random accuracy.
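One way to make the mapping deterministic is to derive it from the sorted set of tokens rather than from iteration order. This is an illustrative sketch, not the repo's actual pipeline (`build_vocab` and the whitespace tokenization are placeholders):

```python
def build_vocab(train_examples):
  """Builds a token -> id mapping that is independent of shuffle order."""
  tokens = set()
  for example in train_examples:
    tokens.update(example.split())
  # Sorting fixes the id assignment, so a training run and a later
  # test-only run agree on the encoding; id 0 is reserved for padding.
  return {tok: i + 1 for i, tok in enumerate(sorted(tokens))}
```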

yuzhenmao commented 1 year ago

> The problem is that the data is shuffled every time the code is run, so the token-to-id mapping changes between runs; the test script then encodes the inputs with a different vocabulary than the one the model was trained with, giving essentially random accuracy.

@BalloutAI Hi, I also ran into this issue: high training accuracy, low test accuracy. I also found that if I run the training process multiple times, sometimes the model does not even converge. Could you explain your idea a little more? Thank you.