google-research / long-range-arena

Long Range Arena for Benchmarking Efficient Transformers
Apache License 2.0

Pathfinder task cannot converge. #37

Closed liuyang148 closed 3 years ago

liuyang148 commented 3 years ago

I tried to run pathfinder32 on this dataset 5 times; 3 out of 5 runs did not converge, with the loss stuck at 0.6933 until the end, while the other 2 converged normally and reached a final accuracy of 75% (BigBird). It seems pretty random. I then tried a different model (Performer), and it never converged at all. However, the cifar10 task, which uses the same training code as pathfinder32, converges every time. Is this a problem with the dataset?

I0831 15:32:43.578928 140327465420608 train.py:276] eval in step: 16224, loss: 0.6931, acc: 0.5017
I0831 15:33:00.912956 140327465420608 train.py:242] train in step: 16536, loss: 0.6932, acc: 0.5011
I0831 15:33:02.813938 140327465420608 train.py:276] eval in step: 16536, loss: 0.6932, acc: 0.4983
I0831 15:33:21.293757 140327465420608 train.py:242] train in step: 16848, loss: 0.6931, acc: 0.5018
I0831 15:33:23.183998 140327465420608 train.py:276] eval in step: 16848, loss: 0.6932, acc: 0.4983
I0831 15:33:41.210031 140327465420608 train.py:242] train in step: 17160, loss: 0.6932, acc: 0.4997
I0831 15:33:43.294295 140327465420608 train.py:276] eval in step: 17160, loss: 0.6931, acc: 0.4983
MostafaDehghani commented 3 years ago

@liuyang148 I think by "coverage" you mean "converge" (please correct me if I'm wrong)? If so: Pathfinder is a difficult task for transformers (and for any other architecture that has no recurrence or an inductive bias for modeling transitivity). What you're observing is simply these models struggling to pick up the task. That's actually one of the main reasons we included Pathfinder in LRA.

liuyang148 commented 3 years ago

Yes, I meant 'converge'; forgive my poor English. Then which results did the paper record? Only the converged runs, ignoring the non-converged ones?

MostafaDehghani commented 3 years ago

No problem at all! As far as I remember, in almost all models you will observe no improvement in the metrics we care about after some number of training steps, even if the loss is still changing (mostly fluctuating). So we chose to fix the number of epochs at 200. I had runs with 1000 epochs, but you don't see significant improvement in accuracy.

liuyang148 commented 3 years ago

OK, I got it. Thanks for your help.

jnhwkim commented 3 years ago

@MostafaDehghani I understand the task is difficult to learn and slow to converge. I tried three times with different config.random_seed values for Performer, but it keeps failing to converge and test accuracies stay around 50%. How can I reproduce the number in the paper, i.e., 77.05 (the best score in Table 1)?

yinzhangyue commented 2 years ago

@jnhwkim I encountered the same situation as you.

MostafaDehghani commented 2 years ago

@jnhwkim @yinzhangyue Can you point me to the exact config file you're using in the LRA codebase?

yinzhangyue commented 2 years ago

I didn't change the config file; here is base_pathfinder32_config.py:

# Copyright 2021 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Base Configuration."""

import ml_collections

NUM_EPOCHS = 200
TRAIN_EXAMPLES = 160000
VALID_EXAMPLES = 20000

def get_config():
  """Get the default hyperparameter configuration."""
  config = ml_collections.ConfigDict()
  config.batch_size = 512
  config.eval_frequency = TRAIN_EXAMPLES // config.batch_size
  config.num_train_steps = (TRAIN_EXAMPLES // config.batch_size) * NUM_EPOCHS
  config.num_eval_steps = VALID_EXAMPLES // config.batch_size
  config.weight_decay = 0.
  config.grad_clip_norm = None

  config.save_checkpoints = True
  config.restore_checkpoints = True
  config.checkpoint_freq = (TRAIN_EXAMPLES //
                            config.batch_size) * NUM_EPOCHS // 2
  config.random_seed = 0

  config.learning_rate = .001
  config.factors = 'constant * linear_warmup * cosine_decay'
  config.warmup = (TRAIN_EXAMPLES // config.batch_size) * 1
  config.steps_per_cycle = (TRAIN_EXAMPLES // config.batch_size) * NUM_EPOCHS

  # model params
  config.model = ml_collections.ConfigDict()
  config.model.num_layers = 1
  config.model.num_heads = 2
  config.model.emb_dim = 32
  config.model.dropout_rate = 0.1

  config.model.qkv_dim = config.model.emb_dim // 2
  config.model.mlp_dim = config.model.qkv_dim * 2
  config.model.attention_dropout_rate = 0.1
  config.model.classifier_pool = 'MEAN'
  config.model.learn_pos_emb = False

  config.trial = 0  # dummy for repeated runs.
  return config
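
For reference, here is the training schedule that config implies, as a minimal arithmetic sketch (no ml_collections needed). Note that the step counter in the logs above advances by 312 between evals, which matches the derived eval_frequency:

```python
# Derived schedule values from base_pathfinder32_config.py (plain arithmetic).
NUM_EPOCHS = 200
TRAIN_EXAMPLES = 160000
VALID_EXAMPLES = 20000
batch_size = 512

steps_per_epoch = TRAIN_EXAMPLES // batch_size   # also eval_frequency and warmup
num_train_steps = steps_per_epoch * NUM_EPOCHS   # total optimizer steps
num_eval_steps = VALID_EXAMPLES // batch_size
checkpoint_freq = num_train_steps // 2

print(steps_per_epoch, num_train_steps, num_eval_steps, checkpoint_freq)
# 312 62400 39 31200
```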

yinzhangyue commented 2 years ago

My run script:

PYTHONPATH="$(pwd)":"$PYTHON_PATH" python lra_benchmarks/image/train.py \
      --config=lra_benchmarks/image/configs/pathfinder32/performer_base.py \
      --model_dir=./tmp/pathfinder_F \
      --task_name=pathfinder32_hard

MostafaDehghani commented 2 years ago

I just checked, and it seems the configs in the repo are not synced with the internal configs we used to get the results in the paper. Not sure what went wrong; sorry about that. I'll work on updating the repo, but in the meantime, here are the settings you should use in the Performer config file to reproduce the reported score:

from lra_benchmarks.image.configs.pathfinder32 import base_pathfinder32_config

def get_config():
  """Get the default hyperparameter configuration."""
  config = base_pathfinder32_config.get_config()
  config.model_type = "performer"

  config.model.num_layers = 1
  config.model.num_heads = 8
  config.model.emb_dim = 128
  config.model.dropout_rate = 0.2
  config.model.qkv_dim = 64
  config.model.mlp_dim = 128

  return config
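
For clarity, here is a quick comparison of what this fix actually changes relative to the base config posted above (all values copied from this thread; the derived qkv_dim/mlp_dim of the base config are computed the same way base_pathfinder32_config.py does):

```python
# Base model params from base_pathfinder32_config.py, as posted in this thread.
base = {"num_layers": 1, "num_heads": 2, "emb_dim": 32, "dropout_rate": 0.1}
base["qkv_dim"] = base["emb_dim"] // 2   # 16
base["mlp_dim"] = base["qkv_dim"] * 2    # 32

# Overrides from the corrected Performer config above.
fixed = {"num_layers": 1, "num_heads": 8, "emb_dim": 128, "dropout_rate": 0.2,
         "qkv_dim": 64, "mlp_dim": 128}

# Every field except num_layers changes: the paper's model is much wider.
changed = {k: (base[k], fixed[k]) for k in fixed if base[k] != fixed[k]}
print(changed)
```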

yinzhangyue commented 2 years ago

Thank you! I will try it immediately.

yinzhangyue commented 2 years ago

It works! Thank you very much! ^o^