optuna / optuna

A hyperparameter optimization framework
https://optuna.org
MIT License

Reproduction of best trial when loading pre-trained models #1487

Closed ofersabo closed 4 years ago

ofersabo commented 4 years ago

I believe I encountered the same issue as the one described in issue #975.

I'm running distributed experiments across servers to find the best hyperparameters. When I try to reproduce the best experiment, I copy all the parameters from the config file (I'm using Optuna with AllenNLP), including the seed, but for some reason I can't reproduce the exact same results (I also run the reproduction on the same server). Reproduction works fine when I don't use Optuna's distributed trials, so I can replicate experiments as long as the Optuna package is not involved. It also doesn't happen when I use my own model, i.e. when I create my own PyTorch model and run it within an Optuna distributed trial. It happens only when I load a pre-trained model (typically from Hugging Face). Any ideas why this may happen?

This is the related issue where it was discussed: https://github.com/optuna/optuna/issues/975#issuecomment-594482556

I'll try to see if the model parameters are different at the beginning of each trial.

The code looks like this

def objective(trial):
    lr = trial.suggest_loguniform('lr', 1e-6, 1e-3)
    pct_start = trial.suggest_uniform('pct_start', 0.05, 0.5)
    b1 = trial.suggest_uniform('b1', 0.7, 0.9)
    b2 = trial.suggest_uniform('b2', 0.6, 0.98)
    eps = trial.suggest_categorical('eps', [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6])
    wd = trial.suggest_loguniform('wd', 1e-8, 1e-3)

    # destroy old learner
    try: 
        learn.destroy()
    except:
        print('no learner created')

    learn = Learner(data_clas,
        custom_transformer_model,
        opt_func = lambda input: AdamW(input,correct_bias=False, eps=eps),
        loss_func = FlattenedLoss(LabelSmoothingCrossEntropy, axis=-1),
        metrics = [accuracy],
        wd = wd,
        callback_fns=[partial(FastAIPruningCallback, trial=trial, monitor='accuracy')])
    # For roberta-base
    list_layers = [learn.model.transformer.roberta.embeddings,
                learn.model.transformer.roberta.encoder.layer[0],
                learn.model.transformer.roberta.encoder.layer[1],
                learn.model.transformer.roberta.encoder.layer[2],
                learn.model.transformer.roberta.encoder.layer[3],
                learn.model.transformer.roberta.encoder.layer[4],
                learn.model.transformer.roberta.encoder.layer[5],
                learn.model.transformer.roberta.encoder.layer[6],
                learn.model.transformer.roberta.encoder.layer[7],
                learn.model.transformer.roberta.encoder.layer[8],
                learn.model.transformer.roberta.encoder.layer[9],
                learn.model.transformer.roberta.encoder.layer[10],
                learn.model.transformer.roberta.encoder.layer[11],
                learn.model.transformer.roberta.pooler]

    learn.split(list_layers)
    learn.load('initial')
    learn.unfreeze() 
    learn.to_fp16()
    learn.fit_one_cycle(1,
                lr,
                pct_start = pct_start,
                moms = (b1, b2))

    return learn.validate()[-1].item() # returns accuracy

Where custom_transformer_model loads a pre-trained RoBERTa model. I even made sure that the fastai learner was being destroyed at the end of each trial. When I print out the model weights before training in each trial, I get something like

[image: model weights printed before training in each trial]

However, if I save the initial model weights beforehand and load them at the beginning of each trial with learn.load('initial'), the weights remain consistent across trials:

[image: model weights after learn.load('initial'), consistent across trials]

However, this seems to be an issue only when using a learner with a custom model; I have used Optuna with a tabular_learner without needing to reset weights between trials, so it could be an issue with the fastai library.

Edit: I did some further testing with the fastai learner. It seems that the model weight updates persist even when you destroy the learner and create a new one, so for a custom model you need to either create a new model or reset the weights between trials.

Originally posted by @maxmatical in https://github.com/optuna/optuna/issues/975#issuecomment-594482556
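
For illustration, here is a minimal, framework-agnostic sketch of the "reset weights between trials" pattern described above, using plain PyTorch and Optuna (the model and objective are hypothetical placeholders, not code from this issue):

import copy

import optuna
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained model; in practice this would be
# e.g. a Hugging Face transformer loaded once outside the objective.
base_model = nn.Linear(10, 2)

# Snapshot the initial weights once, before any trial runs.
initial_state = copy.deepcopy(base_model.state_dict())

def objective(trial):
    lr = trial.suggest_float("lr", 1e-6, 1e-3, log=True)

    # Restore the pristine weights so earlier trials cannot leak into this one.
    base_model.load_state_dict(initial_state)

    optimizer = torch.optim.AdamW(base_model.parameters(), lr=lr)
    # ... training loop would go here ...

    return 0.0  # placeholder; a real objective would return a validation metric

study = optuna.create_study()
study.optimize(objective, n_trials=3)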

toshihikoyanase commented 4 years ago

@ofersabo Thank you for your report. I'm not sure of the exact cause of your problem, but I can see some potential causes that would prevent reproduction.

Numerical precision of parameter values stored in RDBStorage

I think you used RDBStorage with an RDB server to execute distributed trials. Some RDB servers, such as MySQL, do not store parameter values at full precision. Please take a look at the following example:

import optuna

def objective(trial):
    return trial.suggest_float("x", 0, 1)

def run(storage):
    study = optuna.create_study(storage=storage)
    study.optimize(objective, n_trials=2)
    # Re-run the best parameters as a new trial (Trial 2 below).
    study.enqueue_trial(study.best_params)
    study.optimize(objective, n_trials=1)

run(storage=None)
run(storage="mysql+pymysql://user:password@localhost/optunatest")
# InMemoryStorage
[I 2020-07-07 14:20:02,807] Trial 0 finished with value: 0.7449180515757554 and parameters: {'x': 0.7449180515757554}. Best is trial 0 with value: 0.7449180515757554.
[I 2020-07-07 14:20:02,808] Trial 1 finished with value: 0.9220510742298788 and parameters: {'x': 0.9220510742298788}. Best is trial 0 with value: 0.7449180515757554.
[I 2020-07-07 14:20:02,810] Trial 2 finished with value: 0.7449180515757554 and parameters: {'x': 0.7449180515757554}. Best is trial 0 with value: 0.7449180515757554.

# RDBStorage (MySQL)
[I 2020-07-07 14:20:03,565] Trial 0 finished with value: 0.3715065131507872 and parameters: {'x': 0.3715065131507872}. Best is trial 0 with value: 0.371507.
[I 2020-07-07 14:20:03,746] Trial 1 finished with value: 0.10545282062440076 and parameters: {'x': 0.10545282062440076}. Best is trial 1 with value: 0.105453.
[I 2020-07-07 14:20:04,237] Trial 2 finished with value: 0.105453 and parameters: {'x': 0.105453}. Best is trial 1 with value: 0.105453.

This code executes two trials and then re-runs the best one as Trial 2. With InMemoryStorage, Trial 2 produces the same objective value as the best trial. With RDBStorage, on the other hand, 0.10545282062440076 is stored as 0.105453, and the objective value of Trial 2 is slightly different from that of the best trial.
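
As a side note (not something discussed in this thread), one way to keep the exact values around for later inspection is to record them separately, for example as trial user attributes stored as strings; whether this fully avoids truncation depends on how the storage serializes attributes. A minimal sketch:

import optuna

def objective(trial):
    x = trial.suggest_float("x", 0, 1)
    # Keep a full-precision copy as a string alongside the stored parameter,
    # since the parameter column itself may be truncated by the RDB server.
    trial.set_user_attr("x_repr", repr(x))
    return x

study = optuna.create_study()
study.optimize(objective, n_trials=2)
print(study.best_trial.user_attrs["x_repr"])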

Similar phenomena can be seen in the AllenNLP example if we change the storage.

[I 2020-07-07 13:12:33,063] Trial 0 finished with value: 0.49704 and parameters: {'DROPOUT': 0.32995516922267865, 'EMBEDDING_DIM': 35, 'MAX_FILTER_SIZE': 5, 'NUM_FILTERS': 17, 'HIDDEN_SIZE': 22}. Best is trial 0 with value: 0.49704.
[I 2020-07-07 13:13:41,165] Trial 1 finished with value: 0.48316 and parameters: {'DROPOUT': 0.010314982351137314, 'EMBEDDING_DIM': 23, 'MAX_FILTER_SIZE': 4, 'NUM_FILTERS': 20, 'HIDDEN_SIZE': 16}. Best is trial 0 with value: 0.49704.
[I 2020-07-07 13:16:02,643] Trial 2 finished with value: 0.50004 and parameters: {'DROPOUT': 0.329955, 'EMBEDDING_DIM': 35, 'HIDDEN_SIZE': 22, 'MAX_FILTER_SIZE': 5, 'NUM_FILTERS': 17}. Best is trial 2 with value: 0.50004.

Effect of the previous trial

As reported in https://github.com/optuna/optuna/issues/975#issuecomment-594482556, leftover state from a previous trial may affect the current trial. If so, the trials are not independent and cannot be reproduced with a single trial execution. I think it depends on the software you used, so could you share more information about your objective function? That would be a great help to us.

ofersabo commented 4 years ago

Hi @toshihikoyanase Many thanks for the help and for this detailed comment.

I believe that the problem I encountered arises from your second point.

About the first point: I see actual differences between the model saved by Optuna and the model saved by my reproduction process. Also, when I re-load the model and run evaluation on any dataset, I see differences in the results; these differences are larger than a full point, so it isn't a floating-point issue.

RE the second point:

allennlp==0.8.5, optuna==1.5.0, pytorch-transformers==1.1.0. I'm using the BERT-BASE-CASED model. Yes, I'm using the AllenNLP executor; to be precise, here is the class I'm using: optuna.integration.allennlp.AllenNLPExecutor
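
For reference, a minimal sketch of how this class is typically wired into an objective function (the config path, serialization directory, and metric name are assumptions for illustration, not details from this thread, and the exact constructor arguments may differ between Optuna versions):

import optuna
from optuna.integration.allennlp import AllenNLPExecutor

def objective(trial):
    # Suggest hyperparameters; the executor passes the suggested values
    # through to the Jsonnet config.
    trial.suggest_float("DROPOUT", 0.0, 0.5)
    trial.suggest_int("EMBEDDING_DIM", 16, 64)

    executor = AllenNLPExecutor(
        trial,
        "config.jsonnet",                         # assumed path to the AllenNLP config
        "result/trial_{}".format(trial.number),   # per-trial serialization directory
        metrics="best_validation_accuracy",       # assumed metric name
    )
    return executor.run()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)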

The dataset is a public dataset for academic use called TACRED; it's a relation classification dataset.

toshihikoyanase commented 4 years ago

@ofersabo Thank you for your quick response.

Also, when I re-load the model and run evaluation on any dataset, I see differences in the results; these differences are larger than a full point, so it isn't a floating-point issue.

May I ask how you re-load the model? I'd like to understand your hyperparameter tuning workflow. I guess it consists of the following steps:

  1. Create a Jsonnet file for AllenNLPExecutor
  2. Run AllenNLPExecutor
  3. Fill the best params into the Jsonnet file
  4. Reproduce the best trial using the allennlp train command

If so, I'd like to know if we can get the same value when we run step 4 twice.
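
As an aside, steps 3 and 4 could look roughly like the following, assuming the Jsonnet config reads the hyperparameters via std.extVar (AllenNLP exposes environment variables as Jsonnet external variables); the study name, storage URL, and file names are placeholders:

import os
import subprocess

import optuna

# Load the finished study (placeholder study name and storage URL).
study = optuna.load_study(study_name="my_study", storage="sqlite:///optuna.db")

# Expose the best parameters as external variables for the Jsonnet config.
for name, value in study.best_params.items():
    os.environ[name] = str(value)

# Re-train with the tuned configuration.
subprocess.run(
    ["allennlp", "train", "config.jsonnet", "-s", "reproduced_result"],
    check=True,
)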

Yes, I'm using the AllenNLP executor; to be precise, here is the class I'm using: optuna.integration.allennlp.AllenNLPExecutor

@himkt Do you have any ideas about this issue? For example, I'm curious if AllenNLPExecutor can somehow store information as global variables.

ofersabo commented 4 years ago

I load the model with the allennlp evaluate command, which under the hood eventually uses PyTorch's model.load_state_dict().
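
For context, the underlying PyTorch call amounts to something like this (a toy model and file name for illustration only):

import torch
import torch.nn as nn

# Toy model standing in for the trained AllenNLP model.
model = nn.Linear(4, 2)

# Save and re-load the weights the way `allennlp evaluate` ultimately does.
torch.save(model.state_dict(), "model_state.th")
state_dict = torch.load("model_state.th", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()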

@toshihikoyanase The reproduction steps you mentioned are accurate. If I run step 4 twice, I get the same results, but they differ from the original results obtained from the Optuna trial.

himkt commented 4 years ago

@ofersabo

Sorry for the late reply. I'm wondering whether you ran the experiments on GPU. If training is run on GPU (with cuDNN), the results can vary between runs. (ref. https://github.com/allenai/allennlp/issues/387)
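
For reference, the usual knobs for making GPU runs more deterministic look like the following general PyTorch sketch (not code from this thread; even with these settings, full determinism is not guaranteed for every operation):

import random

import numpy as np
import torch

def set_deterministic(seed=42):
    # Seed every RNG the training code might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Ask cuDNN for deterministic algorithms and disable auto-tuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_deterministic(42)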

I conducted the experiment here based on the Optuna x AllenNLP example. In the log output, I confirmed that the results vary between runs on GPU and are identical between runs on CPU.

Could you please tell me whether the problem happens even on CPU? And if you notice anything notable in your script, please let me know.

ofersabo commented 4 years ago

Hi @himkt Thanks for the reply.

The experiments were done on GPUs; I used 4 GPUs. Yes, I can try running these trials on CPU, but this may take a long time since I'm using BERT; actually, I'm not even sure it's feasible. I will keep you updated.

Even if this doesn't happen on CPU (i.e. models can be reproduced when using the CPU), as your attached log output suggests, I still think there is an issue with the Optuna wrapper. The evidence is that I can reproduce experiments that were run directly from the allennlp config file.

himkt commented 4 years ago

Thank you @ofersabo for the quick response!

The evidence is that I can reproduce experiments that were run directly from the allennlp config file.

Hmm... in my environment, allennlp train produces different results across multiple runs. I used the following allennlp config:

{
    "dataset_reader": {
        "lazy": false,
        "token_indexers": {
            "tokens": {
                "lowercase_tokens": true,
                "type": "single_id"
            }
        },
        "tokenizer": {
            "word_splitter": "just_spaces"
        },
        "type": "text_classification_json"
    },
    "datasets_for_vocab_creation": [
        "train"
    ],
    "iterator": {
        "batch_size": 10,
        "type": "basic"
    },
    "model": {
        "dropout": 0,
        "seq2vec_encoder": {
            "embedding_dim": 34,
            "ngram_filter_sizes": [
                1,
                2,
                3,
                4,
                5
            ],
            "num_filters": 30,
            "output_dim": 17,
            "type": "cnn"
        },
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "embedding_dim": 34
                }
            }
        },
        "type": "basic_classifier"
    },
    "numpy_seed": 42,
    "pytorch_seed": 42,
    "random_seed": 42,
    "train_data_path": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/imdb/dev.jsonl",
    "trainer": {
        "cuda_device": 0,
        "num_epochs": 5,
        "num_serialized_models_to_keep": 1,
        "optimizer": {
            "lr": 0.1,
            "type": "adam"
        },
        "patience": 2,
        "validation_metric": "+accuracy"
    },
    "validation_data_path": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/imdb/test.jsonl"
}

Could you please tell me if you can spot any difference between my config and yours (apart from the model architecture)?

Also, I want to ask whether the model that has the consistency problem and the model you used for the reproduction experiment are the same. I'm wondering whether the BERT-based model includes atomic operations while the model you used for the reproduction does not (https://pytorch.org/docs/stable/notes/randomness.html). If a model contains atomic operations, it can produce inconsistent results across multiple runs. @ofersabo Is it possible to share the model architectures in detail? I'm not that familiar with cuDNN, though; @hvy please give me any advice if you know something.

ofersabo commented 4 years ago

Yes, these are the same models. The one used by Optuna and the one used directly with AllenNLP (which I can reproduce) are the same model.

Many things are different between our config files because we use a completely different model and setting, but as you can see, the essential parts are not very different.

You can find the config file in the repository; the link is below.

OK, you may have spotted the issue. I do use torch.bincount() in my metric evaluation, which, as stated in the provided link, uses nondeterministic addition operations. However, when I call this function the tensors are already on the CPU; I use it to measure the F1 score. Also, I still wonder what makes the model reproducible outside the Optuna environment. Yep, I can share the model architecture as well as the code.
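
As an aside, the nondeterminism concern applies to torch.bincount() on CUDA tensors (its GPU implementation relies on atomic adds); on CPU tensors the result is deterministic. A small, self-contained illustration of bincount-style label counting (hypothetical labels, not the metric code from this issue):

import torch

# Hypothetical predicted and gold labels for a 3-class problem.
preds = torch.tensor([0, 2, 1, 2, 0, 1])
golds = torch.tensor([0, 2, 2, 2, 0, 1])

# Per-class counts used for precision/recall/F1-style metrics.
pred_counts = torch.bincount(preds, minlength=3)
gold_counts = torch.bincount(golds, minlength=3)
correct = torch.bincount(preds[preds == golds], minlength=3)

print(pred_counts, gold_counts, correct)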

It's not a straightforward model that can be understood immediately, so if you have any questions please let me know. The repo with the config file and the code for running the model: https://github.com/ofersabo/share_with_optuna

himkt commented 4 years ago

Thank you for sharing your repo! I've gone through your implementation. However, I couldn't run your code since I couldn't get TACRED (and the dataset reader). I'd be really happy if you could give me a small, fully reproducible example that consists of standard AllenNLP components.

github-actions[bot] commented 4 years ago

This issue has not seen any recent activity.

himkt commented 4 years ago

Let me close this issue, as there is no activity now. Please feel free to reopen as needed.