huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Why does the hidden state of the same input token change every time I call the same GPT2 model? #2627

Closed h56cho closed 4 years ago

h56cho commented 4 years ago

Hello,

Say I fixed my input to the GPT2 model:

input_ids = test_i[:,0]
input_ids = torch.tensor(input_ids.tolist()).unsqueeze(0)

Then I try to retrieve the hidden state vector of the last token:

tst_hidden_states = best_model(input_ids)[3][1][0, (test_i.size()[0] - 1), :].detach()
tst_hidden_states[0:5]
>>>tensor([-0.0146,  0.0718, -0.0297, -0.0000, -0.0315])

but when I repeat the above process with exactly the same input, the hidden state of the last token keeps changing:

tst_hidden_states = best_model(input_ids)[3][1][0, (test_i.size()[0] - 1), :].detach()
tst_hidden_states[0:5]
>>> tensor([-0.0146,  0.0000, -0.0297, -0.0212, -0.0315])

Given that I didn't change the model, I don't understand why the hidden state for the same input and the same token changes on every call. How can I prevent the hidden state from changing?

Thank you,

h56cho commented 4 years ago

Hello,

The hidden state vectors don't seem to change for a fixed input and token when I use the Hugging Face pre-trained GPT2 model, but in my case I built and trained my own GPT2 model as follows:


# imports inferred from the code below
import gc
import math
import time

import torch
import torchtext
from torchtext.data import Field
from transformers import GPT2Config, GPT2DoubleHeadsModel, AdamW, get_constant_schedule

bptt = 1024
batch_size = 1
log_int = 50
nlayer = 6

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gc.set_threshold(700, 10, 10)

# define the English text field
TEXT_ch2 = Field(init_token = '<sos>',
                 eos_token = '<eos>',
                 unk_token = '<unk>',
                 pad_token = '<pad>',
                 fix_length = bptt,
                 lower = True)

# split the PennTreeBank corpus into a train, val, and test set.
train_penn, val_penn, test_penn = torchtext.datasets.PennTreebank.splits(TEXT_ch2)

# initialize new_train_penn
new_train_penn = train_penn

# build vocabulary based on the field that we just defined.
# (building vocabulary over all language datasets)
TEXT_ch2.build_vocab(new_train_penn, val_penn, test_penn,
                     specials=['<sos>','<eos>','<unk>','<pad>','<mask>','<mcoption>','<question>'])

# define special token indices
mask_index_ch2 = TEXT_ch2.vocab.stoi['<mask>']
pad_index_ch2 = TEXT_ch2.vocab.stoi['<pad>']
mcoption_index_ch2 = TEXT_ch2.vocab.stoi['<mcoption>']
question_index_ch2 = TEXT_ch2.vocab.stoi['<question>']
eos_index_ch2 = TEXT_ch2.vocab.stoi['<eos>']
sos_index_ch2 = TEXT_ch2.vocab.stoi['<sos>']
unk_index_ch2 = TEXT_ch2.vocab.stoi['<unk>']

# set hyperparameter ntokens
ntokens = len(TEXT_ch2.vocab.stoi)

## define GPT-2 configuration.
GPT2config_ch2 = GPT2Config(vocab_size_or_config_json_file = ntokens,
                                  cutoffs = [20000, 40000, 200000], 
                                  n_positions = 1024, 
                                  n_embd = 768, 
                                  n_head = 12, 
                                  n_layer = nlayer,
                                  resid_pdrop = 0.1,
                                  embd_pdrop = 0.1,
                                  attn_pdrop = 0.1,
                                  output_hidden_states = True,
                                  output_attentions = True)

# define the GPT-2 model based on the specified configuration.
model_ch2 = GPT2DoubleHeadsModel(GPT2config_ch2)

# add new tokens to the embeddings of our model
model_ch2.resize_token_embeddings(ntokens)

def train_lm_head(model, train_iter, optimizer, scheduler, log_interval, pad_index):

    # turn on training mode
    model.train()

    # initialize total_loss to 0
    total_loss = 0

    # list(enumerate(train_penn_iter))[0][1] would extract the 1st batch
    for batch_index, batch in enumerate(train_iter):

        gc.collect()

        input_ids = [instance for instance in batch.text]

        ## NOTE: position embeddings are created automatically by GPT2DoubleHeadsModel as (0, 1, ..., N)

        # set the gradient back to 0 (necessary step)
        optimizer.zero_grad() 

        input_ids = torch.tensor([input_ids], dtype=torch.long)

        loss = model(input_ids, lm_labels = input_ids)[0]
        # 'loss' here is the cross entropy.
        # recall: 'input_ids' is defined above.

        # backpropagate: compute the gradient of the loss w.r.t. the weights
        loss.backward()

        # clip the norm of the gradients of all parameters.
        # The norm is computed over all gradients together, as if they were
        # concatenated into a single vector; gradients are modified in-place.
        # If the total norm exceeds 0.5, the gradients are rescaled, and the total norm is returned.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)

        optimizer.step() # update the weights (the learning rate follows the constant schedule).

        # add this batch's loss to the running total
        total_loss = total_loss + loss.item()

        # python format: 's' for string, 'd' to display decimal integers (10-base), and 'f' for floats.
        # ex: print("Sammy ate {0:.3f} percent of a pizza!".format(75.765367))
        #     >> Sammy ate 75.765 percent of a pizza!
        #     print("Sammy ate {0:f} percent of a {1}!".format(75, "pizza"))
        #     >> Sammy ate 75.000000 percent of a pizza! 
        #
        # Below is good enough since we are doing the Stochastic Gradient Descent.
        # (i.e. 1 batch = 1 sample)

        if batch_index % log_interval == 0 and batch_index > 0:
            cur_loss = total_loss / log_interval
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.9f} | loss {:5.4f} | ppl {:8.4f}'.format(
                    epoch, batch_index, len(train_iter), scheduler.get_lr()[0], cur_loss, math.exp(cur_loss)))

            total_loss = 0 

        del input_ids
        del loss
        gc.collect()                 

# evaluate the model on the validation dataset to check the result.
def evaluate_lm_head(model, val_iter, pad_index):
    model.eval() # Turn on the evaluation mode
    total_loss = 0.
    with torch.no_grad():

        for batch_index, batch in enumerate(val_iter):

            gc.collect()

            val_input_ids = [instance for instance in batch.text]
            val_input_ids = torch.tensor([val_input_ids], dtype=torch.long)

            ## NOTE: position embeddings are created automatically by GPT2DoubleHeadsModel as (0, 1, ..., N)
            loss =  model(val_input_ids, lm_labels = val_input_ids)[0]
            total_loss = total_loss + loss

            del val_input_ids
            del loss
            gc.collect()

    return total_loss / (len(val_iter) - 1)

# loop over epochs to find the best model (the best GPT-2 language model trained on PennTreebank)
optimizer_ch2 = AdamW(model_ch2.parameters(), lr = 0.00000485, correct_bias = True)

scheduler_ch2 = get_constant_schedule(optimizer = optimizer_ch2, last_epoch = -1)

best_val_loss = float("inf")
epochs = 5 # the total number of epochs ... since PennTreebank is reasonably large, 5 epochs (> 1) are likely to be enough
           # see: https://stackoverflow.com/questions/38000189/is-it-ok-to-only-use-one-epoch

# initialize best_model_ch2_penn to None
best_model_ch2_penn = None

for epoch in range(1, epochs + 1):

    gc.collect()

    epoch_start_time = time.time()

    # again, batch_size = 1, so each update is effectively stochastic gradient descent
    train_lm_head(model_ch2, train_penn_iter, 
                  optimizer_ch2, scheduler_ch2, 
                  log_int, pad_index_ch2)

    val_loss = evaluate_lm_head(model_ch2, val_penn_iter, 
                                pad_index_ch2)

    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.4f} | '
          'valid ppl {:8.4f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = model_ch2

    gc.collect()

    scheduler_ch2.step() # update the learning rate

When I use the best_model obtained from this training loop and pass in the same input, the hidden state of the last token still keeps changing each time I compute it. How can I prevent this?

Would saving the best_model as a pre-trained model and re-loading it prevent the hidden state from changing? If so, what is the code to save and re-load the best_model as a pre-trained model? I am having a hard time following the documentation, as I am just a beginner.
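
(For reference, a minimal sketch of saving and re-loading with the transformers save_pretrained / from_pretrained methods; the directory name './best_model_ch2' is only an example:)

import os

os.makedirs('./best_model_ch2', exist_ok=True)                          # make sure the target directory exists
best_model.save_pretrained('./best_model_ch2')                          # write the fine-tuned weights and config to disk
best_model = GPT2DoubleHeadsModel.from_pretrained('./best_model_ch2')   # load them back later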

Thank you,

BramVanroy commented 4 years ago

This is too much code for me to debug now. But generally, inconsistent inference is caused by not setting your model to evaluation mode. Do model.eval() before retrieving your vector. This puts dropout (and normalization) layers into inference mode; dropout is pseudorandom, which is what causes the inconsistent results.
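
For example, a minimal sketch applied to the snippet from the original post (best_model, input_ids, and test_i are the names used above, and the output indexing is copied from there):

best_model.eval()        # put dropout into inference mode so repeated calls are deterministic
with torch.no_grad():    # gradients are not needed when only reading hidden states
    outputs = best_model(input_ids)

tst_hidden_states = outputs[3][1][0, (test_i.size()[0] - 1), :]
tst_hidden_states[0:5]   # now returns the same values on every call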

h56cho commented 4 years ago

Thank you! This solved my problem. Is it necessary to include model.eval() before retrieving the loss to update the weights in my train() function? Or should I NOT use model.eval() in my train() function, because dropout and normalization need to be applied during training (which I am not so sure about)?

Thank you,

BramVanroy commented 4 years ago

This is more a "deep learning with PyTorch" question than a transformers question, so I'll be brief. If you have more questions, please ask them on Stack Overflow.

.eval() is used when you are not training, i.e. when you wish to get deterministic values from your model. This is typically done during evaluation and testing. When you are training, though, you want things such as dropout because they have been shown to be beneficial for the training process (e.g. they combat overfitting). To ensure that the model is using dropout etc., you should put it back into training mode (in contrast to evaluation mode) by calling model.train().

In addition to .eval() vs .train(), there is also the grad vs no_grad distinction. During training, weights have requires_grad set, which tells PyTorch that gradients need to be calculated for those parameters. As you can imagine, that is a computationally expensive step which we don't need during testing/evaluating, so we can disable gradient calculation with the context manager torch.no_grad().

So, in practice your code could look something like this (but it might look different, or you might use steps instead of epochs, etc.). (Note, this is pseudo code.)

for epoch in range(n_epochs):
    # train
    model.train()
    for batch in train_loader:
        out = model(batch)
        ...
    # evaluate
    model.eval()
    with torch.no_grad():
        for batch in eval_loader:
            out = model(batch)
            ...
...
# test
model.eval()
with torch.no_grad():
    for batch in test_loader:
        test = model(batch)
        ...

Again, if you have more detailed questions about this, please ask them on Stack Overflow.

h56cho commented 4 years ago

Thank you for all your help, I appreciate it!