stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Word embedding mismatch for English pre-trained NER model #339

Closed JackGodfrey122 closed 4 years ago

JackGodfrey122 commented 4 years ago

There appears to be a mismatch in the word embedding dimension for the pretrained English NER model. The model definition states that the word embedding dimension is 300, but I was under the impression that this NER model was trained on the CoNLL17 corpus word vectors, which have dimension 100.

After loading the NER model using torch.load, we can see that the dimension is indeed 300. However, I can still resume training this NER model while loading the dimension-100 vectors. Was the 300 just a typo in the original model config?

I may be misunderstanding something crucial here, but any verification would be helpful.
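For example, a check along these lines shows it (a minimal sketch; the 'config' key also appears in check_dims.py further down, but the 'model' key name is an assumption about how the checkpoint is stored):

# quick dimension check on the downloaded checkpoint (sketch)

import torch

ckpt = torch.load('/Users/jackgodfrey/stanza_resources/en/ner/ontonotes.pt', map_location='cpu')
print(ckpt['config']['word_emb_dim'])           # the saved config reports 300
print(ckpt['model']['word_emb.weight'].shape)   # the stored weight is torch.Size([100000, 300])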

yuhui-zh15 commented 4 years ago

Hi @JackGodfrey122, we use 100-d GloVe vectors to train our NER models. Can you share the exact code you used to verify the dimension and to resume training the NER model?

JackGodfrey122 commented 4 years ago

Hi @yuhui-zh15.

Firstly, thanks for such a quick reply; I really appreciate it.

Before I explain my code, here is my environment:

Python version: 3.7.6

requirements.txt:

certifi==2020.4.5.1
chardet==3.0.4
future==0.18.2
idna==2.9
numpy==1.18.5
protobuf==3.12.2
requests==2.23.0
six==1.15.0
stanza==1.0.1
torch==1.5.0
tqdm==4.46.1
urllib3==1.25.9

I have pasted two files here:

# resume_training.py

import sys
import os
import time
from datetime import datetime
import argparse
import logging
import numpy as np
import random
import json
import torch
from torch import nn, optim

from stanza.models.ner.data import DataLoader
from stanza.models.ner.trainer import Trainer
from stanza.models.ner import scorer
from stanza.models.common import utils
from stanza.models.common.pretrain import Pretrain
from stanza.utils.conll import CoNLL
from stanza.models.common.doc import *
from stanza.models import _training_logging

logger = logging.getLogger('stanza')

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str, default='data/ner', help='Root dir for saving models.')
    parser.add_argument('--wordvec_dir', type=str, default='extern_data/word2vec', help='Directory of word vectors')
    parser.add_argument('--wordvec_file', type=str, default='', help='File that contains word vectors')
    parser.add_argument('--train_file', type=str, default=None, help='Input file for data loader.')
    parser.add_argument('--eval_file', type=str, default=None, help='Input file for data loader.')
    parser.add_argument('--base_model', type=str, default=None, help='Path for base model.')
    parser.add_argument('--charlm_forward_file', type=str, default=None, help='Path for character lm forward model.')
    parser.add_argument('--charlm_backward_file', type=str, default=None, help='Path for character lm backward model.')

    parser.add_argument('--mode', default='train', choices=['train', 'predict'])
    parser.add_argument('--lang', type=str, help='Language')
    parser.add_argument('--shorthand', type=str, help="Treebank shorthand")

    parser.add_argument('--hidden_dim', type=int, default=256)
    parser.add_argument('--char_hidden_dim', type=int, default=100)
    parser.add_argument('--word_emb_dim', type=int, default=100)
    parser.add_argument('--char_emb_dim', type=int, default=100)
    parser.add_argument('--num_layers', type=int, default=1)
    parser.add_argument('--char_num_layers', type=int, default=1)
    parser.add_argument('--pretrain_max_vocab', type=int, default=100000)
    parser.add_argument('--word_dropout', type=float, default=0)
    parser.add_argument('--locked_dropout', type=float, default=0.0)
    parser.add_argument('--dropout', type=float, default=0.5)
    parser.add_argument('--rec_dropout', type=float, default=0, help="Word recurrent dropout")
    parser.add_argument('--char_rec_dropout', type=float, default=0, help="Character recurrent dropout")
    parser.add_argument('--char_dropout', type=float, default=0, help="Character-level language model dropout")
    parser.add_argument('--no_char', dest='char', action='store_false', help="Turn off character model.")
    parser.add_argument('--charlm', action='store_true', help="Turn on contextualized char embedding using character-level language model.")
    parser.add_argument('--charlm_save_dir', type=str, default='saved_models/charlm', help="Root dir for pretrained character-level language model.")
    parser.add_argument('--charlm_shorthand', type=str, default=None, help="Shorthand for character-level language model training corpus.")
    parser.add_argument('--char_lowercase', dest='char_lowercase', action='store_true', help="Use lowercased characters in charater model.")
    parser.add_argument('--no_lowercase', dest='lowercase', action='store_false', help="Use cased word vectors.")
    parser.add_argument('--no_emb_finetune', dest='emb_finetune', action='store_false', help="Turn off finetuning of the embedding matrix.")
    parser.add_argument('--no_input_transform', dest='input_transform', action='store_false', help="Do not use input transformation layer before tagger lstm.")
    parser.add_argument('--scheme', type=str, default='bioes', help="The tagging scheme to use: bio or bioes.")

    parser.add_argument('--sample_train', type=float, default=1.0, help='Subsample training data.')
    parser.add_argument('--optim', type=str, default='sgd', help='sgd, adagrad, adam or adamax.')
    parser.add_argument('--lr', type=float, default=0.1, help='Learning rate.')
    parser.add_argument('--min_lr', type=float, default=1e-4, help='Minimum learning rate to stop training.')
    parser.add_argument('--momentum', type=float, default=0, help='Momentum for SGD.')
    parser.add_argument('--lr_decay', type=float, default=0.5, help="LR decay rate.")
    parser.add_argument('--patience', type=int, default=3, help="Patience for LR decay.")

    parser.add_argument('--max_steps', type=int, default=200000)
    parser.add_argument('--eval_interval', type=int, default=500)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--max_grad_norm', type=float, default=5.0, help='Gradient clipping.')
    parser.add_argument('--log_step', type=int, default=20, help='Print log every k steps.')
    parser.add_argument('--save_dir', type=str, default='saved_models/ner', help='Root dir for saving models.')
    parser.add_argument('--save_name', type=str, default=None, help="File name to save the model")

    parser.add_argument('--seed', type=int, default=1234)
    parser.add_argument('--cuda', type=bool, default=torch.cuda.is_available())
    parser.add_argument('--cpu', action='store_true', help='Ignore CUDA.')
    args = parser.parse_args()
    return args

def main():
    args = parse_args()

    torch.manual_seed(args.seed)
    np.random.seed(args.seed)
    random.seed(args.seed)
    if args.cpu:
        args.cuda = False
    elif args.cuda:
        torch.cuda.manual_seed(args.seed)

    args = vars(args)
    logger.info("Running tagger in {} mode".format(args['mode']))

    if args['mode'] == 'train':
        train(args)
    else:
        # the evaluate() helper was removed from this copy, so only train mode is supported here
        raise NotImplementedError("predict mode is not supported in resume_training.py")

def train(args):

    # where the (re)trained model will be saved; needed by trainer.save() below
    utils.ensure_dir(args['save_dir'])
    save_name = args['save_name'] if args['save_name'] else '{}_nertagger.pt'.format(args['shorthand'])
    model_file = os.path.join(args['save_dir'], save_name)

    # load pretrained vectors
    vec_file = utils.get_wordvec_file(args['wordvec_dir'], args['shorthand'])

    # do not save pretrained embeddings individually
    pretrain = Pretrain(None, vec_file, args['pretrain_max_vocab'], save_to_file=False)

    if args['charlm']:
        if args['charlm_shorthand'] is None: 
            logger.info("CharLM Shorthand is required for loading pretrained CharLM model...")
            sys.exit(0)

    # load data
    logger.info("Loading data with batch size {}...".format(args['batch_size']))
    train_doc = Document(json.load(open(args['train_file'])))
    train_batch = DataLoader(train_doc, args['batch_size'], args, pretrain, evaluation=False)
    vocab = train_batch.vocab
    dev_doc = Document(json.load(open(args['eval_file'])))
    dev_batch = DataLoader(dev_doc, args['batch_size'], args, pretrain, vocab=vocab, evaluation=True)
    dev_gold_tags = dev_batch.tags

    # skip training if the language does not have training or dev data
    if len(train_batch) == 0 or len(dev_batch) == 0:
        logger.info("Skip training because no data available...")
        sys.exit(0)

    logger.info("Training tagger...")
    trainer = Trainer(args=args, vocab=vocab, pretrain=pretrain, use_cuda=args['cuda'], model_file=args['base_model'])
    logger.info(trainer.model)

    global_step = 0
    max_steps = args['max_steps']
    dev_score_history = []
    best_dev_preds = []
    current_lr = trainer.optimizer.param_groups[0]['lr']
    global_start_time = time.time()
    format_str = '{}: step {}/{}, loss = {:.6f} ({:.3f} sec/batch), lr: {:.6f}'

    # LR scheduling
    if args['lr_decay'] > 0:
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(trainer.optimizer, mode='max', factor=args['lr_decay'], \
            patience=args['patience'], verbose=True, min_lr=args['min_lr'])
    else:
        scheduler = None

    # start training
    train_loss = 0
    while True:
        should_stop = False
        for i, batch in enumerate(train_batch):
            start_time = time.time()
            global_step += 1
            loss = trainer.update(batch, eval=False) # update step
            train_loss += loss
            if global_step % args['log_step'] == 0:
                duration = time.time() - start_time
                logger.info(format_str.format(datetime.now().strftime("%Y-%m-%d %H:%M:%S"), global_step,\
                        max_steps, loss, duration, current_lr))

            if global_step % args['eval_interval'] == 0:
                # eval on dev
                logger.info("Evaluating on dev set...")
                dev_preds = []
                for batch in dev_batch:
                    preds = trainer.predict(batch)
                    dev_preds += preds
                _, _, dev_score = scorer.score_by_entity(dev_preds, dev_gold_tags)

                train_loss = train_loss / args['eval_interval'] # avg loss per batch
                logger.info("step {}: train_loss = {:.6f}, dev_score = {:.4f}".format(global_step, train_loss, dev_score))
                train_loss = 0

                # save best model
                if len(dev_score_history) == 0 or dev_score > max(dev_score_history):
                    trainer.save(model_file)
                    logger.info("New best model saved.")
                    best_dev_preds = dev_preds

                dev_score_history += [dev_score]
                logger.info("")

                # lr schedule
                if scheduler is not None:
                    scheduler.step(dev_score)

            # check stopping
            current_lr = trainer.optimizer.param_groups[0]['lr']
            if global_step >= args['max_steps'] or current_lr <= args['min_lr']:
                should_stop = True
                break

        if should_stop:
            break

        train_batch.reshuffle()

    logger.info("Training ended with {} steps.".format(global_step))

    best_f, best_eval = max(dev_score_history)*100, np.argmax(dev_score_history)+1
    logger.info("Best dev F1 = {:.2f}, at iteration = {}".format(best_f, best_eval * args['eval_interval']))

if __name__ == "__main__":
    main()

resume_training.py: here I have essentially copied and pasted your ner_tagger.py file, but removed the evaluate function and a few checks. I have also added three new arguments:

  1. base_model: path to the ontonotes.pt model.
  2. charlm_forward_file: path to the forward 1billion.pt model.
  3. charlm_backward_file: path to the backward 1billion.pt model.

In the Trainer construction (quoted below), I pass model_file so that the trainer attempts to load base_model and resume from it (is this functionality correct?). The forward and backward charlm models are handled by the args that are passed in.

trainer = Trainer(args=args, vocab=vocab, pretrain=pretrain, use_cuda=args['cuda'], model_file=args['base_model'])
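As a quick sanity check right after that construction (hypothetical follow-up code, not in my script), the embedding matrix the resumed trainer ends up with can be inspected directly; since model_file is given, it should come from ontonotes.pt rather than being rebuilt from --word_emb_dim:

# continues from the Trainer construction above
print(trainer.model.word_emb.weight.shape)   # torch.Size([100000, 300]) for the downloaded ontonotes.pt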

I then run the following command:

python resume_training.py \
    --data_dir ./models \
    --wordvec_dir /Users/jackgodfrey/stanza_test/extern_data \
    --train_file /Users/jackgodfrey/stanza_test/data/train_data.json \
    --eval_file /Users/jackgodfrey/stanza_test/data/train_data.json \
    --mode train \
    --lang English \
    --shorthand en_ \
    --base_model /Users/jackgodfrey/stanza_resources/en/ner/ontonotes.pt \
    --word_emb_dim 300 \
    --char_emb_dim 100 \
    --charlm_forward_file /Users/jackgodfrey/stanza_resources/en/forward_charlm/1billion.pt \
    --charlm_backward_file /Users/jackgodfrey/stanza_resources/en/backward_charlm/1billion.pt \
    --char_hidden_dim 1024 \
    --max_steps 200 \
    --eval_interval 50

Output

Running tagger in train mode
Loading data with batch size 32...
Reading pretrained vectors from /Users/jackgodfrey/stanza_test/extern_data/word2vec/English/en.vectors.xz...
1 batches created.
1 batches created.
Training tagger...
NERTagger(
  (word_emb): Embedding(100000, 300, padding_idx=0)
  (charmodel): CharacterModel(
    (char_emb): Embedding(948, 100, padding_idx=0)
    (charlstm): PackedLSTM(
      (lstm): LSTM(100, 1024, batch_first=True, bidirectional=True)
    )
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (input_transform): Linear(in_features=2348, out_features=2348, bias=True)
  (taggerlstm): PackedLSTM(
    (lstm): LSTM(2348, 256, batch_first=True, bidirectional=True)
  )
  (tag_clf): Linear(in_features=512, out_features=77, bias=True)
  (crit): CRFLoss()
  (drop): Dropout(p=0.5, inplace=False)
  (worddrop): WordDropout(p=0)
  (lockeddrop): LockedDropout(p=0.0)
)
2020-06-05 20:25:47: step 20/200, loss = 2.120621 (0.742 sec/batch), lr: 0.100000
2020-06-05 20:26:02: step 40/200, loss = 0.004887 (0.728 sec/batch), lr: 0.100000
Evaluating on dev set...
Prec.   Rec.    F1
0.00    0.00    0.00
step 50: train_loss = 4.253206, dev_score = 0.0000

However, changing the arg word_emb_dim to 100 produces this error instead (the tagger LSTM input size drops from 2348 = 300 + 2 × 1024 to 2148 = 100 + 2 × 1024, so the checkpoint weights no longer fit)

Output:

RuntimeError: Error(s) in loading state_dict for NERTagger:
        size mismatch for word_emb.weight: copying a param with shape torch.Size([100000, 300]) from checkpoint, the shape in current model is torch.Size([100000, 100]).
        size mismatch for input_transform.weight: copying a param with shape torch.Size([2348, 2348]) from checkpoint, the shape in current model is torch.Size([2148, 2148]).
        size mismatch for input_transform.bias: copying a param with shape torch.Size([2348]) from checkpoint, the shape in current model is torch.Size([2148]).
        size mismatch for taggerlstm.lstm.weight_ih_l0: copying a param with shape torch.Size([1024, 2348]) from checkpoint, the shape in current model is torch.Size([1024, 2148]).
        size mismatch for taggerlstm.lstm.weight_ih_l0_reverse: copying a param with shape torch.Size([1024, 2348]) from checkpoint, the shape in current model is torch.Size([1024, 2148]).

check_dims.py: a simple script to print out the config of the ontonotes.pt NER model that is downloaded via stanza.download('en'), together with the args of the forward charlm model.

# check_dims.py

import torch

# load both checkpoints on CPU so no GPU is needed for the check
ner_model = torch.load('/Users/jackgodfrey/stanza_resources/en/ner/ontonotes.pt', map_location='cpu')
fw_model = torch.load('/Users/jackgodfrey/stanza_resources/en/forward_charlm/1billion.pt', map_location='cpu')

print('NER Model config: ', ner_model['config'])
print('Forward Char LM Model: ', fw_model['args'])

Output:

NER Model config:  {'data_dir': 'data/ner', 'wordvec_dir': './extern_data/wordvec', 'wordvec_file': 'extern_data/fasttext/English/crawl-300d-2M.vec', 'train_file': './data/ner/en_ontonotes.train.json', 'eval_file': './data/ner/en_ontonotes.dev.json', 'mode': 'train', 'lang': 'English', 'shorthand': 'en_ontonotes', 'hidden_dim': 256, 'char_hidden_dim': 1024, 'word_emb_dim': 300, 'char_emb_dim': 100, 'num_layers': 1, 'char_num_layers': 1, 'pretrain_max_vocab': 100000, 'word_dropout': 0, 'locked_dropout': 0.0, 'dropout': 0.5, 'rec_dropout': 0, 'char_rec_dropout': 0, 'char_dropout': 0, 'char': True, 'charlm': True, 'charlm_save_dir': 'saved_models/charlm', 'charlm_shorthand': 'en_1billion', 'char_lowercase': False, 'lowercase': False, 'emb_finetune': False, 'input_transform': True, 'scheme': 'bioes', 'sample_train': 1.0, 'optim': 'sgd', 'lr': 0.1, 'min_lr': 0.0001, 'momentum': 0, 'lr_decay': 0.5, 'patience': 3, 'max_steps': 200000, 'eval_interval': 1800, 'batch_size': 32, 'max_grad_norm': 5.0, 'log_step': 20, 'save_dir': 'saved_models/ner', 'save_name': None, 'seed': 1234, 'cuda': True, 'cpu': False, 'charlm_forward_file': 'saved_models/charlm/en_1billion_forward_charlm.pt', 'charlm_backward_file': 'saved_models/charlm/en_1billion_backward_charlm.pt'} 

Forward Char LM Model:  {'train_file': './data/charlm/en_1billion.train.txt', 'train_dir': 'data/charlm/en_1billion.train', 'eval_file': './data/charlm/en_1billion.dev.txt', 'lang': 'en', 'shorthand': 'en_1billion', 'mode': 'train', 'direction': 'forward', 'char_emb_dim': 100, 'char_hidden_dim': 1024, 'char_num_layers': 1, 'char_dropout': 0.05, 'char_unit_dropout': 1e-05, 'char_rec_dropout': 0.0, 'batch_size': 100, 'bptt_size': 250, 'epochs': 10, 'max_grad_norm': 0.25, 'lr0': 20, 'anneal': 0.25, 'patience': 10, 'weight_decay': 0.0, 'momentum': 0.0, 'report_steps': 50, 'save_name': 'en_1billion_forward_1024d_charlm.pt', 'vocab_save_name': None, 'save_dir': 'saved_models/charlm', 'cuda': True, 'cpu': False, 'seed': 1234}

Conclusion

So the checkpoint reports a word embedding dimension of 300, both when loaded via torch.load and in the config restored when resuming training, and its wordvec_file points at 300-d fasttext vectors, while the CoNLL17 word vectors my script loads (en.vectors.xz) are only 100-dimensional.

yuhui-zh15 commented 4 years ago

Oh, I see your point. For most NER models we use the 100-d vectors from CoNLL17. However, for English OntoNotes we use the 300-d fasttext word vectors rather than the 100-d CoNLL17 vectors, as we found this leads to higher performance.
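For anyone hitting the same mismatch, here is a minimal sketch for confirming the dimension of the fasttext vectors before resuming (the local crawl-300d-2M.vec path is an assumption; the file needs to be downloaded from fasttext.cc first):

# check the dimension of a 300-d fasttext pretrain (sketch)

from stanza.models.common.pretrain import Pretrain

# assumed local path to the fasttext vectors named in the ontonotes.pt config
VEC_FILE = '/Users/jackgodfrey/stanza_test/extern_data/fasttext/English/crawl-300d-2M.vec'

# build the same kind of in-memory pretrain that ner_tagger.py / resume_training.py use
pretrain = Pretrain(None, VEC_FILE, 100000, save_to_file=False)

# the second dimension should be 300, matching word_emb_dim in the ontonotes.pt config
print(pretrain.emb.shape)

Note that the pasted resume_training.py always calls utils.get_wordvec_file and ignores --wordvec_file, so resuming with these vectors requires either restoring that check from ner_tagger.py or pointing vec_file at the fasttext file directly.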