juntaoy / biaffine-ner

Named Entity Recognition as Dependency Parsing
Apache License 2.0

Bert Embedding Extraction #8

Open wangxinyu0922 opened 4 years ago

wangxinyu0922 commented 4 years ago

Hi, I'm trying to extract BERT features with extract_bert_features.sh. I find that the token features are extracted at the document level, i.e. the embeddings are generated from a sequence of sentences. Am I right?

juntaoy commented 4 years ago

Yes, that's correct, so it requires you to fix the batches first. E.g. for most CoNLL 2002/2003 data I use the documents (split by '-DOCSTART-') as natural batches and name the doc_keys train_0, train_1, and so on. For Spanish there is no document indicator, so I took 25 sentences as a batch.
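For reference, a minimal sketch of how such document batches and doc_keys could be formed from a CoNLL-format file (the helper name and column handling are assumptions, not code from this repo):

```python
# Hypothetical sketch: group CoNLL-format sentences into document batches,
# splitting on '-DOCSTART-' lines, and name each batch <split>_<index>
# (train_0, train_1, ...) so it can be used as the doc_key.
def conll_document_batches(conll_lines, split="train"):
  docs, sent = [[]], []
  for line in conll_lines:
    if line.startswith("-DOCSTART-"):
      if sent:
        docs[-1].append(sent)
        sent = []
      if docs[-1]:
        docs.append([])
    elif not line.strip():            # blank line = sentence boundary
      if sent:
        docs[-1].append(sent)
        sent = []
    else:
      sent.append(line.split()[0])    # first column is the token
  if sent:
    docs[-1].append(sent)
  return {"{}_{}".format(split, i): doc
          for i, doc in enumerate(d for d in docs if d)}
```

For a corpus without document boundaries (like the Spanish data), the same idea applies with fixed groups of 25 sentences instead of '-DOCSTART-' splits.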

wangxinyu0922 commented 4 years ago

Thank you for the reply. The document-level information is very helpful for the BERT embeddings. In my experiments, using the BERT model with sentence-level input only significantly reduces the F1 score compared with document-level input (it drops to ~91.0 on CoNLL 2003 English NER). Have you tried using BERT embeddings without document input on flat NER (or simply compared with a BiLSTM-CRF model without any contextual embeddings)? Many works use sentence-level input in their experiments (e.g. Flair), while some use document-level input (e.g. BERT).

By the way, does the training/testing code use the document information?

wangxinyu0922 commented 4 years ago

I tried a quick experiment with fastText+char embeddings on both the graph-based (biaffine) approach and the sequence-labeling approach, without tuning the hyper-parameters, and got an 87.97 F1 score with the graph-based method and 90.03 with the sequence-labeling method. Do you have any suggestions about the hyper-parameters to get better results for the graph-based method?

juntaoy commented 4 years ago

So I got the BERT code from a coref system; I think it does make sense for them to use document-level information. For NER I think it won't help too much, but I will try to run one with only sentence-level BERT on OntoNotes. I added an ablation study to the paper, including one that replaces the biaffine layer with a CRF layer; the performance drops around 1% on OntoNotes. Without BERT it does drop a lot, so it is not about the hyperparameters :)

juntaoy commented 4 years ago

Hi, I did some experiments to confirm that the document information (or maybe call it cross-sentence information, since only a 64-token window on each side is used) is very helpful, and even putting random sentences together helps :) I first evaluated with embeddings from single sentences only and got 89.7 on OntoNotes, a 1.6 reduction compared to the document-level version.
I then shuffled the sentences within the train/dev/test sets, used the shuffled sentences to form batches of up to 1000 words, and applied the same cross-sentence method for extracting BERT embeddings; this got 90.4%, roughly halfway between the sentence-only and document-batch versions. I think it is mainly because the sentences are relatively short (for OntoNotes the average sentence length is 17), so BERT embeddings might not work so well on short sentences; hence by putting random sentences together you still gain a lot.
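For concreteness, a small sketch of forming such word-budgeted batches from shuffled sentences (the function name and seed handling are assumptions, not the repo's code):

```python
import random

# Hypothetical sketch: shuffle the sentences within a split and pack them into
# batches of at most `max_words` words; each batch is then treated like a
# document for cross-sentence BERT extraction.
def shuffled_word_batches(sentences, max_words=1000, seed=0):
  sentences = list(sentences)
  random.Random(seed).shuffle(sentences)
  batches, batch, n_words = [], [], 0
  for sent in sentences:
    if batch and n_words + len(sent) > max_words:
      batches.append(batch)
      batch, n_words = [], 0
    batch.append(sent)
    n_words += len(sent)
  if batch:
    batches.append(batch)
  return batches
```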

wangxinyu0922 commented 4 years ago

Thank you for the reply. Using such a technique is very helpful for improving the strength of the BERT embeddings. However, the first problem is that the process might be impractical, since document-level BERT extraction takes many times longer than extracting BERT for single sentences, roughly in proportion to the number of sentences per document (for example, on CoNLL English NER, about 30 sentences on average). The second problem, which I am mainly concerned about, is that it is not a fair comparison between your graph-parsing based approach and previous work based on sequence labeling: most previous work uses sentence-level embedding extraction, while the graph-parsing approach uses a totally different extraction strategy. So it is not clear whether the graph-parsing approach is really stronger than the sequence-labeling approach (otherwise, only the embedding extraction strategy helps the performance, not the method). For a deeper comparison between these approaches, here are the settings I am concerned about:

  1. BERT+fastText+Char+BiLSTM+CRF (sequence labeling) vs. BERT+fastText+Char+BiLSTM+Parsing, with sentence-level BERT extraction.
  2. BERT+fastText+Char+BiLSTM+CRF (sequence labeling) vs. BERT+fastText+Char+BiLSTM+Parsing, with document-level BERT extraction.
  3. Flair+BERT+fastText+Char+BiLSTM+CRF (sequence labeling) vs. Flair+BERT+fastText+Char+BiLSTM+Parsing, with the same BERT extraction strategy for both methods (in my previous experiments, Flair embeddings significantly improve sequence labeling but improve parsing only moderately).
  4. fastText+Char+BiLSTM+CRF (sequence labeling) vs. fastText+Char+BiLSTM+Parsing, without contextual embeddings.

Do you have results for these experiment settings? I think they are necessary to show the advantage of the graph-parsing approach.

juntaoy commented 4 years ago

So for your first concern, I've evaluated another version of document-level BERT embedding extraction which is much faster: by setting the window to 511 and the stride to 255, you can run the extraction about 70x faster (30 minutes vs. 35 hours on all of OntoNotes); it is also faster than the sentence-level version's 2 hours (all done on one GTX 1080 Ti GPU) :) and you get almost the same results, 91.0 vs. 91.3.
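The sliding-window layout can be sketched independently of the extraction script itself (a rough illustration, not the repo's code; in practice each token would take its representation from the window in which it has the most surrounding context):

```python
# Hypothetical sketch of the sliding-window layout only (no BERT calls):
# emit (start, end) word-piece windows of size `window`, moving by `stride`,
# so every token is covered with extra context on at least one side.
def sliding_windows(num_wordpieces, window=511, stride=255):
  windows, start = [], 0
  while True:
    end = min(start + window, num_wordpieces)
    windows.append((start, end))
    if end == num_wordpieces:
      break
    start += stride
  return windows

# e.g. sliding_windows(1200) -> [(0, 511), (255, 766), (510, 1021), (765, 1200)]
```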

For your second concern, the short answer is that the biaffine is "really" :) stronger than the CRF: in the ablation of our paper we did your experiment No. 2, and biaffine is 0.8 better using the same document-level BERT. In terms of fair comparison, the document information is always in the dataset; previous work didn't use it only because they hadn't tried. So I would say it is a fair comparison, as I didn't use any additional resources. It is also practical: in a real-world case you are more likely to be given documents than single sentences to extract NEs from. I will try to find some time this week to run 1 and 4 for the CRF; I am not sure about 3, as I am not familiar with Flair :)

wangxinyu0922 commented 4 years ago

Thank you for your explanation.

For the BERT embeddings, I think the new version is great for document-level extraction. I will try it if I work on this topic :).

For sequence labeling, could you try the CRF model with a single BiLSTM layer instead of 3 layers? A single-layer BiLSTM is the more usual setting for sequence labeling, while 3 layers is a usual setting for parsing. By the way, could you also run setting 4 on CoNLL English NER, so that we can compare with the result I posted above? I think these results without contextual embeddings will be very helpful for future research.

For document-level vs. sentence-level, I think both settings are practical, but I need to make sure the different approaches use the same setting in the comparison.

juntaoy commented 4 years ago

Hmm, we don't have the resources to run that many experiments; we only have 7 GPUs available :) shared by 10+ researchers :( But you are free to run it yourself. Here is the code that allows you to switch off any of the embeddings and switch to a CRF:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os,time,json,threading
import random
import numpy as np
import tensorflow as tf
import h5py

import util

class BiaffineNERModel():
  def __init__(self, config):
    self.config = config
    self.context_embeddings = util.EmbeddingDictionary(config["context_embeddings"])
    self.context_embeddings_size = self.context_embeddings.size

    self.char_embedding_size = config["char_embedding_size"]
    self.char_dict = util.load_char_dict(config["char_vocab_path"])

    if 'without_bert' in self.config and self.config["without_bert"]:
      self.lm_file = None
    else:
      self.lm_file = h5py.File(self.config["lm_path"], "r")
    self.lm_layers = self.config["lm_layers"]
    self.lm_size = self.config["lm_size"]
    self.flat_ner = 'flat_ner' in self.config and self.config['flat_ner']
    self.use_crf = self.flat_ner and 'use_crf' in self.config and self.config['use_crf']

    self.eval_data = None  # Load eval data lazily.
    self.ner_types = self.config['ner_types']
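    # NER type i is mapped to label i + 1; label 0 is reserved for non-entity spans.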
    self.ner_maps = {ner: (i + 1) for i, ner in enumerate(self.ner_types)}
    self.num_types = len(self.ner_types)
    self.use_ffnn = 'max_mention_width' in self.config and self.config['max_mention_width'] > 0
    self.use_meaning = 'meaning_key' in self.config and self.config['meaning_key']
    self.use_meaning_as_feature = 'use_meaning_as_feature' in self.config and self.config['use_meaning_as_feature']
    if self.use_meaning:
      self.meaning_key = self.config['meaning_key']
      self.meaning_json = self.config['meaning_path']

    input_props = []
    input_props.append((tf.string, [None, None]))  # Tokens.
    input_props.append((tf.float32, [None, None, self.context_embeddings_size]))  # Context embeddings.
    input_props.append((tf.float32, [None, None, self.lm_size, self.lm_layers]))  # LM embeddings.
    input_props.append((tf.int32, [None, None, None]))  # Character indices.
    input_props.append((tf.int32, [None]))  # Text lengths.
    input_props.append((tf.bool, []))  # Is training.
    input_props.append((tf.int32, [None]))  # Gold NER Label
    input_props.append((tf.float32,[self.num_types+1,self.lm_size]))# tag meaning

    self.queue_input_tensors = [tf.placeholder(dtype, shape) for dtype, shape in input_props]
    dtypes, shapes = zip(*input_props)
    queue = tf.PaddingFIFOQueue(capacity=10, dtypes=dtypes, shapes=shapes)
    self.enqueue_op = queue.enqueue(self.queue_input_tensors)
    self.input_tensors = queue.dequeue()

    self.predictions, self.loss = self.get_predictions_and_loss(self.input_tensors)
    self.global_step = tf.Variable(0, name="global_step", trainable=False)
    self.reset_global_step = tf.assign(self.global_step, 0)
    learning_rate = tf.train.exponential_decay(self.config["learning_rate"], self.global_step,
                                               self.config["decay_frequency"], self.config["decay_rate"],
                                               staircase=True)
    trainable_params = tf.trainable_variables()
    gradients = tf.gradients(self.loss, trainable_params)
    gradients, _ = tf.clip_by_global_norm(gradients, self.config["max_gradient_norm"])
    optimizers = {
      "adam": tf.train.AdamOptimizer,
      "sgd": tf.train.GradientDescentOptimizer
    }
    optimizer = optimizers[self.config["optimizer"]](learning_rate)
    self.train_op = optimizer.apply_gradients(zip(gradients, trainable_params), global_step=self.global_step)

  def start_enqueue_thread(self, session):
    with open(self.config["train_path"]) as f:
      train_examples = [json.loads(jsonline) for jsonline in f.readlines()]

    def _enqueue_loop():
      while True:
        random.shuffle(train_examples)
        for example in train_examples:
          tensorized_example = self.tensorize_example(example, is_training=True)
          feed_dict = dict(zip(self.queue_input_tensors, tensorized_example))
          session.run(self.enqueue_op, feed_dict=feed_dict)
    enqueue_thread = threading.Thread(target=_enqueue_loop)
    enqueue_thread.daemon = True
    enqueue_thread.start()

  def restore(self, session):
    # Don't try to restore unused variables from the TF-Hub ELMo module.
    vars_to_restore = [v for v in tf.global_variables() if "module/" not in v.name]
    saver = tf.train.Saver(vars_to_restore)
    checkpoint_path = os.path.join(self.config["log_dir"], "model.max.ckpt")
    print("Restoring from {}".format(checkpoint_path))
    session.run(tf.global_variables_initializer())
    saver.restore(session, checkpoint_path)

  def load_lm_embeddings(self, doc_key):
    if self.lm_file is None:
      return np.zeros([0, 0, self.lm_size, self.lm_layers])
    file_key = doc_key.replace("/", ":")
    if not file_key in self.lm_file and file_key[:-2] in self.lm_file:
      file_key = file_key[:-2]
    group = self.lm_file[file_key]
    num_sentences = len(list(group.keys()))
    sentences = [group[str(i)][...] for i in range(num_sentences)]
    lm_emb = np.zeros([num_sentences, max(s.shape[0] for s in sentences), self.lm_size, self.lm_layers])
    for i, s in enumerate(sentences):
      lm_emb[i, :s.shape[0], :, :] = s
    return lm_emb

  def tensorize_example(self, example, is_training):
    ners = example["ners"]
    sentences = example["sentences"]

    max_sentence_length = max(len(s) for s in sentences)
    max_word_length = max(max(max(len(w) for w in s) for s in sentences), max(self.config["filter_widths"]))
    text_len = np.array([len(s) for s in sentences])
    tokens = [[""] * max_sentence_length for _ in sentences]
    char_index = np.zeros([len(sentences), max_sentence_length, max_word_length])
    context_word_emb = np.zeros([len(sentences), max_sentence_length, self.context_embeddings_size])
    lemmas = []
    if "lemmas" in example:
      lemmas = example["lemmas"]
    for i, sentence in enumerate(sentences):
      for j, word in enumerate(sentence):
        tokens[i][j] = word
        if self.context_embeddings.is_in_embeddings(word):
          context_word_emb[i, j] = self.context_embeddings[word]
        elif lemmas and self.context_embeddings.is_in_embeddings(lemmas[i][j]):
          context_word_emb[i,j] = self.context_embeddings[lemmas[i][j]]
        char_index[i, j, :len(word)] = [self.char_dict[c] for c in word]

    tokens = np.array(tokens)

    doc_key = example["doc_key"]

    lm_emb = self.load_lm_embeddings(doc_key)

    gold_labels = []
    if is_training:
      if self.use_crf:
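        # BIO-style encoding of the gold spans for the CRF: tag t in [1, num_types]
        # marks the first token of a mention of type t, tag t + num_types marks its
        # continuation, and 0 marks tokens outside any mention.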
        gold_labels = np.zeros([len(sentences),max_sentence_length],dtype=np.int32)
        for sid,sent in enumerate(sentences):
          ner = {(s, e): self.ner_maps[t] for s, e, t in ners[sid]}
          for s,e in ner:
            lb = ner[(s,e)]
            li = lb+self.num_types
            gold_labels[sid,s] = lb
            for i in xrange(s+1,e+1):
              gold_labels[sid,i] = li
        gold_labels = np.reshape(gold_labels,[len(sentences)*max_sentence_length])
      elif self.use_ffnn:
        for sid,sent in enumerate(sentences):
          ner = {(s,e):self.ner_maps[t] for s,e,t in ners[sid]}
          for s in xrange(len(sent)):
            for e in xrange(s, min(len(sent),s+self.config['max_mention_width'])):
              gold_labels.append(ner.get((s,e),0))
        gold_labels = np.array(gold_labels)
      else:
        for sid, sent in enumerate(sentences):
          ner = {(s,e):self.ner_maps[t] for s,e,t in ners[sid]}
          for s in xrange(len(sent)):
            for e in xrange(s,len(sent)):
              gold_labels.append(ner.get((s,e),0))
        gold_labels = np.array(gold_labels)
    else:
      gold_labels = np.array(gold_labels)

    ner_meaning = np.zeros((self.num_types+1,self.lm_size))
    if self.use_meaning:
      meaning_map = {}
      for line in open(self.meaning_json):
        doc = json.loads(line)
        if doc['doc_key'] == self.meaning_key:
          labels = doc['labels']
          for i, value in enumerate(doc['values']):
            meaning_map[labels[i]] = value
          break
      assert len(meaning_map) == self.num_types
      for label, i in self.ner_maps.items():
        ner_meaning[i] = meaning_map[label]

    example_tensors = (tokens, context_word_emb,lm_emb, char_index, text_len, is_training, gold_labels,ner_meaning)

    return example_tensors

  def get_dropout(self, dropout_rate, is_training):
    return 1 - (tf.to_float(is_training) * dropout_rate)

  def lstm_contextualize(self, text_emb, text_len, lstm_dropout):
    num_sentences = tf.shape(text_emb)[0]

    current_inputs = text_emb  # [num_sentences, max_sentence_length, emb]
    for layer in range(self.config["contextualization_layers"]):
      with tf.variable_scope("layer_{}".format(layer), reuse=tf.AUTO_REUSE):
        with tf.variable_scope("fw_cell"):
          cell_fw = util.CustomLSTMCell(self.config["contextualization_size"], num_sentences, lstm_dropout)
        with tf.variable_scope("bw_cell"):
          cell_bw = util.CustomLSTMCell(self.config["contextualization_size"], num_sentences, lstm_dropout)
        state_fw = tf.contrib.rnn.LSTMStateTuple(tf.tile(cell_fw.initial_state.c, [num_sentences, 1]),
                                                 tf.tile(cell_fw.initial_state.h, [num_sentences, 1]))
        state_bw = tf.contrib.rnn.LSTMStateTuple(tf.tile(cell_bw.initial_state.c, [num_sentences, 1]),
                                                 tf.tile(cell_bw.initial_state.h, [num_sentences, 1]))

        (fw_outputs, bw_outputs), ((_, fw_final_state), (_, bw_final_state)) = tf.nn.bidirectional_dynamic_rnn(
          cell_fw=cell_fw,
          cell_bw=cell_bw,
          inputs=current_inputs,
          sequence_length=text_len,
          initial_state_fw=state_fw,
          initial_state_bw=state_bw)

        text_outputs = tf.concat([fw_outputs, bw_outputs], 2)  # [num_sentences, max_sentence_length, emb]
        text_outputs = tf.nn.dropout(text_outputs, lstm_dropout)
        if layer > 0:
          highway_gates = tf.sigmoid(
            util.projection(text_outputs, util.shape(text_outputs, 2)))  # [num_sentences, max_sentence_length, emb]
          text_outputs = highway_gates * text_outputs + (1 - highway_gates) * current_inputs
        current_inputs = text_outputs

    return text_outputs

  def get_predictions_and_loss(self, inputs):
    tokens, context_word_emb, lm_emb, char_index, text_len, is_training, gold_labels,ner_meaning = inputs
    self.dropout = self.get_dropout(self.config["dropout_rate"], is_training)
    self.lexical_dropout = self.get_dropout(self.config["lexical_dropout_rate"], is_training)
    self.lstm_dropout = self.get_dropout(self.config["lstm_dropout_rate"], is_training)

    num_sentences = tf.shape(tokens)[0]
    max_sentence_length = tf.shape(tokens)[1]
    num_tokens = tf.reduce_sum(text_len)

    context_emb_list = []
    if 'without_fasttext' in self.config and self.config["without_fasttext"]:
      print('----------not using fasttext embeddings')
    else:
      context_emb_list.append(context_word_emb)

    if 'without_char' in self.config and self.config['without_char']:
      print('----------not using char embeddings')
    else:
      char_emb = tf.gather(tf.get_variable("char_embeddings", [len(self.char_dict), self.config["char_embedding_size"]]), char_index) # [num_sentences, max_sentence_length, max_word_length, emb]
      flattened_char_emb = tf.reshape(char_emb, [num_sentences * max_sentence_length, util.shape(char_emb, 2), util.shape(char_emb, 3)]) # [num_sentences * max_sentence_length, max_word_length, emb]
      flattened_aggregated_char_emb = util.cnn(flattened_char_emb, self.config["filter_widths"], self.config["filter_size"]) # [num_sentences * max_sentence_length, emb]
      aggregated_char_emb = tf.reshape(flattened_aggregated_char_emb, [num_sentences, max_sentence_length, util.shape(flattened_aggregated_char_emb, 1)]) # [num_sentences, max_sentence_length, emb]
      context_emb_list.append(aggregated_char_emb)

    if 'without_bert' in self.config and self.config["without_bert"]:
      print('----------not using BERT')
    else:
      lm_emb_size = util.shape(lm_emb, 2)
      lm_num_layers = util.shape(lm_emb, 3)
      with tf.variable_scope("lm_aggregation"):
        self.lm_weights = tf.nn.softmax(tf.get_variable("lm_scores", [lm_num_layers], initializer=tf.constant_initializer(0.0)))
        self.lm_scaling = tf.get_variable("lm_scaling", [], initializer=tf.constant_initializer(1.0))

      flattened_lm_emb = tf.reshape(lm_emb, [num_sentences * max_sentence_length * lm_emb_size, lm_num_layers])
      flattened_aggregated_lm_emb = tf.matmul(flattened_lm_emb, tf.expand_dims(self.lm_weights, 1)) # [num_sentences * max_sentence_length * emb, 1]
      aggregated_lm_emb = tf.reshape(flattened_aggregated_lm_emb, [num_sentences, max_sentence_length, lm_emb_size])
      aggregated_lm_emb *= self.lm_scaling
      context_emb_list.append(aggregated_lm_emb)

    context_emb = tf.concat(context_emb_list, 2) # [num_sentences, max_sentence_length, emb]
    context_emb = tf.nn.dropout(context_emb, self.lexical_dropout) # [num_sentences, max_sentence_length, emb]

    text_len_mask = tf.sequence_mask(text_len, maxlen=max_sentence_length) # [num_sentence, max_sentence_length]

    candidate_scores_mask = tf.logical_and(tf.expand_dims(text_len_mask,[1]),tf.expand_dims(text_len_mask,[2])) #[num_sentence, max_sentence_length,max_sentence_length]
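    # Additionally restrict candidate spans to those whose end index is >= start index.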
    sentence_ends_leq_starts = tf.tile(tf.expand_dims(tf.logical_not(tf.sequence_mask(tf.range(max_sentence_length),max_sentence_length)), 0),[num_sentences,1,1]) #[num_sentence, max_sentence_length,max_sentence_length]
    candidate_scores_mask = tf.logical_and(candidate_scores_mask,sentence_ends_leq_starts)

    flattened_candidate_scores_mask = tf.reshape(candidate_scores_mask,[-1]) #[num_sentence * max_sentence_length * max_sentence_length]

    context_outputs = self.lstm_contextualize(context_emb, text_len,self.lstm_dropout) # [num_sentence, max_sentence_length, emb]

    if self.use_crf:
      print('--------use crf')
      gold_labels = tf.reshape(gold_labels,[num_sentences,max_sentence_length])
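      # Per-token logits over 2*num_types + 1 tags: outside, plus a start and a continuation tag per type.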
      logits = util.projection(context_outputs,self.num_types*2+1)
      log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(logits,gold_labels,text_len)
      loss = tf.reduce_mean(-log_likelihood)
      candidate_ner_labels, viterbi_score = tf.contrib.crf.crf_decode(logits, transition_params,text_len)
      return candidate_ner_labels, loss
    else:
      print('---------use biaffine')
      with tf.variable_scope("candidate_starts_ffnn"):
        candidate_starts_emb = util.projection(context_outputs,self.config["ffnn_size"]) #[num_sentences, max_sentences_length,emb]
      with tf.variable_scope("candidate_ends_ffnn"):
        candidate_ends_emb = util.projection(context_outputs,self.config["ffnn_size"]) #[num_sentences, max_sentences_length, emb]

      candidate_ner_scores = util.bilinear_classifier(candidate_starts_emb,candidate_ends_emb,self.dropout,output_size=self.num_types+1,use_meaning = self.use_meaning,use_meaning_only=self.use_meaning_as_feature, meaning=ner_meaning)#[num_sentence, max_sentence_length,max_sentence_length,types+1]
      candidate_ner_scores = tf.boolean_mask(tf.reshape(candidate_ner_scores,[-1,self.num_types+1]),flattened_candidate_scores_mask)

      loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=gold_labels, logits=candidate_ner_scores)
      loss = tf.reduce_sum(loss)

      return candidate_ner_scores, loss

  def get_crf_ner(self,sentences,candidate_labels):
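    # Decode the predicted tag sequence back into mention spans: tags 1..num_types
    # start a mention of that type, tags num_types+1..2*num_types continue it, 0 is outside.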
    pred_mentions = set()
    for sid, sent in enumerate(sentences):
      pre = 0
      start = -1
      for i in xrange(len(sent)):
        l = candidate_labels[sid,i]
        li = l if l <=self.num_types else l-self.num_types
        if li != pre or l <= self.num_types: #change of the label or B- labels
          if pre > 0:
            assert start >=0
            pred_mentions.add((sid,start,i-1,pre))
          pre = li
          start = i if li > 0 else -1
    return pred_mentions

  def get_pred_ner(self, sentences, span_scores, is_flat_ner):
    candidates = []
    for sid,sent in enumerate(sentences):
      for s in xrange(len(sent)):
        stop_ind = min(len(sent),s+self.config["max_mention_width"]) if self.use_ffnn else len(sent)
        for e in xrange(s,stop_ind):
          candidates.append((sid,s,e))

    top_spans = [[] for _ in xrange(len(sentences))]
    for i, type in enumerate(np.argmax(span_scores,axis=1)):
      if type > 0:
        sid, s,e = candidates[i]
        top_spans[sid].append((s,e,type,span_scores[i,type]))

    top_spans = [sorted(top_span,reverse=True,key=lambda x:x[3]) for top_span in top_spans]
    sent_pred_mentions = [[] for _ in xrange(len(sentences))]
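    # Greedily accept spans in descending score order; the for/else below keeps a span
    # only if it does not clash (or, for flat NER, nest) with an already accepted span.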
    for sid, top_span in enumerate(top_spans):
      for ns,ne,t,_ in top_span:
        for ts,te,_ in sent_pred_mentions[sid]:
          if ns < ts <= ne < te or ts < ns <= te < ne:
            #for both nested and flat ner no clash is allowed
            break
          if is_flat_ner and (ns <= ts <= te <= ne or ts <= ns <= ne <= te):
            #for flat ner nested mentions are not allowed
            break
        else:
          sent_pred_mentions[sid].append((ns,ne,t))
    pred_mentions = set((sid,s,e,t) for sid, spr in enumerate(sent_pred_mentions) for s,e,t in spr)
    return pred_mentions

  def load_eval_data(self):
    if self.eval_data is None:
      def load_line(line):
        example = json.loads(line)
        return self.tensorize_example(example, is_training=False), example

      with open(self.config["eval_path"]) as f:
        self.eval_data = [load_line(l) for l in f.readlines()]

      print("Loaded {} eval examples.".format(len(self.eval_data)))

  def evaluate(self, session, is_final_test=False):
    self.load_eval_data()

    tp,fn,fp = 0,0,0
    start_time = time.time()
    num_words = 0
    sub_tp,sub_fn,sub_fp = [0] * self.num_types,[0]*self.num_types, [0]*self.num_types

    is_flat_ner = 'flat_ner' in self.config and self.config['flat_ner']

    for example_num, (tensorized_example, example) in enumerate(self.eval_data):
      feed_dict = {i:t for i,t in zip(self.input_tensors, tensorized_example)}
      candidate_ner_scores = session.run(self.predictions, feed_dict=feed_dict)

      num_words += sum(len(tok) for tok in example["sentences"])

      gold_ners = set([(sid,s,e, self.ner_maps[t]) for sid, ner in enumerate(example['ners']) for s,e,t in ner])
      if self.use_crf:
        pred_ners  = self.get_crf_ner(example["sentences"],candidate_ner_scores)
      else:
        pred_ners = self.get_pred_ner(example["sentences"], candidate_ner_scores,is_flat_ner)

      tp += len(gold_ners & pred_ners)
      fn += len(gold_ners - pred_ners)
      fp += len(pred_ners - gold_ners)

      if is_final_test:
        for i in xrange(self.num_types):
          sub_gm = set((sid,s,e) for sid,s,e,t in gold_ners if t ==i+1)
          sub_pm = set((sid,s,e) for sid,s,e,t in pred_ners if t == i+1)
          sub_tp[i] += len(sub_gm & sub_pm)
          sub_fn[i] += len(sub_gm - sub_pm)
          sub_fp[i] += len(sub_pm - sub_gm)

      if example_num % 10 == 0:
        print("Evaluated {}/{} examples.".format(example_num + 1, len(self.eval_data)))

    used_time = time.time() - start_time
    print("Time used: %d second, %.2f w/s " % (used_time, num_words*1.0/used_time))

    m_r = 0 if tp == 0 else float(tp)/(tp+fn)
    m_p = 0 if tp == 0 else float(tp)/(tp+fp)
    m_f1 = 0 if m_p == 0 else 2.0*m_r*m_p/(m_r+m_p)

    print("Mention F1: {:.2f}%".format(m_f1*100))
    print("Mention recall: {:.2f}%".format(m_r*100))
    print("Mention precision: {:.2f}%".format(m_p*100))
    print("{:.1f}&{:.1f}&{:.1f}".format(m_p*100,m_r*100,m_f1*100))

    if is_final_test:
      print("****************SUB NER TYPES********************")
      for i in xrange(self.num_types):
        sub_r = 0 if sub_tp[i] == 0 else float(sub_tp[i]) / (sub_tp[i] + sub_fn[i])
        sub_p = 0 if sub_tp[i] == 0 else float(sub_tp[i]) / (sub_tp[i] + sub_fp[i])
        sub_f1 = 0 if sub_p == 0 else 2.0 * sub_r * sub_p / (sub_r + sub_p)

        print("{} F1: {:.2f}%".format(self.ner_types[i],sub_f1 * 100))
        print("{} recall: {:.2f}%".format(self.ner_types[i],sub_r * 100))
        print("{} precision: {:.2f}%".format(self.ner_types[i],sub_p * 100))

    summary_dict = {}
    summary_dict["Mention F1"] = m_f1
    summary_dict["Mention recall"] = m_r
    summary_dict["Mention precision"] = m_p

    return util.make_summary(summary_dict), m_f1

wangxinyu0922 commented 4 years ago

> I tried a quick experiment with fastText+char embeddings on both the graph-based (biaffine) approach and the sequence-labeling approach, without tuning the hyper-parameters, and got an 87.97 F1 score with the graph-based method and 90.03 with the sequence-labeling method. Do you have any suggestions about the hyper-parameters to get better results for the graph-based method?

Thank you for the code, and I'm sorry about your limited GPU resources. For sequence-labeling NER I can run my own code, but I'm not sure I can train a good graph-parsing NER model without contextual embeddings, since the results I posted above show inferior performance for graph-parsing NER compared to sequence-labeling NER.

juntaoy commented 4 years ago

I see what you mean. Did you follow the eng_conll03 config in experiments.conf, i.e. train on the train+dev set for 80k steps? I would expect 90+ for biaffine as well, since on OntoNotes it only drops 2.4 by removing BERT; a 5% drop is a bit too much :) Since we don't use a dev set here, you might want to try a few different settings for max_step, say 20k, 40k, etc. Without BERT you have fewer parameters, so it might be better to use fewer training steps. You could also try early stopping, e.g. terminate if the training loss does not drop for a few epochs.

juntaoy commented 4 years ago

I've done the CoNLL 03 English experiment using biaffine without BERT, with the same 80k steps as for the one with BERT, and I got 90.7% F1; not sure why you got a much lower result :( The OntoNotes experiments are still running; I will let you know once I get the results.

wangxinyu0922 commented 4 years ago

I reran the experiment with the biaffine parser on CoNLL 03 English and got 91.41 F1 on average over two trials. The score is 0.7 higher than yours, and the key point for the improvement is that I use a batch size of 32 sentences rather than document-level batches. In comparison, I ran a BiLSTM-CRF model with the same word and char embeddings and got 91.67 F1 on average. So I still cannot conclude that, without BERT embeddings, the biaffine NER is stronger than CRF NER. Do you have any suggestions?

juntaoy commented 4 years ago

I see; I'm not sure about CoNLL 03, but the picture is very clear on OntoNotes (a more complex task with more data):

- without BERT, CRF gets 87.93 and biaffine gets 88.72 (0.79 difference)
- with sentence-level BERT, CRF gets 88.81 and biaffine gets 89.74 (0.93 difference)
- with document-level BERT, CRF gets 90.18 vs. 91.29 for biaffine (1.11 difference)

I think this might be because CoNLL 03 is simple enough to be solved by a CRF, so biaffine does not gain any improvement there. One thing you might want to try is undersampling, as biaffine gets far more negative examples than a CRF. You can find the details of how to do undersampling via masks at this link.
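I don't know the exact code behind that link, but the idea could look roughly like this (a numpy sketch with hypothetical names, not the referenced implementation): keep every positive span and only a random fraction of the negative spans when summing the span-classification loss.

```python
import numpy as np

# Hypothetical sketch of negative-span undersampling: keep all gold (positive)
# spans and only a random `keep_prob` fraction of the negative spans.
def undersample_mask(gold_labels, keep_prob=0.2, seed=0):
  rng = np.random.RandomState(seed)
  gold_labels = np.asarray(gold_labels)
  positive = gold_labels > 0                       # label 0 = non-entity span
  keep_negative = rng.rand(gold_labels.shape[0]) < keep_prob
  return positive | keep_negative                  # boolean mask over candidate spans

# Usage idea: loss = (per_span_loss * undersample_mask(gold_labels)).sum()
```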

wangxinyu0922 commented 4 years ago

I ran the CRF model on OntoNotes without BERT (to double-check the consistency of the dataset: my OntoNotes has 59924 train + 8528 dev + 8262 test sentences) and got an 88.3 F1 score averaged over 3 runs, which shows that the biaffine parser is stronger than the CRF model, but not as much stronger as you reported. I think the problem is that in your experiment you use a 3-layer BiLSTM + Adam for training the CRF, but the most usual setting for a CRF is a single-layer BiLSTM + SGD (note that I pointed this out before) (Yang et al., 2018). I'm still running sentence-level BERT for further comparison.

In general, I now believe that biaffine-ner is better than CRF models. However, some experiments in the paper confuse me and are possibly not a fair comparison with previous work:

  1. Document-level BERT embeddings vs. previous work with sentence-level BERT embeddings.
  2. Not a proper ablation study for the CRF model (Adam + 3-layer BiLSTM vs. SGD + 1-layer BiLSTM).
  3. Possibly inferior performance on the simpler flat NER of the CoNLL 02/03 datasets (as I have shown on CoNLL 03).

Anyway, I'm glad to see that NER as parsing works on several tasks, despite some exceptions. I can try more experiments on this topic now.

juntaoy commented 4 years ago

I see, so it does get a bit better if you use a 1-layer BiLSTM + SGD. However, don't forget that the main point of this paper is not only flat NER but also nested NER. The BiLSTM-CRF model doesn't support nested NER (at least it is not good at it :)). Also, extensive parameter tuning is not the goal of this paper; we didn't do any parameter tuning on any dataset and took most parameters directly from the coref system of Kantor and Globerson (2019). So if you do a grid search on CoNLL03 I am sure you will get better results.

To answer your questions:

Firstly, the original BERT paper also uses document-level embeddings for NER; in section 5.3 it is written: "In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF".

Secondly, before the contextual-embedding age, Strubell et al. (2017) already used a CNN over the document level and pointed out that document-level information is helpful.

Thirdly, if you think using document-level info is not a fair comparison, how about Strakova et al. (2019)? They use three contextual embeddings (BERT, ELMo and Flair); is that not fair according to your judgement?

  1. The ablation study is designed to understand the individual components of the system; if you change other parameters it is not an ablation anymore, and you will never be sure where the difference comes from. It is clear in the paper that I only changed the biaffine to a CRF, and I only claim that in the current setting biaffine is better than the CRF. If you want to do an extensive comparison you might want to do something like Yang et al. (2018) :), which itself is a very good paper, and that is far beyond the scope of our paper. Also, the reason your CRF system has better performance might not only be the use of SGD and 1 layer; other parameters might also differ. There is another paper mentioned in Yang et al. (2018), Reimers and Gurevych (2017), which observed a totally different story from Yang et al. (2018): they find Adam is much better than SGD, that 3 stacked LSTM layers work as well as the 1-layer version, and that 2 layers is best :) In fact not everyone uses SGD for CRF-based architectures; e.g. Strubell et al. (2017) chose Adam after doing a grid search.

  2. The so-called possible inferior performance is when you evaluate without BERT and also without a grid search for biaffine. As mentioned in the paper, our result on CoNLL03 English is only slightly better (0.1) than previous work, so we are not misleading people by claiming our biaffine is much better on that particular dataset. One reason why our system didn't outperform the other systems on that dataset might be that many systems are tuned on the CoNLL03 English data, but we didn't tune ours.

wangxinyu0922 commented 4 years ago

Certainly, I always believed the biaffine parser is great for nested NER, but I need to make sure it works for flat NER as well, so I mainly talk about flat NER here. For your comments:

  1. For document-level vs. sentence-level, I think they are really different topics, and both are practical. If you want to tag documents, taking more sentences certainly helps and is a proper way to do it. For sentence-level, the most usual usage is online serving for users (e.g. search engines, inputs from customers), where the inputs are mostly single sentences. Using either sentence-level or document-level is fine for a study, but I think it should be made clearer in the paper, since document-level and sentence-level are different kinds of information/input (multiple-sentence input [w_1, ..., w_n] vs. single-sentence input [w_i]).

For the embeddings, I think you can also concatenate more embeddings (though contextual character embeddings may not help parsing methods, according to my experiments) or simply use BERT embeddings only, since different embeddings have different advantages depending on the approach (Flair is significantly stronger than BERT for sequence-labeling based approaches (Akbik et al., 2018)). And in fact, BERT embeddings for sequence-labeling based approaches can be extracted at the document level as well. For me, I want to see a fair comparison between the previous state of the art (https://www.aclweb.org/anthology/P19-1527/) and the biaffine parser. The benefit of an embedding probably depends on your network.

Therefore, comparing the embeddings issue and the document-/sentence-level issue, I think the latter is more important: the embeddings only change your network, not the inputs. Though this is a problem with a lot of previous work, including the great BERT work. Again, since we are comparing model architectures, I need a fair comparison of the two (decoding) approaches on flat NER with the same kind of input, the same kind of output and even the same kind of embeddings (though this last part is not essential). By the way, if you think different embeddings make the comparison unfair, then on top of the input style the comparison becomes even more unfair :)

  1. It may be a reason, but I have shown that a single-layer BiLSTM + SGD does better than your CRF setting here. So the sequence-labeling approach can do better in the comparison, and it is the usual setup in recent work.

  2. The hyper-parameters are the same across most sequence-labeling tasks. For the biaffine parser, your hyper-parameters would also be good if we applied them directly to dependency parsing datasets like PTB, so I think tuning the hyper-parameters (or a grid search) may not affect the accuracy significantly. Even though the biaffine parser improves CoNLL 03 NER by only 0.1, people will take it as a new state of the art based on that score, so we must treat the scores more seriously and carefully in the comparison.


In conclusion, document-level and sentence-level are totally different inputs for the task and we need to state this clearly, while the embeddings only affect the model architecture. Though this is a problem in a lot of previous work, I need to clarify it because I want fair settings and comparisons (as I see them) in my own work. In fact, I don't want to find that the biaffine parser does not work on NER tasks, because I cannot build further work on it if it works badly. So currently I believe it does well on the harder flat NER task (OntoNotes) and on nested NER tasks.

Thank you for helping me understand your great work more clearly, and again I'm very glad to see it works well.

juntaoy commented 4 years ago

Thanks for pointing out that your focus is actually the more practical use of the system; I now understand why you think they are so different. I was only focused on the dataset itself: since the document info is included in the CoNLL03 dataset, I was convinced that we could use document-level info without making the comparison with previous work unfair. Apparently you are more concerned with the practical point of view. You are certainly right that the applications of sentence-level NER and document-level NER are different, so if you want to use it in things like a search engine then you can only do sentence-level. Also, due to the large number of predictions required by such an application, you might not want to use BERT at all, as it is expensive to compute.

For Flair, as you might notice in my paper, the very impressive German result they reported is a bit misleading: they used a different version of the data, so it is not comparable to previous work. Oh, in fact, you pointed out that question yourself :)

By the way, I double-checked the corpus we used for OntoNotes: we actually didn't remove the NT part (the portion without NER annotation) from train/dev/test, as most of the work we compare with didn't specifically mention removing it, so this has a (not too large) negative impact on our results. I trained the CRF on the same version of the train/dev/test sets as yours and the system achieved 88.22, very close to your 1-layer + SGD setting. The score for biaffine also improved, to 89.0, so the benefit of the biaffine is 0.78. Sorry, I don't have the resources to run it multiple times, but the score difference between runs on OntoNotes is very small, so I am certain you will get at least 0.5 out of biaffine.

By the way, can you send me the BiLSTM-CRF configuration you used for CoNLL03? I want to double-check whether the same parameters also help my system; as you can see, on OntoNotes the score difference between 1-layer SGD and 3-layer Adam is very small, but they show a large difference on CoNLL03. And for the CRF, did you only change the batch size to go from 90.03 to 91.67, or did you change something else as well? Also, do you train on train+dev, or just on train and use dev to select models during training? Can you also let me know how you choose the best model?

wangxinyu0922 commented 4 years ago

I mainly follow the design of Yang et al. (2018). The hyper-parameters I think are important are:

| Hyper-parameter | Value |
| --- | --- |
| Batch size | 32 sentences/batch |
| BiLSTM layers | 1 |
| BiLSTM hidden size | 256 |
| Word embedding | fastText 300d (frozen) |
| Char embedding | char BiLSTM, 25d |
| Optimizer | SGD |
| Learning rate | 0.1 |
| Train with dev | True |
| Learning rate decay | ×0.5 if no improvement for 10 epochs |

The model is chosen from the best dev score.
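For the learning-rate decay row, a minimal framework-agnostic sketch of that schedule (a hypothetical helper, not code from either repo): track the best dev score and halve the learning rate after 10 epochs without improvement.

```python
# Hypothetical sketch of the "x0.5 if no improvement for 10 epochs" schedule:
# track the best dev score and halve the learning rate on a plateau.
class DecayOnPlateau:
  def __init__(self, lr=0.1, factor=0.5, patience=10):
    self.lr, self.factor, self.patience = lr, factor, patience
    self.best_score, self.bad_epochs = float("-inf"), 0

  def step(self, dev_score):
    """Report this epoch's dev F1; returns the learning rate for the next epoch."""
    if dev_score > self.best_score:
      self.best_score, self.bad_epochs = dev_score, 0
    else:
      self.bad_epochs += 1
      if self.bad_epochs >= self.patience:
        self.lr *= self.factor
        self.bad_epochs = 0
    return self.lr
```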

juntaoy commented 4 years ago

Thanks a lot:)

speedcell4 commented 4 years ago

Hi, thank you for sharing the source code.

I am also curious about the influence of the document-level BERT. Could you please share your document-level BERT embedding files with me? Since I am not familiar with TensorFlow, running your extract_bert_features.sh script is hard for me. I need both the CoNLL 2003 and OntoNotes 5.0 English files.

Thank you~

juntaoy commented 4 years ago

Hi @speedcell4

The BERT embedding files are large; the OntoNotes one alone is already 27GB. I am afraid you will need to set up BERT and generate them yourself :(

Best,

Juntao

speedcell4 commented 4 years ago

@juntaoy Thanks for your reply. I found there is a doc_key in your jsonlines format example, but what is it? Should I just group all the sentences of a document under the same doc_key?

juntaoy commented 4 years ago

@speedcell4 Yes, the same document goes into the same jsonline and hence shares the same doc_key.
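For example, one document per JSON line could look like the following (the values and entity types are invented for illustration; the keys match what tensorize_example reads in the code above, so the exact conventions of the repo's setup scripts may differ):

```python
# Hypothetical example of one jsonlines record: all sentences of a document
# are grouped under one doc_key; "ners" lists, per sentence,
# [start_token, end_token, type] with sentence-internal token indices.
example = {
  "doc_key": "train_0",
  "sentences": [["John", "Smith", "works", "for", "Acme", "Corp", "."],
                ["He", "lives", "in", "London", "."]],
  "ners": [[[0, 1, "PER"], [4, 5, "ORG"]],
           [[3, 3, "LOC"]]],
}
```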

speedcell4 commented 4 years ago

@juntaoy But for OntoNotes 5.0 there is no -DOCSTART- document indicator; how do you handle this?

juntaoy commented 4 years ago

@speedcell4 OntoNotes is organised by documents, so the path to the document is used as the doc_key. For simplicity you can follow https://github.com/kentonl/e2e-coref/blob/master/setup_training.sh and https://github.com/kentonl/e2e-coref/blob/master/minimize.py to create the json files. You will only need to change the NER annotations to the sentence level.

zhaoxf4 commented 4 years ago

@juntaoy Sorry to bother you. I find that only a few parts of ACE2004 have end flags like "( End )" and "---". Did you use other flags to split the documents?

juntaoy commented 4 years ago

@zhaoxf4 I converted the corpus using code from Dan Roth's group. The code can be found here: https://github.com/CogComp/cogcomp-nlp/tree/master/corpusreaders

zhaoxf4 commented 4 years ago

@juntaoy Thank you very much! Amazing response speed!