mrpeerat opened this issue 2 years ago
Hi @mrpeerat, thank you for reporting the problem.
It seems that the problem can be solved by setting tokenizer.task = "entity_span_classification".
This should have been set by default when instantiating the tokenizer from the Hugging Face Hub... We have fixed that, so it should be fine if you re-run the entire notebook. Alternatively, you can work around the issue by manually setting the task attribute as above.
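For reference, either of the following should work (a minimal sketch; the checkpoint name is the one used in this thread):

from transformers import MLukeTokenizer

# Option 1: pass the task when loading the tokenizer.
tokenizer = MLukeTokenizer.from_pretrained(
    "studio-ousia/mluke-large-lite-finetuned-conll-2003", task="entity_span_classification"
)

# Option 2: set the attribute afterwards, as suggested above.
tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-large-lite-finetuned-conll-2003")
tokenizer.task = "entity_span_classification"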
Hi again, the bug is fixed. Thank you for the suggestion, @Ryou0634.
However, I found that some of mLUKE's predictions are missing.
For example, I use the Spanish CoNLL-2002 data from https://www.kaggle.com/datasets/nltkdata/conll-corpora (esp.testa).
Everything looks fine until the evaluation part:
print(seqeval.metrics.classification_report([final_labels], [final_predictions], digits=4))
ValueError                                Traceback (most recent call last)
Cell In [36], line 1
----> 1 print(seqeval.metrics.classification_report([final_labels], [final_predictions], digits=4))

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/sequence_labeling.py:692, in classification_report(y_true, y_pred, digits, suffix, output_dict, mode, sample_weight, zero_division, scheme)
    689 reporter = StringReporter(width=width, digits=digits)
    691 # compute per-class scores.
--> 692 p, r, f1, s = precision_recall_fscore_support(
    693     y_true, y_pred,
    694     average=None,
    695     sample_weight=sample_weight,
    696     zero_division=zero_division,
    697     suffix=suffix
    698 )
    699 for row in zip(target_names, p, r, f1, s):
    700     reporter.write(*row)

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/sequence_labeling.py:130, in precision_recall_fscore_support(y_true, y_pred, average, warn_for, beta, sample_weight, zero_division, suffix)
    126     true_sum = np.append(true_sum, len(entities_true_type))
    128     return pred_sum, tp_sum, true_sum
--> 130 precision, recall, f_score, true_sum = _precision_recall_fscore_support(
    131     y_true, y_pred,
    132     average=average,
    133     warn_for=warn_for,
    134     beta=beta,
    135     sample_weight=sample_weight,
    136     zero_division=zero_division,
    137     scheme=None,
    138     suffix=suffix,
    139     extract_tp_actual_correct=extract_tp_actual_correct
    140 )
    142 return precision, recall, f_score, true_sum

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/v1.py:122, in _precision_recall_fscore_support(y_true, y_pred, average, warn_for, beta, sample_weight, zero_division, scheme, suffix, extract_tp_actual_correct)
    119 if average not in average_options:
    120     raise ValueError('average has to be one of {}'.format(average_options))
--> 122 check_consistent_length(y_true, y_pred)
    124 pred_sum, tp_sum, true_sum = extract_tp_actual_correct(y_true, y_pred, suffix, scheme)
    126 if average == 'micro':

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/v1.py:101, in check_consistent_length(y_true, y_pred)
     99 if len(y_true) != len(y_pred) or len_true != len_pred:
    100     message = 'Found input variables with inconsistent numbers of samples:\n{}\n{}'.format(len_true, len_pred)
--> 101 raise ValueError(message)

ValueError: Found input variables with inconsistent numbers of samples:
[52923]
[52911]
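For reference, seqeval's check_consistent_length requires the label and prediction sequences to have matching lengths; a quick check like the following (a sketch using the variable names from the cell above; the per-document lists are hypothetical) can help locate which sentences lose predictions:

# Compare the flattened lengths first; here they differ by 12 tokens.
print(len(final_labels), len(final_predictions))

# If labels and predictions are also kept per document (hypothetical names),
# the mismatch can be narrowed down to the offending documents:
# for doc_id, (gold, pred) in enumerate(zip(labels_by_doc, predictions_by_doc)):
#     if len(gold) != len(pred):
#         print(doc_id, len(gold), len(pred))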
Thank you.
Also, the performance of mLUKE on CoNLL-2003 (English) drops significantly compared to the paper (I use eng.testb).
It seems the notebook is not compatible with mLUKE, because the notebook was created for LUKE and there have been significant updates to the codebase in this repository since then... (I guess that something is not compatible with MLukeTokenizer in preprocessing.)
You may consider using examples/ner/evaluate_transformers_checkpoint.py to evaluate multilingual models.
I got the same results (degraded performance) with mluke-large-lite-finetuned-conll-2003 on CoNLL-2003 (English) when using the notebook (while luke-large-finetuned-conll-2003 worked fine).
I then tried the reproduction with examples/ner/evaluate_transformers_checkpoint.py instead, but the performance degraded even further. (luke-large-finetuned-conll-2003 also did not work with the script.)
Could you give me some advice?
$ python examples/ner/evaluate_transformers_checkpoint.py data/conll-2003/eng.testb studio-ousia/mluke-large-lite-finetuned-conll-2003 --cuda-device 0
Use the tokenizer: studio-ousia/mluke-large-lite
Use the model config: examples/ner/configs/lib/transformers_model_luke.jsonnet
[2023-06-01 19:13:25,548] [INFO] type = span_ner
[2023-06-01 19:13:25,549] [INFO] feature_extractor.type = token+entity
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.type = transformers-luke
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.model_name = studio-ousia/mluke-large-lite-finetuned-conll-2003
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.train_parameters = True
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.output_embeddings = token+entity
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.use_entity_aware_attention = False
Some weights of the model checkpoint at studio-ousia/mluke-large-lite-finetuned-conll-2003 were not used when initializing LukeModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing LukeModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2023-06-01 19:13:31,430] [INFO] dropout = 0.1
[2023-06-01 19:13:31,430] [INFO] label_name_space = labels
[2023-06-01 19:13:31,430] [INFO] text_field_key = tokens
[2023-06-01 19:13:31,430] [INFO] prediction_save_path = None
loading instances: 5211it [00:17, 292.83it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 163/163 [03:10<00:00, 1.17s/it]
{'f1': 0.0024849130280440185, 'precision': 0.0012468827930174563, 'recall': 0.35, 'span_accuracy': 0.9846012216309538}
After editing examples/ner/evaluate_transformers_checkpoint.py so that ConllSpanReader uses iob_scheme="iob1", luke-large-finetuned-conll-2003 and mluke-large-lite-finetuned-conll-2003 gave F1 scores of 94.61 and 94.05, respectively.
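For readers hitting the same issue: the original CoNLL-2003 files (e.g. eng.testb) are annotated in the IOB1 scheme, so reading them with the wrong scheme silently breaks entity extraction. A small illustration (plain example sequences, not taken from the dataset):

# Two adjacent PER entities, encoded in the two schemes.
# IOB1: entities start with I-; B- is only used when an entity immediately
#       follows another entity of the same type.
iob1 = ["I-PER", "I-PER", "B-PER", "O"]
# IOB2: every entity starts with B-.
iob2 = ["B-PER", "I-PER", "B-PER", "O"]
# Parsing IOB1 tags as if they were IOB2 therefore gets the entity boundaries
# wrong, which is consistent with the low scores before the iob_scheme fix.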
I will try to update the notebook for mLUKE so that I and other users can easily reproduce the results.
@chantera The dataset format was somewhat hidden behind the dataset reader's options, and we should have made it more explicit. Thank you for bringing this to users' attention!
I finally figured out why the performance of mluke-large-lite-finetuned-conll-2003 differs between the notebook and examples/ner/evaluate_transformers_checkpoint.py.
In the evaluation script, TokenEntityNERFeatureExtractor prepares entity_attention_mask as follows (examples/ner/modules/token_and_entity.py#L31):
inputs["entity_attention_mask"] = entity_ids != 0
However, entity_ids are all 0 because the entity_mask_token ("[MASK]") has id 0 when using the LUKE/mLUKE tokenizer. In addition, entity_ids are padded with 0 for batch computation (examples/ner/reader.py#L206). As a result, entity_attention_mask is all False; thus, the attention weights would not be calculated appropriately.
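A quick way to see this in isolation is to inspect the tokenizer output directly (a minimal sketch; the tokenizer name is taken from the log above, the example sentence and spans are arbitrary, and the all-zero ids follow the observation above):

from transformers import MLukeTokenizer

# With task="entity_span_classification", every candidate span is encoded as the
# [MASK] entity, whose id is reported above to be 0 for the mLUKE entity vocabulary.
tokenizer = MLukeTokenizer.from_pretrained(
    "studio-ousia/mluke-large-lite", task="entity_span_classification"
)
inputs = tokenizer("Tokyo is in Japan.", entity_spans=[(0, 5), (12, 17)], return_tensors="pt")
print(inputs["entity_ids"])       # all zeros, per the observation above
print(inputs["entity_ids"] != 0)  # all False -> a mask built this way disables every entity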
To confirm that this causes the discrepancy, I added the following code to the notebook:
inputs = tokenizer(texts, entity_spans=entity_spans, return_tensors="pt", padding=True)
###
inputs["entity_attention_mask"] = torch.zeros_like(inputs["entity_attention_mask"])
###
inputs = inputs.to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
all_logits.extend(outputs.logits.tolist())
This worked and gave an F1 score of 94.23.
I will send a PR to fix this soon.
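Until the PR lands, one possible workaround (a sketch only; it assumes the unpadded entity count per example is available to the feature extractor, which is not how the current code is written, and it is not necessarily the fix the PR will take) is to build the mask from that count instead of from the ids:

import torch

def build_entity_attention_mask(entity_ids: torch.Tensor, num_entities: torch.Tensor) -> torch.Tensor:
    """Mark the first num_entities[b] slots of each row as attendable.

    entity_ids:   (batch, max_entities) LongTensor, padded with 0; the real [MASK]
                  entity id can itself be 0, so `entity_ids != 0` is unreliable.
    num_entities: (batch,) LongTensor with the unpadded entity count per example
                  (hypothetical input; the reader would need to expose this).
    """
    positions = torch.arange(entity_ids.size(1), device=entity_ids.device)
    return positions.unsqueeze(0) < num_entities.unsqueeze(1)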
Regarding "I guess that something is not compatible with MLukeTokenizer in preprocessing": note that the preprocessing in the notebook did not cause the problem. I adapted it for the reader as follows and confirmed that it achieves similar performance:
def data_to_instance(self, words: List[str], labels: List[str], sentence_boundaries: List[int], doc_index: str):
    subword_lengths = [len(self.tokenizer.tokenize(w)) for w in words]
    total_subword_length = sum(subword_lengths)
    max_token_length = self.max_num_subwords
    max_mention_length = self.max_mention_length

    entities = {}
    for s, e in zip(sentence_boundaries[:-1], sentence_boundaries[1:]):
        for ent in Entities([labels[s:e]], scheme=self.iob_scheme).entities[0]:
            entities[(ent.start + s, ent.end + s)] = ent.tag

    for i in range(len(sentence_boundaries) - 1):
        sentence_start, sentence_end = sentence_boundaries[i:i+2]
        if total_subword_length <= max_token_length:
            context_start = 0
            context_end = len(words)
        else:
            context_start = sentence_start
            context_end = sentence_end
            cur_length = sum(subword_lengths[context_start:context_end])
            while True:
                if context_start > 0:
                    if cur_length + subword_lengths[context_start - 1] <= max_token_length:
                        cur_length += subword_lengths[context_start - 1]
                        context_start -= 1
                    else:
                        break
                if context_end < len(words):
                    if cur_length + subword_lengths[context_end] <= max_token_length:
                        cur_length += subword_lengths[context_end]
                        context_end += 1
                    else:
                        break

        text = ""
        for word in words[context_start:sentence_start]:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            text += word
            text += " "

        sentence_words = words[sentence_start:sentence_end]
        sentence_subword_lengths = subword_lengths[sentence_start:sentence_end]

        word_start_char_positions = []
        word_end_char_positions = []
        for word in sentence_words:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            word_start_char_positions.append(len(text))
            text += word
            word_end_char_positions.append(len(text))
            text += " "

        for word in words[sentence_end:context_end]:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            text += word
            text += " "
        text = text.rstrip()

        entity_spans = []
        original_word_spans = []
        original_entity_spans = []
        labels = []
        for word_start in range(len(sentence_words)):
            for word_end in range(word_start, len(sentence_words)):
                if sum(sentence_subword_lengths[word_start:word_end + 1]) <= max_mention_length:
                    entity_spans.append(
                        (word_start_char_positions[word_start], word_end_char_positions[word_end])
                    )
                    original_word_spans.append(
                        (word_start, word_end + 1)
                    )
                    original_entity_span = (word_start + sentence_start, word_end + 1 + sentence_start)
                    labels.append(entities.get(original_entity_span, NON_ENTITY))
                    original_entity_spans.append(original_entity_span)

        self.tokenizer.tokenizer.task = "entity_span_classification"
        inputs = self.tokenizer.tokenizer(text, entity_spans=entity_spans)
        word_ids = self.tokenizer.tokenizer.convert_ids_to_tokens(inputs["input_ids"])
        entity_ids = inputs["entity_ids"]

        split_size = math.ceil(len(entity_ids) / self.max_entity_length)
        for i in range(split_size):
            entity_size = math.ceil(len(entity_ids) / split_size)
            start = i * entity_size
            end = start + entity_size
            fields = {
                "word_ids": TextField([Token(w) for w in word_ids], token_indexers=self.token_indexers),
                "entity_start_positions": TensorField(np.array(inputs["entity_start_positions"][start:end])),
                "entity_end_positions": TensorField(np.array(inputs["entity_end_positions"][start:end])),
                "original_entity_spans": TensorField(np.array(original_entity_spans[start:end]), padding_value=-1),
                "labels": ListField([LabelField(l) for l in labels[start:end]]),
                "doc_id": MetadataField(doc_index),
                "input_words": MetadataField(words),
                "entity_ids": TensorField(np.array(entity_ids[start:end]), padding_value=0),
                "entity_position_ids": TensorField(np.array(inputs["entity_position_ids"][start:end])),
            }
            yield Instance(fields)
Hi!
I'm trying to run mLUKE on https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb by replacing studio-ousia/luke-large-finetuned-conll-2003 with studio-ousia/mluke-large-lite-finetuned-conll-2003 and changing LukeTokenizer to MLukeTokenizer. Everything looks fine until the block that runs the model. The error is:
AttributeError                            Traceback (most recent call last)
Cell In [8], line 12
     10 inputs = inputs.to("cuda")
     11 with torch.no_grad():
---> 12     outputs = model(**inputs)
     13 all_logits.extend(outputs.logits.tolist())

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/transformers/models/luke/modeling_luke.py:1588, in LukeForEntitySpanClassification.forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, entity_start_positions, entity_end_positions, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   1571 outputs = self.luke(
   1572     input_ids=input_ids,
   1573     attention_mask=attention_mask,
   (...)
   1584     return_dict=True,
   1585 )
   1586 hidden_size = outputs.last_hidden_state.size(-1)
-> 1588 entity_start_positions = entity_start_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
   1589 start_states = torch.gather(outputs.last_hidden_state, -2, entity_start_positions)
   1590 entity_end_positions = entity_end_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
AttributeError: 'NoneType' object has no attribute 'unsqueeze'
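For context, this matches the fix discussed at the top of the thread: without task="entity_span_classification", the tokenizer does not produce entity_start_positions / entity_end_positions, so LukeForEntitySpanClassification receives None for them, which is exactly the 'NoneType' error above. A minimal check (a sketch; the checkpoint name is the one used in the notebook, and the example sentence and spans are arbitrary):

from transformers import MLukeTokenizer

name = "studio-ousia/mluke-large-lite-finetuned-conll-2003"

# Without the task, the span-classification fields are not produced.
tokenizer = MLukeTokenizer.from_pretrained(name)
encoding = tokenizer("Tokyo is in Japan.", entity_spans=[(0, 5), (12, 17)])
print("entity_start_positions" in encoding)  # False -> the model receives None

# With the task set, the fields are present and the forward pass works.
tokenizer = MLukeTokenizer.from_pretrained(name, task="entity_span_classification")
encoding = tokenizer("Tokyo is in Japan.", entity_spans=[(0, 5), (12, 17)])
print("entity_start_positions" in encoding)  # True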
Thank you.