mrpeerat opened this issue 2 years ago
Hi @mrpeerat, thank you for reporting the problem.
It seems that the problem can be solved by setting tokenizer.task = "entity_span_classification".
This should have been set by default when instantiating the tokenizer from the Hugging Face Hub... We have fixed that, so it should be fine if you re-run the entire notebook. Alternatively, you can work around the issue by manually setting the task attribute as above.
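For reference, either of the following should work (a minimal sketch; the checkpoint name is the one used in this thread):

from transformers import MLukeTokenizer

# Option 1: pass the task when loading the tokenizer.
tokenizer = MLukeTokenizer.from_pretrained(
    "studio-ousia/mluke-large-lite-finetuned-conll-2003", task="entity_span_classification"
)

# Option 2: set the attribute afterwards, as suggested above.
tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-large-lite-finetuned-conll-2003")
tokenizer.task = "entity_span_classification"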
Hi again, the bug is fixed. Thank you for the suggestion, @Ryou0634.
However, I found that some of mLUKE's predictions are missing.
For example, I use the Spanish CoNLL-2002 data from https://www.kaggle.com/datasets/nltkdata/conll-corpora (esp.testa).
Everything looks fine until the evaluation part:
print(seqeval.metrics.classification_report([final_labels], [final_predictions], digits=4))
ValueError                                Traceback (most recent call last)
Cell In [36], line 1
----> 1 print(seqeval.metrics.classification_report([final_labels], [final_predictions], digits=4))

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/sequence_labeling.py:692, in classification_report(y_true, y_pred, digits, suffix, output_dict, mode, sample_weight, zero_division, scheme)
    689 reporter = StringReporter(width=width, digits=digits)
    691 # compute per-class scores.
--> 692 p, r, f1, s = precision_recall_fscore_support(
    693     y_true, y_pred,
    694     average=None,
    695     sample_weight=sample_weight,
    696     zero_division=zero_division,
    697     suffix=suffix
    698 )
    699 for row in zip(target_names, p, r, f1, s):
    700     reporter.write(*row)

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/sequence_labeling.py:130, in precision_recall_fscore_support(y_true, y_pred, average, warn_for, beta, sample_weight, zero_division, suffix)
    126     true_sum = np.append(true_sum, len(entities_true_type))
    128     return pred_sum, tp_sum, true_sum
--> 130 precision, recall, f_score, true_sum = _precision_recall_fscore_support(
    131     y_true, y_pred,
    132     average=average,
    133     warn_for=warn_for,
    134     beta=beta,
    135     sample_weight=sample_weight,
    136     zero_division=zero_division,
    137     scheme=None,
    138     suffix=suffix,
    139     extract_tp_actual_correct=extract_tp_actual_correct
    140 )
    142 return precision, recall, f_score, true_sum

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/v1.py:122, in _precision_recall_fscore_support(y_true, y_pred, average, warn_for, beta, sample_weight, zero_division, scheme, suffix, extract_tp_actual_correct)
    119 if average not in average_options:
    120     raise ValueError('average has to be one of {}'.format(average_options))
--> 122 check_consistent_length(y_true, y_pred)
    124 pred_sum, tp_sum, true_sum = extract_tp_actual_correct(y_true, y_pred, suffix, scheme)
    126 if average == 'micro':

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/seqeval/metrics/v1.py:101, in check_consistent_length(y_true, y_pred)
     99 if len(y_true) != len(y_pred) or len_true != len_pred:
    100     message = 'Found input variables with inconsistent numbers of samples:\n{}\n{}'.format(len_true, len_pred)
--> 101 raise ValueError(message)

ValueError: Found input variables with inconsistent numbers of samples:
[52923]
[52911]
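For reference, seqeval's check_consistent_length requires the label and prediction sequences to have matching lengths; a quick check like the following (a sketch using the variable names from the cell above; the per-document lists are hypothetical) can help locate which sentences lose predictions:

# Compare the flattened lengths first; here they differ by 12 tokens.
print(len(final_labels), len(final_predictions))

# If labels and predictions are also kept per document (hypothetical names),
# the mismatch can be narrowed down to the offending documents:
# for doc_id, (gold, pred) in enumerate(zip(labels_by_doc, predictions_by_doc)):
#     if len(gold) != len(pred):
#         print(doc_id, len(gold), len(pred))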
Thank you.
Also, the performance of mLUKE on CoNLL-2003 (English) drops significantly compared to the paper (I use eng.testb).
It seems the notebook is not compatible with mLUKE, because the notebook was created for LUKE and there have been significant updates to the codebase in this repository since then... (I guess that something is not compatible with MLukeTokenizer in preprocessing.)
You may consider using examples/ner/evaluate_transformers_checkpoint.py to evaluate multilingual models.
I got the same results (degraded performance) with mluke-large-lite-finetuned-conll-2003 on CoNLL-2003 (English) when using the notebook (while luke-large-finetuned-conll-2003 worked fine).
I then tried the reproduction with examples/ner/evaluate_transformers_checkpoint.py instead, but the performance degraded even further. (luke-large-finetuned-conll-2003 also did not work with the script.)
Could you give me some advice?
$ python examples/ner/evaluate_transformers_checkpoint.py data/conll-2003/eng.testb studio-ousia/mluke-large-lite-finetuned-conll-2003 --cuda-device 0
Use the tokenizer: studio-ousia/mluke-large-lite
Use the model config: examples/ner/configs/lib/transformers_model_luke.jsonnet
[2023-06-01 19:13:25,548] [INFO] type = span_ner
[2023-06-01 19:13:25,549] [INFO] feature_extractor.type = token+entity
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.type = transformers-luke
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.model_name = studio-ousia/mluke-large-lite-finetuned-conll-2003
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.train_parameters = True
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.output_embeddings = token+entity
[2023-06-01 19:13:25,550] [INFO] feature_extractor.embedder.use_entity_aware_attention = False
Some weights of the model checkpoint at studio-ousia/mluke-large-lite-finetuned-conll-2003 were not used when initializing LukeModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing LukeModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LukeModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2023-06-01 19:13:31,430] [INFO] dropout = 0.1
[2023-06-01 19:13:31,430] [INFO] label_name_space = labels
[2023-06-01 19:13:31,430] [INFO] text_field_key = tokens
[2023-06-01 19:13:31,430] [INFO] prediction_save_path = None
loading instances: 5211it [00:17, 292.83it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 163/163 [03:10<00:00, 1.17s/it]
{'f1': 0.0024849130280440185, 'precision': 0.0012468827930174563, 'recall': 0.35, 'span_accuracy': 0.9846012216309538}
After editing examples/ner/evaluate_transformers_checkpoint.py so that ConllSpanReader uses iob_scheme="iob1", luke-large-finetuned-conll-2003 and mluke-large-lite-finetuned-conll-2003 gave F1 scores of 94.61 and 94.05, respectively.
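For readers hitting the same issue: the original CoNLL-2003 files (e.g. eng.testb) are annotated in the IOB1 scheme, so reading them with the wrong scheme silently breaks entity extraction. A small illustration (plain example sequences, not taken from the dataset):

# Two adjacent PER entities, encoded in the two schemes.
# IOB1: entities start with I-; B- is only used when an entity immediately
#       follows another entity of the same type.
iob1 = ["I-PER", "I-PER", "B-PER", "O"]
# IOB2: every entity starts with B-.
iob2 = ["B-PER", "I-PER", "B-PER", "O"]
# Parsing IOB1 tags as if they were IOB2 therefore gets the entity boundaries
# wrong, which is consistent with the low scores before the iob_scheme fix.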
I will try to update the notebook for mLUKE so that I and other users can easily reproduce the results.
@chantera The dataset format was somewhat hidden behind the dataset reader's options, and we should have made it more explicit. Thank you for bringing this to users' attention!
I finally figured out why the performance of mluke-large-lite-finetuned-conll-2003 differs between the notebook and examples/ner/evaluate_transformers_checkpoint.py.
In the evaluation script, TokenEntityNERFeatureExtractor prepares entity_attention_mask as follows (examples/ner/modules/token_and_entity.py#L31):
inputs["entity_attention_mask"] = entity_ids != 0
However, entity_ids are all 0 because the entity_mask_token ("[MASK]") has id 0 when using the LUKE/mLUKE tokenizer. In addition, entity_ids are padded with 0 for batch computation (examples/ner/reader.py#L206). As a result, entity_attention_mask is all False; thus, the attention weights would not be calculated appropriately.
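A quick way to see this in isolation is to inspect the tokenizer output directly (a minimal sketch; the tokenizer name is taken from the log above, the example sentence and spans are arbitrary, and the all-zero ids follow the observation above):

from transformers import MLukeTokenizer

# With task="entity_span_classification", every candidate span is encoded as the
# [MASK] entity, whose id is reported above to be 0 for the mLUKE entity vocabulary.
tokenizer = MLukeTokenizer.from_pretrained(
    "studio-ousia/mluke-large-lite", task="entity_span_classification"
)
inputs = tokenizer("Tokyo is in Japan.", entity_spans=[(0, 5), (12, 17)], return_tensors="pt")
print(inputs["entity_ids"])       # all zeros, per the observation above
print(inputs["entity_ids"] != 0)  # all False -> a mask built this way disables every entity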
To confirm that this causes the discrepancy, I added the following code to the notebook:
inputs = tokenizer(texts, entity_spans=entity_spans, return_tensors="pt", padding=True)
###
inputs["entity_attention_mask"] = torch.zeros_like(inputs["entity_attention_mask"])
###
inputs = inputs.to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
all_logits.extend(outputs.logits.tolist())
This worked and gave an F1 score of 94.23.
I will send a PR to fix this soon.
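Until the PR lands, one possible workaround (a sketch only; it assumes the unpadded entity count per example is available to the feature extractor, which is not how the current code is written, and it is not necessarily the fix the PR will take) is to build the mask from that count instead of from the ids:

import torch

def build_entity_attention_mask(entity_ids: torch.Tensor, num_entities: torch.Tensor) -> torch.Tensor:
    """Mark the first num_entities[b] slots of each row as attendable.

    entity_ids:   (batch, max_entities) LongTensor, padded with 0; the real [MASK]
                  entity id can itself be 0, so `entity_ids != 0` is unreliable.
    num_entities: (batch,) LongTensor with the unpadded entity count per example
                  (hypothetical input; the reader would need to expose this).
    """
    positions = torch.arange(entity_ids.size(1), device=entity_ids.device)
    return positions.unsqueeze(0) < num_entities.unsqueeze(1)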
Regarding "I guess that something is not compatible with MLukeTokenizer in preprocessing": note that the preprocessing in the notebook did not cause the problem. I adapted it for the reader as follows and confirmed that it achieves similar performance:
def data_to_instance(self, words: List[str], labels: List[str], sentence_boundaries: List[int], doc_index: str):
    subword_lengths = [len(self.tokenizer.tokenize(w)) for w in words]
    total_subword_length = sum(subword_lengths)
    max_token_length = self.max_num_subwords
    max_mention_length = self.max_mention_length

    entities = {}
    for s, e in zip(sentence_boundaries[:-1], sentence_boundaries[1:]):
        for ent in Entities([labels[s:e]], scheme=self.iob_scheme).entities[0]:
            entities[(ent.start + s, ent.end + s)] = ent.tag

    for i in range(len(sentence_boundaries) - 1):
        sentence_start, sentence_end = sentence_boundaries[i:i+2]
        if total_subword_length <= max_token_length:
            context_start = 0
            context_end = len(words)
        else:
            context_start = sentence_start
            context_end = sentence_end
            cur_length = sum(subword_lengths[context_start:context_end])
            while True:
                if context_start > 0:
                    if cur_length + subword_lengths[context_start - 1] <= max_token_length:
                        cur_length += subword_lengths[context_start - 1]
                        context_start -= 1
                    else:
                        break
                if context_end < len(words):
                    if cur_length + subword_lengths[context_end] <= max_token_length:
                        cur_length += subword_lengths[context_end]
                        context_end += 1
                    else:
                        break

        text = ""
        for word in words[context_start:sentence_start]:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            text += word
            text += " "

        sentence_words = words[sentence_start:sentence_end]
        sentence_subword_lengths = subword_lengths[sentence_start:sentence_end]

        word_start_char_positions = []
        word_end_char_positions = []
        for word in sentence_words:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            word_start_char_positions.append(len(text))
            text += word
            word_end_char_positions.append(len(text))
            text += " "

        for word in words[sentence_end:context_end]:
            # if word[0] == "'" or (len(word) == 1 and is_punctuation(word)):
            #     text = text.rstrip()
            text += word
            text += " "
        text = text.rstrip()

        entity_spans = []
        original_word_spans = []
        original_entity_spans = []
        labels = []
        for word_start in range(len(sentence_words)):
            for word_end in range(word_start, len(sentence_words)):
                if sum(sentence_subword_lengths[word_start:word_end + 1]) <= max_mention_length:
                    entity_spans.append(
                        (word_start_char_positions[word_start], word_end_char_positions[word_end])
                    )
                    original_word_spans.append(
                        (word_start, word_end + 1)
                    )
                    original_entity_span = (word_start + sentence_start, word_end + 1 + sentence_start)
                    labels.append(entities.get(original_entity_span, NON_ENTITY))
                    original_entity_spans.append(original_entity_span)

        self.tokenizer.tokenizer.task = "entity_span_classification"
        inputs = self.tokenizer.tokenizer(text, entity_spans=entity_spans)
        word_ids = self.tokenizer.tokenizer.convert_ids_to_tokens(inputs["input_ids"])
        entity_ids = inputs["entity_ids"]

        split_size = math.ceil(len(entity_ids) / self.max_entity_length)
        for i in range(split_size):
            entity_size = math.ceil(len(entity_ids) / split_size)
            start = i * entity_size
            end = start + entity_size
            fields = {
                "word_ids": TextField([Token(w) for w in word_ids], token_indexers=self.token_indexers),
                "entity_start_positions": TensorField(np.array(inputs["entity_start_positions"][start:end])),
                "entity_end_positions": TensorField(np.array(inputs["entity_end_positions"][start:end])),
                "original_entity_spans": TensorField(np.array(original_entity_spans[start:end]), padding_value=-1),
                "labels": ListField([LabelField(l) for l in labels[start:end]]),
                "doc_id": MetadataField(doc_index),
                "input_words": MetadataField(words),
                "entity_ids": TensorField(np.array(entity_ids[start:end]), padding_value=0),
                "entity_position_ids": TensorField(np.array(inputs["entity_position_ids"][start:end])),
            }
            yield Instance(fields)
Hi!
I'm trying to run mLUKE on https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb by replacing studio-ousia/luke-large-finetuned-conll-2003 with studio-ousia/mluke-large-lite-finetuned-conll-2003 and changing LukeTokenizer to MLukeTokenizer. Everything looks fine until the block that runs the model. The error is:
AttributeError                            Traceback (most recent call last)
Cell In [8], line 12
     10 inputs = inputs.to("cuda")
     11 with torch.no_grad():
---> 12     outputs = model(**inputs)
     13 all_logits.extend(outputs.logits.tolist())

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/spacy_env/lib/python3.9/site-packages/transformers/models/luke/modeling_luke.py:1588, in LukeForEntitySpanClassification.forward(self, input_ids, attention_mask, token_type_ids, position_ids, entity_ids, entity_attention_mask, entity_token_type_ids, entity_position_ids, entity_start_positions, entity_end_positions, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   1571 outputs = self.luke(
   1572     input_ids=input_ids,
   1573     attention_mask=attention_mask,
   (...)
   1584     return_dict=True,
   1585 )
   1586 hidden_size = outputs.last_hidden_state.size(-1)
-> 1588 entity_start_positions = entity_start_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
   1589 start_states = torch.gather(outputs.last_hidden_state, -2, entity_start_positions)
   1590 entity_end_positions = entity_end_positions.unsqueeze(-1).expand(-1, -1, hidden_size)
AttributeError: 'NoneType' object has no attribute 'unsqueeze'
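For context, this matches the fix discussed at the top of the thread: without task="entity_span_classification", the tokenizer does not produce entity_start_positions / entity_end_positions, so LukeForEntitySpanClassification receives None for them, which is exactly the 'NoneType' error above. A minimal check (a sketch; the checkpoint name is the one used in the notebook, and the example sentence and spans are arbitrary):

from transformers import MLukeTokenizer

name = "studio-ousia/mluke-large-lite-finetuned-conll-2003"

# Without the task, the span-classification fields are not produced.
tokenizer = MLukeTokenizer.from_pretrained(name)
encoding = tokenizer("Tokyo is in Japan.", entity_spans=[(0, 5), (12, 17)])
print("entity_start_positions" in encoding)  # False -> the model receives None

# With the task set, the fields are present and the forward pass works.
tokenizer = MLukeTokenizer.from_pretrained(name, task="entity_span_classification")
encoding = tokenizer("Tokyo is in Japan.", entity_spans=[(0, 5), (12, 17)])
print("entity_start_positions" in encoding)  # True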
Thank you.