tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
https://tomaarsen.github.io/SpanMarkerNER/
Apache License 2.0

deberta-v3 encoder error #56

Open eek opened 5 months ago

eek commented 5 months ago

Hi there!

I was playing around with your Google Colab and wanted to train on few-nerd with encoder_id = "microsoft/deberta-v3-large".
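For context, this is roughly the setup from the notebook with the encoder swapped in (a sketch only; the dataset id, label column, and hyperparameters are my best reconstruction rather than the exact Colab cells):

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# FewNERD (supervised); using the coarse `ner_tags` labels here for the sketch.
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
labels = dataset["train"].features["ner_tags"].feature.names

encoder_id = "microsoft/deberta-v3-large"
model = SpanMarkerModel.from_pretrained(
    encoder_id,
    labels=labels,
    model_max_length=256,  # the "maximum model input length of 256 tokens" in the log below
    entity_max_length=8,   # the "maximum entity length of 8 words" in the log below
)

trainer = Trainer(model=model, train_dataset=dataset["train"])
trainer.train()  # this is the call that raises the RuntimeError below
```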

But when I reach the training step, it fails with RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2:


Tokenizing the train dataset: 100% 131767/131767 [01:16<00:00, 1661.38 examples/s]
This SpanMarker model will ignore 0.339320% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 8 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 340387 total entities:
- 486 missed entities with 9 words (0.142779%)
- 245 missed entities with 10 words (0.071977%)
- 119 missed entities with 11 words (0.034960%)
- 92 missed entities with 12 words (0.027028%)
- 57 missed entities with 13 words (0.016746%)
- 36 missed entities with 14 words (0.010576%)
- 17 missed entities with 15 words (0.004994%)
- 14 missed entities with 16 words (0.004113%)
- 10 missed entities with 17 words (0.002938%)
- 4 missed entities with 18 words (0.001175%)
- 5 missed entities with 19 words (0.001469%)
- 3 missed entities with 20 words (0.000881%)
- 4 missed entities with 21 words (0.001175%)
- 1 missed entities with 22 words (0.000294%)
- 2 missed entities with 23 words (0.000588%)
- 3 missed entities with 24 words (0.000881%)
- 2 missed entities with 25 words (0.000588%)
- 2 missed entities with 26 words (0.000588%)
- 1 missed entities with 27 words (0.000294%)
- 1 missed entities with 29 words (0.000294%)
Additionally, a total of 51 (0.014983%) entities were missed due to the maximum input length.
Spreading data between multiple samples: 100% 131767/131767 [00:18<00:00, 7164.64 examples/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 trainer.train()

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:2735, in Trainer.training_step(self, model, inputs)
   2732     return loss_mb.reduce_mean().detach().to(self.args.device)
   2734 with self.compute_loss_context_manager():
-> 2735     loss = self.compute_loss(model, inputs)
   2737 if self.args.n_gpu > 1:
   2738     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:2758, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2756 else:
   2757     labels = None
-> 2758 outputs = model(**inputs)
   2759 # Save past state if it exists
   2760 # TODO: this needs to be fixed and made cleaner later.
   2761 if self.args.past_index >= 0:

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py:687, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    686 def forward(*args, **kwargs):
--> 687     return model_forward(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py:675, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    674 def __call__(self, *args, **kwargs):
--> 675     return convert_to_fp32(self.model_forward(*args, **kwargs))

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     13 @functools.wraps(func)
     14 def decorate_autocast(*args, **kwargs):
     15     with autocast_instance:
---> 16         return func(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:1062, in DebertaV2Model.forward(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, output_attentions, output_hidden_states, return_dict)
   1059 if token_type_ids is None:
   1060     token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
-> 1062 embedding_output = self.embeddings(
   1063     input_ids=input_ids,
   1064     token_type_ids=token_type_ids,
   1065     position_ids=position_ids,
   1066     mask=attention_mask,
   1067     inputs_embeds=inputs_embeds,
   1068 )
   1070 encoder_outputs = self.encoder(
   1071     embedding_output,
   1072     attention_mask,
   (...)
   1075     return_dict=return_dict,
   1076 )
   1077 encoded_layers = encoder_outputs[1]

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:900, in DebertaV2Embeddings.forward(self, input_ids, token_type_ids, position_ids, mask, inputs_embeds)
    897         mask = mask.unsqueeze(2)
    898     mask = mask.to(embeddings.dtype)
--> 900     embeddings = embeddings * mask
    902 embeddings = self.dropout(embeddings)
    903 return embeddings

RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2

Any ideas whether this encoder can work, and if so, how to make it work? Thanks! There are no issues if I run it on roberta-large, for example.

tomaarsen commented 5 months ago

Hello!

Apologies for the delay. I don't fully remember, but based on 41fdda8f38628c84f40b54d36de7f8ffad5f5843, it seems that DeBERTa is indeed not supported in SpanMarker, as

DeBERTa doesn't support attention mask matrices

So, it's best to stick with BERT or RoBERTa models, it seems!
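For anyone running into the same error, the only change needed in the Colab is the encoder checkpoint. A minimal sketch (same arguments as the sketch above; values are illustrative):

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel

# Same FewNERD label setup as the sketch earlier in this issue.
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
labels = dataset["train"].features["ner_tags"].feature.names

# Pick an encoder whose attention-mask handling works with SpanMarker,
# e.g. a BERT or RoBERTa checkpoint instead of "microsoft/deberta-v3-large".
model = SpanMarkerModel.from_pretrained(
    "roberta-large",
    labels=labels,
    model_max_length=256,
    entity_max_length=8,
)
```

Roughly speaking, DebertaV2Embeddings multiplies the embeddings by the attention mask and expects a per-token mask, while SpanMarker passes a per-token-pair mask matrix, which is where the shape mismatch in the traceback comes from.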