tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
https://tomaarsen.github.io/SpanMarkerNER/
Apache License 2.0

deberta-v3 encoder error #56

Open eek opened 5 months ago

eek commented 5 months ago

Hi there!

I was playing around with your Google Colab and wanted to train on few-nerd with encoder_id = "microsoft/deberta-v3-large".
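For context, this is roughly the setup from the notebook with the encoder swapped in (a sketch only; the dataset id, label column, and hyperparameters are my best reconstruction rather than the exact Colab cells):

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# FewNERD (supervised); using the coarse `ner_tags` labels here for the sketch.
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
labels = dataset["train"].features["ner_tags"].feature.names

encoder_id = "microsoft/deberta-v3-large"
model = SpanMarkerModel.from_pretrained(
    encoder_id,
    labels=labels,
    model_max_length=256,  # the "maximum model input length of 256 tokens" in the log below
    entity_max_length=8,   # the "maximum entity length of 8 words" in the log below
)

trainer = Trainer(model=model, train_dataset=dataset["train"])
trainer.train()  # this is the call that raises the RuntimeError below
```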

But when I reach the training step, it fails with RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2:


Tokenizing the train dataset: 100% 131767/131767 [01:16<00:00, 1661.38 examples/s]
This SpanMarker model will ignore 0.339320% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 8 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 340387 total entities:
- 486 missed entities with 9 words (0.142779%)
- 245 missed entities with 10 words (0.071977%)
- 119 missed entities with 11 words (0.034960%)
- 92 missed entities with 12 words (0.027028%)
- 57 missed entities with 13 words (0.016746%)
- 36 missed entities with 14 words (0.010576%)
- 17 missed entities with 15 words (0.004994%)
- 14 missed entities with 16 words (0.004113%)
- 10 missed entities with 17 words (0.002938%)
- 4 missed entities with 18 words (0.001175%)
- 5 missed entities with 19 words (0.001469%)
- 3 missed entities with 20 words (0.000881%)
- 4 missed entities with 21 words (0.001175%)
- 1 missed entities with 22 words (0.000294%)
- 2 missed entities with 23 words (0.000588%)
- 3 missed entities with 24 words (0.000881%)
- 2 missed entities with 25 words (0.000588%)
- 2 missed entities with 26 words (0.000588%)
- 1 missed entities with 27 words (0.000294%)
- 1 missed entities with 29 words (0.000294%)
Additionally, a total of 51 (0.014983%) entities were missed due to the maximum input length.
Spreading data between multiple samples: 100% 131767/131767 [00:18<00:00, 7164.64 examples/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 trainer.train()

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:2735, in Trainer.training_step(self, model, inputs)
   2732     return loss_mb.reduce_mean().detach().to(self.args.device)
   2734 with self.compute_loss_context_manager():
-> 2735     loss = self.compute_loss(model, inputs)
   2737 if self.args.n_gpu > 1:
   2738     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:2758, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2756 else:
   2757     labels = None
-> 2758 outputs = model(**inputs)
   2759 # Save past state if it exists
   2760 # TODO: this needs to be fixed and made cleaner later.
   2761 if self.args.past_index >= 0:

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py:687, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    686 def forward(*args, **kwargs):
--> 687     return model_forward(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py:675, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    674 def __call__(self, *args, **kwargs):
--> 675     return convert_to_fp32(self.model_forward(*args, **kwargs))

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     13 @functools.wraps(func)
     14 def decorate_autocast(*args, **kwargs):
     15     with autocast_instance:
---> 16         return func(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:1062, in DebertaV2Model.forward(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, output_attentions, output_hidden_states, return_dict)
   1059 if token_type_ids is None:
   1060     token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
-> 1062 embedding_output = self.embeddings(
   1063     input_ids=input_ids,
   1064     token_type_ids=token_type_ids,
   1065     position_ids=position_ids,
   1066     mask=attention_mask,
   1067     inputs_embeds=inputs_embeds,
   1068 )
   1070 encoder_outputs = self.encoder(
   1071     embedding_output,
   1072     attention_mask,
   (...)
   1075     return_dict=return_dict,
   1076 )
   1077 encoded_layers = encoder_outputs[1]

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:900, in DebertaV2Embeddings.forward(self, input_ids, token_type_ids, position_ids, mask, inputs_embeds)
    897         mask = mask.unsqueeze(2)
    898     mask = mask.to(embeddings.dtype)
--> 900     embeddings = embeddings * mask
    902 embeddings = self.dropout(embeddings)
    903 return embeddings

RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2

Any ideas whether this encoder can work, and if so, how to make it work? Thanks! There are no issues if I run it on roberta-large, for example.

tomaarsen commented 5 months ago

Hello!

Apologies for the delay. I don't fully remember, but based on 41fdda8f38628c84f40b54d36de7f8ffad5f5843, it seems that DeBERTa is indeed not supported in SpanMarker, as

DeBERTa doesn't support attention mask matrices

So, it's best to stick with BERT or RoBERTa models, it seems!
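For anyone running into the same error, the only change needed in the Colab is the encoder checkpoint. A minimal sketch (same arguments as the sketch above; values are illustrative):

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel

# Same FewNERD label setup as the sketch earlier in this issue.
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
labels = dataset["train"].features["ner_tags"].feature.names

# Pick an encoder whose attention-mask handling works with SpanMarker,
# e.g. a BERT or RoBERTa checkpoint instead of "microsoft/deberta-v3-large".
model = SpanMarkerModel.from_pretrained(
    "roberta-large",
    labels=labels,
    model_max_length=256,
    entity_max_length=8,
)
```

Roughly speaking, DebertaV2Embeddings multiplies the embeddings by the attention mask and expects a per-token mask, while SpanMarker passes a per-token-pair mask matrix, which is where the shape mismatch in the traceback comes from.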