cleong110 opened this issue 1 year ago
I'm going to make a branch and start annotating until I understand the issue.
From the traceback we can see that evaluate() is called from main(). From main() and from the outputs when I ran this, I can deduce that this is the part of the code that handles evaluation/validation, perhaps at the end of an epoch or after some number of training steps.
https://github.com/krypticmouse/double-bind-training/blob/train-lm-adapter/train_ner_adapter.py#L668 confirms that this is the evaluation code and that it is called from inside the training loop.
The full output from the wandb run also shows that it was in the evaluation phase.
- This IS expected if you are initializing RobertaAdapterModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaAdapterModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaAdapterModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='data/swa/', dataset_language='ner-swa', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=True, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=0.0005, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='roberta-base', model_type='roberta', n_gpu=1, no_cuda=False, num_labels=9, num_train_epochs=3.0, output_dir='swa_sample_2', overwrite_cache=False, overwrite_output_dir=False, path_to_adapter='/tmp/test-mlm', per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', test_prediction_file='test_predictions.txt', test_result_file='test_results.txt', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
Loading features from cached file data/swa/cached_train_roberta-base_164
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Epoch: 0% 0/3 [00:00<?, ?it/s]
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:257: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
***** Running training *****
Num examples = 2109
Num Epochs = 3.0
Instantaneous batch size per GPU = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 198.0
warnings.warn("To get the last learning rate computed by the scheduler, "
Epoch: 33% 1/3 [00:37<01:15, 37.93s/it]
Epoch: 67% 2/3 [01:18<00:39, 39.30s/it]
training loss 0.07202650302648544
Epoch: 100% 3/3 [02:00<00:00, 40.14s/it]
training loss 0.11821245276927948
training loss 0.15650698980689048
global_step = 198, average loss = 0.39521967122952145
Saving model checkpoint to swa_sample_2
Evaluate the following checkpoints: ['swa_sample_2']
Evaluating: 0% 0/38 [00:00<?, ?it/s]There are adapters available but none are activated for the forward pass.
There are adapters available but none are activated for the forward pass.
Evaluating: 0% 0/38 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_ner_adapter.py", line 726, in <module>
main()
File "train_ner_adapter.py", line 685, in main
result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
File "train_ner_adapter.py", line 288, in evaluate
logits = model(avg_emb)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/adapters/models/roberta/adapter_model.py", line 68, in forward
outputs = self.roberta(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/adapters/context.py", line 108, in wrapper_func
results = f(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/roberta/modeling_roberta.py", line 843, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (768) must match the existing size (514) at non-singleton dimension 1. Target sizes: [1312, 768]. Tensor sizes: [1, 514]
Loading features from cached file data/swa/cached_dev_roberta-base_164
***** Running evaluation *****
Num examples = 300
Batch size = 8
Intuitively, there's some size mismatch between the input data and what the model expects. It happens right at the start of the network, and numbers like 768 and 514 are about the right magnitude for a maximum sequence length or an embedding dimension.
But let's keep analyzing. Here are the input arguments for the run:
Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='data/swa/', dataset_language='ner-swa', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=True, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=0.0005, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='roberta-base', model_type='roberta', n_gpu=1, no_cuda=False, num_labels=9, num_train_epochs=3.0, output_dir='swa_sample_2', overwrite_cache=False, overwrite_output_dir=False, path_to_adapter='/tmp/test-mlm', per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', test_prediction_file='test_predictions.txt', test_result_file='test_results.txt', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
max_seq_length is set to 164 and per_gpu_eval_batch_size to 8, and 164 * 8 = 1312.
That's interesting, because the error message says: Target sizes: [1312, 768]. Tensor sizes: [1, 514].
And then 768, where does that come from? I think I will run it again, but this time print out some more values.
OK:
avg_emb shape: torch.Size([8, 164, 768])
Also, there are 13 layers (presumably the embedding output plus the 12 encoder layers).
When I print out their shapes, they are all also torch.Size([8, 164, 768]).
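To check my understanding of where the 1312 and the 768 come from, here is a minimal standalone sketch. It is not the repo's code: it assumes the flex-head adapter model flattens input_ids to 2-D before RobertaEmbeddings sees it, but running it raises exactly the RuntimeError above:

```python
import torch

# avg_emb is a batch of embeddings, not token ids: (batch, seq_len, hidden)
avg_emb = torch.randn(8, 164, 768)

# Assumption: the adapter model reshapes input_ids to 2-D, so the embedding
# tensor ends up looking like token ids of shape (8 * 164, 768) = (1312, 768).
as_input_ids = avg_emb.view(-1, avg_emb.size(-1))
batch_size, seq_length = as_input_ids.shape   # 1312, 768 -- read as (batch, seq)!

# RobertaEmbeddings keeps a registered token_type_ids buffer of shape
# (1, max_position_embeddings) = (1, 514).
buffered_token_type_ids = torch.zeros(1, 514, dtype=torch.long)

# The call that fails inside modeling_roberta.py:
buffered_token_type_ids.expand(batch_size, seq_length)
# RuntimeError: The expanded size of the tensor (768) must match the existing
# size (514) at non-singleton dimension 1. Target sizes: [1312, 768]. Tensor sizes: [1, 514]
```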
When I print out the model:
RobertaAdapterModel(
(shared_parameters): ModuleDict()
(roberta): RobertaModel(
(shared_parameters): ModuleDict()
(invertible_adapters): ModuleDict()
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(
in_features=768, out_features=768, bias=True
(loras): ModuleDict()
)
(key): Linear(
in_features=768, out_features=768, bias=True
(loras): ModuleDict()
)
(value): Linear(
in_features=768, out_features=768, bias=True
(loras): ModuleDict()
)
(dropout): Dropout(p=0.1, inplace=False)
(prefix_tuning): PrefixTuningShim(
(prefix_gates): ModuleDict()
(pool): PrefixTuningPool(
(prefix_tunings): ModuleDict()
)
)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(adapters): ModuleDict()
(adapter_fusion_layer): ModuleDict()
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(
in_features=768, out_features=3072, bias=True
(loras): ModuleDict()
)
(intermediate_act_fn): GELUActivation()
)
(output): RobertaOutput(
(dense): Linear(
in_features=3072, out_features=768, bias=True
(loras): ModuleDict()
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(adapters): ModuleDict(
(mlm): Adapter(
(non_linearity): Activation_Function_Class(
(f): ReLU()
)
(adapter_down): Sequential(
(0): Linear(in_features=768, out_features=48, bias=True)
(1): Activation_Function_Class(
(f): ReLU()
)
)
(adapter_up): Linear(in_features=48, out_features=768, bias=True)
)
)
(adapter_fusion_layer): ModuleDict()
)
)
(1)-(11): RobertaLayer(...)  [11 further layers, identical in structure to (0) above, each with the same "mlm" Adapter in its output block]
)
)
(pooler): RobertaPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
(prefix_tuning): PrefixTuningPool(
(prefix_tunings): ModuleDict()
)
)
(heads): ModuleDict(
(mlm): BertStyleMaskedLMHead(
(0): Linear(in_features=768, out_features=768, bias=True)
(1): Activation_Function_Class(
(f): GELUActivation()
)
(2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(3): Linear(in_features=768, out_features=50265, bias=True)
)
(ner_head): TaggingHead(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=768, out_features=9, bias=True)
)
)
)
There are adapters available but none are activated for the forward pass.
OK, so the 514 likely comes from the model's position length: the position embeddings are Embedding(514, 768), and the buffered token_type_ids tensor in the traceback has the matching shape (1, 514).
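A quick cross-check against the published roberta-base config (a sketch, not run in this exact environment): 514 is max_position_embeddings, which also sizes the token_type_ids buffer that the traceback is trying to expand:

```python
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig.from_pretrained("roberta-base")
print(config.max_position_embeddings)  # 514 (512 usable positions + 2 for the padding offset)
print(config.hidden_size)              # 768

model = RobertaModel.from_pretrained("roberta-base")
print(model.embeddings.position_embeddings.num_embeddings)  # 514
# The registered buffer (present in recent transformers versions) that the
# failing expand() operates on:
print(model.embeddings.token_type_ids.shape)                # torch.Size([1, 514])
```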
Note also that none of the adapters are activated for the forward pass, as the repeated warning says.
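So two things to try (a sketch only: set_active_adapters and active_head are the adapter-transformers API, but the names "mlm" and "ner_head" are simply the ones visible in the printout, and treating avg_emb as inputs_embeds is my guess at what the call site intends):

```python
# Activate the adapter and the NER head so the forward pass actually uses them
# (names taken from the model printout above).
model.set_active_adapters("mlm")
model.active_head = "ner_head"

# And avoid passing the embedding tensor positionally: the first positional
# argument of forward() is input_ids, which is how a float (8, 164, 768) tensor
# gets reinterpreted as (1312, 768) token ids. If the intent really is to feed
# precomputed embeddings, the keyword would be inputs_embeds:
# logits = model(inputs_embeds=avg_emb)
```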