krypticmouse / double-bind-training


RuntimeError in train_ner_adapter: expanded size of the tensor must match the existing size at non-singleton dimension 1. #1

Open cleong110 opened 1 year ago

cleong110 commented 1 year ago
Evaluating:   0% 0/38 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_ner_adapter.py", line 726, in <module>
    main()
  File "train_ner_adapter.py", line 685, in main
    result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
  File "train_ner_adapter.py", line 288, in evaluate
    logits = model(avg_emb)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/adapters/models/roberta/adapter_model.py", line 68, in forward
    outputs = self.roberta(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/adapters/context.py", line 108, in wrapper_func
    results = f(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/roberta/modeling_roberta.py", line 843, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (768) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1312, 768].  Tensor sizes: [1, 514]
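
For reference, the error itself is just PyTorch's rule for expand(): only size-1 dimensions can be broadcast, so every non-singleton dimension has to match the target exactly. A minimal sketch with stand-in tensors (not the actual ones from train_ner_adapter.py):

import torch

# Stand-in for the buffered token_type_ids in the traceback: shape [1, 514].
buffered = torch.zeros(1, 514, dtype=torch.long)

print(buffered.expand(1312, 514).shape)  # fine: dim 0 is size 1, dim 1 already matches

try:
    buffered.expand(1312, 768)  # fails: dim 1 has size 514, which is neither 1 nor 768
except RuntimeError as e:
    print(e)  # "The expanded size of the tensor (768) must match the existing size (514) ..."
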
cleong110 commented 1 year ago

I'm going to make a branch and start annotating until I understand the issue.

cleong110 commented 1 year ago

From the traceback we can learn that the evaluate() function is being called from main(). From the main function and the output of my run, I can deduce that this is the part of the code that handles evaluation/validation, probably at the end of an epoch or after a set number of training steps.

cleong110 commented 1 year ago

https://github.com/krypticmouse/double-bind-training/blob/train-lm-adapter/train_ner_adapter.py#L668 confirms that this code performs evaluation and is called from within the training loop.

cleong110 commented 1 year ago

Looking at the full output from the wandb run also shows that it was doing evaluation:

- This IS expected if you are initializing RobertaAdapterModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaAdapterModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaAdapterModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='data/swa/', dataset_language='ner-swa', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=True, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=0.0005, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='roberta-base', model_type='roberta', n_gpu=1, no_cuda=False, num_labels=9, num_train_epochs=3.0, output_dir='swa_sample_2', overwrite_cache=False, overwrite_output_dir=False, path_to_adapter='/tmp/test-mlm', per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', test_prediction_file='test_predictions.txt', test_result_file='test_results.txt', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
Loading features from cached file data/swa/cached_train_roberta-base_164
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Epoch:   0% 0/3 [00:00<?, ?it/s]
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:257: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
***** Running training *****
  Num examples =  2109
  Num Epochs =  3.0
  Instantaneous batch size per GPU =  32
  Total train batch size (w. parallel, distributed & accumulation) =  32
  Gradient Accumulation steps =  1
  Total optimization steps =  198.0
  warnings.warn("To get the last learning rate computed by the scheduler, "
Epoch:  33% 1/3 [00:37<01:15, 37.93s/it]
Epoch:  67% 2/3 [01:18<00:39, 39.30s/it]0it/s]
training loss 0.07202650302648544
Epoch: 100% 3/3 [02:00<00:00, 40.14s/it]4it/s]
training loss 0.11821245276927948
training loss 0.15650698980689048
 global_step = 198, average loss = 0.39521967122952145
Saving model checkpoint to swa_sample_2
Evaluate the following checkpoints:  ['swa_sample_2']
Evaluating:   0% 0/38 [00:00<?, ?it/s]There are adapters available but none are activated for the forward pass.
There are adapters available but none are activated for the forward pass.
Evaluating:   0% 0/38 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_ner_adapter.py", line 726, in <module>
    main()
  File "train_ner_adapter.py", line 685, in main
    result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
  File "train_ner_adapter.py", line 288, in evaluate
    logits = model(avg_emb)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/adapters/models/roberta/adapter_model.py", line 68, in forward
    outputs = self.roberta(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/adapters/context.py", line 108, in wrapper_func
    results = f(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/roberta/modeling_roberta.py", line 843, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (768) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1312, 768].  Tensor sizes: [1, 514]
Loading features from cached file data/swa/cached_dev_roberta-base_164
***** Running evaluation  *****
  Num examples = 300
  Batch size = 8 
cleong110 commented 1 year ago

I mean, intuitively, there's some size mismatch between the input data and the length the model expects. This seems to be happening right at the beginning of the network, and 768 or 514 are roughly the magnitude I'd expect for something like a maximum sequence length.
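
One quick way to sanity-check where numbers of that magnitude could come from is the roberta-base config (a sketch; assumes the checkpoint is downloadable or already cached):

from transformers import RobertaConfig

cfg = RobertaConfig.from_pretrained("roberta-base")
print(cfg.max_position_embeddings)  # 514 -- the "existing size" in the error
print(cfg.hidden_size)              # 768 -- the hidden dimension, not a sequence length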

cleong110 commented 1 year ago

But let's keep analyzing. Here are the input arguments for the run:

Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='data/swa/', dataset_language='ner-swa', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=True, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=0.0005, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='roberta-base', model_type='roberta', n_gpu=1, no_cuda=False, num_labels=9, num_train_epochs=3.0, output_dir='swa_sample_2', overwrite_cache=False, overwrite_output_dir=False, path_to_adapter='/tmp/test-mlm', per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', test_prediction_file='test_predictions.txt', test_result_file='test_results.txt', tokenizer_name='', warmup_steps=0, weight_decay=0.0) 
cleong110 commented 1 year ago

max_seq_length is set to 164 and per_gpu_eval_batch_size to 8; 164 * 8 = 1312.

That's interesting, because the error message says: Target sizes: [1312, 768]. Tensor sizes: [1, 514].
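
A trivial check that the arithmetic matches the first target dimension:

max_seq_length = 164
per_gpu_eval_batch_size = 8
print(max_seq_length * per_gpu_eval_batch_size)  # 1312, the first entry of "Target sizes: [1312, 768]"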

cleong110 commented 1 year ago

And then there's 768: where does that come from? I think I will run it again, but this time print out some more values.

cleong110 commented 1 year ago

OK:

avg_emb shape: torch.Size([8, 164, 768])

Also, there are 13 layers.

When I print out the shapes for each of the layers, they are all torch.Size([8, 164, 768]) as well.
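
If I flatten the batch and sequence dimensions of a tensor with that shape, I get exactly the target size from the error. That suggests (just a guess at this point) that the [8, 164, 768] embeddings are being interpreted as if they were 2-D token ids, with 1312 read as batch_size and 768 read as seq_length:

import torch

avg_emb = torch.randn(8, 164, 768)              # (batch, seq, hidden), as printed above
merged = avg_emb.reshape(-1, avg_emb.size(-1))  # merge the batch and sequence dims
print(merged.shape)                             # torch.Size([1312, 768]) -- matches "Target sizes: [1312, 768]"

# 768 > 514, so expanding the [1, 514] token_type_ids buffer to seq_length=768 fails.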

cleong110 commented 1 year ago

When I print out the model:

RobertaAdapterModel(
  (shared_parameters): ModuleDict()
  (roberta): RobertaModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict()
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (1): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (2): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (3): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (4): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (5): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (6): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (7): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (8): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (9): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (10): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
        (11): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (prefix_tuning): PrefixTuningShim(
                (prefix_gates): ModuleDict()
                (pool): PrefixTuningPool(
                  (prefix_tunings): ModuleDict()
                )
              )
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
              (adapters): ModuleDict()
              (adapter_fusion_layer): ModuleDict()
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(
              in_features=768, out_features=3072, bias=True
              (loras): ModuleDict()
            )
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(
              in_features=3072, out_features=768, bias=True
              (loras): ModuleDict()
            )
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (adapters): ModuleDict(
              (mlm): Adapter(
                (non_linearity): Activation_Function_Class(
                  (f): ReLU()
                )
                (adapter_down): Sequential(
                  (0): Linear(in_features=768, out_features=48, bias=True)
                  (1): Activation_Function_Class(
                    (f): ReLU()
                  )
                )
                (adapter_up): Linear(in_features=48, out_features=768, bias=True)
              )
            )
            (adapter_fusion_layer): ModuleDict()
          )
        )
      )
    )
    (pooler): RobertaPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
    (prefix_tuning): PrefixTuningPool(
      (prefix_tunings): ModuleDict()
    )
  )
  (heads): ModuleDict(
    (mlm): BertStyleMaskedLMHead(
      (0): Linear(in_features=768, out_features=768, bias=True)
      (1): Activation_Function_Class(
        (f): GELUActivation()
      )
      (2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (3): Linear(in_features=768, out_features=50265, bias=True)
    )
    (ner_head): TaggingHead(
      (0): Dropout(p=0.1, inplace=False)
      (1): Linear(in_features=768, out_features=9, bias=True)
    )
  )
)
There are adapters available but none are activated for the forward pass.
cleong110 commented 1 year ago

OK, so the position embeddings in the model, Embedding(514, 768, padding_idx=1), might be the source of the 514.

Note also that, per the warning in the output, none of the adapters are activated for the forward pass.
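
A couple of follow-up checks to run next (a sketch, assuming model is the RobertaAdapterModel loaded in this run; the token_type_ids buffer name is taken from modeling_roberta.py, so treat that attribute as an assumption):

emb = model.roberta.embeddings
print(emb.position_embeddings.weight.shape)  # torch.Size([514, 768]) -- the source of the 514
print(emb.token_type_ids.shape)              # torch.Size([1, 514]) -- the buffer the traceback tries to expand

# For the "none are activated" warning, adapter-transformers activates a loaded
# adapter with something like:
# model.set_active_adapters("mlm")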