Closed: niningliuhen2013 closed this issue 2 years ago.
Hi, it seems you are running out of GPU memory:

RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB (GPU 0; 4.00 GiB total capacity; 2.46 GiB already allocated; 168.61 MiB free; 2.79 GiB reserved in total by PyTorch)

Please have a look at issue #3.
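For other readers hitting the same OOM: the config dumps further down in this thread expose several knobs that bound memory use during training, in particular training.max_spans, training.max_coref_pairs and training.max_rel_pairs (all null, i.e. unlimited, by default) and the negative sample counts under sampling. A sketch of what a reduced configuration could look like; the concrete numbers are only illustrative, and lowering the sampling counts may also affect accuracy:

```yaml
# Illustrative values only, not settings recommended by the authors.
# Lowering the sampling counts trades memory for training signal.
training:
  batch_size: 1
  accumulate_grad_batches: 1
  max_spans: 100        # null (unlimited) in the default config
  max_coref_pairs: 100  # null (unlimited) in the default config
  max_rel_pairs: 100    # null (unlimited) in the default config
sampling:
  neg_mention_count: 50   # 200 in the default config
  neg_coref_count: 50     # 200 in the default config
  neg_relation_count: 50  # 200 in the default config
```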
I switched to CPU later, and an error still occurs.
Connected to pydev debugger (build 193.6494.30)
F:/pythonproject/jerex-main/jerex_train.py:24: UserWarning: 'train' is validated against ConfigStore schema with the same name. This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2. See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  train()
F:\pythonproject\jerex-main\jerex\model.py:266: UserWarning: You set 'final_valid_evaluate=True' and specified a test path. The best model will be evaluated on the dataset specified in 'test_path'.
  warnings.warn("You set 'final_valid_evaluate=True' and specified a test path. "
datasets:
  train_path: ./data/datasets/docred_joint/train_joint.json
  valid_path: ./data/datasets/docred_joint/dev_joint.json
  test_path: ./data/datasets/docred_joint/test_joint.json
  types_path: ./data/datasets/docred_joint/types.json
model:
  model_type: joint_multi_instance
  encoder_path: bert-base-cased
  tokenizer_path: bert-base-cased
  mention_threshold: 0.85
  coref_threshold: 0.85
  rel_threshold: 0.6
  prop_drop: 0.1
  meta_embedding_size: 25
  size_embeddings_count: 30
  ed_embeddings_count: 300
  token_dist_embeddings_count: 700
  sentence_dist_embeddings_count: 50
  position_embeddings_count: 700
sampling:
  neg_mention_count: 200
  neg_coref_count: 200
  neg_relation_count: 200
  max_span_size: 10
  sampling_processes: 8
  neg_mention_overlap_ratio: 0.5
  lowercase: false
loss:
  mention_weight: 1.0
  coref_weight: 1.0
  entity_weight: 0.25
  relation_weight: 1.0
inference:
  valid_batch_size: 1
  test_batch_size: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
training:
  batch_size: 1
  min_epochs: 20
  max_epochs: 20
  lr: 5.0e-05
  lr_warmup: 0.1
  weight_decay: 0.01
  max_grad_norm: 1.0
  accumulate_grad_batches: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
distribution:
  gpus: []
  accelerator: ''
  prepare_data_per_node: false
misc:
  store_predictions: true
  store_examples: true
  flush_logs_every_n_steps: 1000
  log_every_n_steps: 1000
  deterministic: false
  seed: null
  cache_path: null
  precision: 32
  profiler: null
  final_valid_evaluate: true
[2021-12-30 12:10:55,779][numexpr.utils][INFO] - NumExpr defaulting to 6 threads.
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\train_joint.json': 100%|██████████| 3008/3008 [02:53<00:00, 17.31it/s]
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\dev_joint.json': 100%|██████████| 300/300 [00:16<00:00, 17.97it/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing JointMultiInstanceModel: ['cls.predictions.transform.LayerNorm.bias', 'bert.pooler.dense.bias', 'cls.predictions.decoder.weight', 'bert.pooler.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
This IS NOT expected if you are initializing JointMultiInstanceModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of JointMultiInstanceModel were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['coreference_resolution.coref_linear.weight', 'relation_classification.token_distance_embeddings.weight', 'coreference_resolution.coref_linear.bias', 'mention_localization.linear.weight', 'entity_classification.entity_classifier.weight', 'coreference_resolution.coref_classifier.weight', 'relation_classification.rel_classifier.weight', 'coreference_resolution.coref_ed_embeddings.weight', 'relation_classification.sentence_distance_embeddings.weight', 'mention_localization.mention_classifier.bias', 'relation_classification.rel_classifier.bias', 'relation_classification.rel_linear.weight', 'entity_classification.entity_classifier.bias', 'mention_localization.mention_classifier.weight', 'coreference_resolution.coref_classifier.bias', 'entity_classification.linear.weight', 'relation_classification.entity_type_embeddings.weight', 'mention_localization.size_embeddings.weight', 'mention_localization.linear.bias', 'relation_classification.rel_linear.bias', 'relation_classification.pair_linear.bias', 'relation_classification.pair_linear.weight', 'entity_classification.linear.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\utilities\distributed.py:68: UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
  warnings.warn(*args, **kwargs)
113 M     Trainable params
0         Non-trainable params
113 M     Total params
455.954   Total estimated model params size (MB)
Epoch 0:   3%|▎ | 83/3308 [08:56<5:47:29, 6.46s/it, loss=66.5, v_num=0_0]
Error executing job with overrides: []
Traceback (most recent call last):
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 433, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\core\lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\core\optimizer.py", line 214, in step
    self.optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\core\optimizer.py", line 134, in optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\optim\lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\transformers\optimization.py", line 321, in step
    loss = closure()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 649, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 742, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 125, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 114, in training_step
    outputs = self(batch)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 106, in forward
    max_rel_pairs=max_rel_pairs, inference=inference)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\models\joint_models.py", line 142, in forward
    return self._forward_train(*args, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\models\joint_models.py", line 188, in _forward_train
    max_spans=max_spans)
  File "F:\pythonproject\jerex-main\jerex\models\joint_models.py", line 57, in _forward_train_common
    h = self.bert(input_ids=encodings, attention_mask=context_masks)['last_hidden_state']
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\transformers\models\bert\modeling_bert.py", line 957, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (562) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 562]. Tensor sizes: [1, 512]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 378, in _run_hydra
    lambda: hydra.run(
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 214, in run_and_report
    raise ex
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 211, in run_and_report
    return func()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 381, in <lambda>
This is not the same exception as before (now it is "RuntimeError: The expanded size of the tensor (562) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 562]. Tensor sizes: [1, 512]"). I'm a bit puzzled about this one. Are you using the same library versions as specified in the README?
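In case it helps with debugging: the traceback itself says the encoded document is 562 sub-word tokens long, while the buffered token_type_ids of bert-base-cased only cover 512 positions (the model's position-embedding limit). I can't say why the pinned library versions behave differently here, but a stand-alone check (not part of JEREX) of whether a document exceeds the encoder limit after sub-word tokenization could look like this:

```python
from transformers import AutoConfig, AutoTokenizer

# Stand-alone sanity check (not JEREX code): compare a document's
# sub-word length against the encoder's position-embedding limit.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
config = AutoConfig.from_pretrained("bert-base-cased")
print("max positions:", config.max_position_embeddings)  # 512 for bert-base-cased

# 'doc_tokens' stands in for the whitespace tokens of one DocRED document.
doc_tokens = ["Example", "document", "tokens", "..."]
encoding = tokenizer(doc_tokens, is_split_into_words=True, add_special_tokens=True)
n_subwords = len(encoding["input_ids"])
if n_subwords > config.max_position_embeddings:
    print(f"{n_subwords} sub-word tokens exceed the {config.max_position_embeddings}-position limit")
```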
Indeed, I did not use the same library versions at first. But it ran successfully once I switched to the following versions:
torch==1.8.1
numpy==1.18.1
pytorch-lightning==1.2.7
scikit-learn==1.0.2
tqdm==4.43.0
hydra-core==1.0.6
transformers[sentencepiece]==4.5.1
Jinja2==2.11.3
Hi, I have another question. How many epochs were run to get the results in the paper? I am also confused about how to reproduce Table 3: if you only need to train the "relation_classification_multi_instance" model on the original DocRED dataset to get the results of Table 3, then what are "mention_localization", "coreference_resolution" and "entity_classification" used for in that setting? I am likewise confused about how to get the results in Table 4, because they are not the same as the results in Table 1. Looking forward to your reply, thanks!
Hi, you need to use the "mention_localization", "coreference_resolution", "entity_classification" and "relation_classification_multi_instance" models to get the results of Table 4 (right column: separate models). To compute the left column of Table 4 (joint model), you need to train the joint multi-instance model and evaluate it using the four models/tasks mentioned above (this way, the jointly trained model is evaluated on the separate tasks, given ground-truth annotations from the previous steps, as explained in the paper). For Table 3, you should use the "relation_classification_multi_instance" model on the original split and submit the results via CodaLab.
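As a pointer for anyone reproducing this: the switch between these settings happens via model.model_type in the config. The sketch below is not a shipped config; the model_type values are taken from the module names visible in the logs in this thread, so check the example configs under configs/ for the exact files and the matching dataset paths:

```yaml
# Sketch only: train one model per sub-task for the "separate models"
# setting, or joint_multi_instance for the joint model.
model:
  model_type: relation_classification_multi_instance
  # other values seen in this thread: mention_localization,
  # coreference_resolution, entity_classification, joint_multi_instance
  encoder_path: bert-base-cased
  tokenizer_path: bert-base-cased
```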
Thank you so much, I got it!
I don't know why this happens. I didn't edit any code when running 'python ./jerex_train.py --config-path configs/docred_joint'.

C:\Users\Administrator.conda\envs\torchGPU\python.exe F:/pythonproject/jerex-main/jerex_train.py --config-path configs/docred_joint
F:/pythonproject/jerex-main/jerex_train.py:24: UserWarning: 'train' is validated against ConfigStore schema with the same name. This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2. See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  train()
datasets:
  train_path: ./data/datasets/docred_joint/train_joint.json
  valid_path: ./data/datasets/docred_joint/dev_joint.json
  test_path: null
  types_path: ./data/datasets/docred_joint/types.json
model:
  model_type: joint_multi_instance
  encoder_path: bert-base-cased
  tokenizer_path: bert-base-cased
  mention_threshold: 0.85
  coref_threshold: 0.85
  rel_threshold: 0.6
  prop_drop: 0.1
  meta_embedding_size: 25
  size_embeddings_count: 30
  ed_embeddings_count: 300
  token_dist_embeddings_count: 700
  sentence_dist_embeddings_count: 50
  position_embeddings_count: 700
sampling:
  neg_mention_count: 200
  neg_coref_count: 200
  neg_relation_count: 200
  max_span_size: 10
  sampling_processes: 8
  neg_mention_overlap_ratio: 0.5
  lowercase: false
loss:
  mention_weight: 1.0
  coref_weight: 1.0
  entity_weight: 0.25
  relation_weight: 1.0
inference:
  valid_batch_size: 1
  test_batch_size: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
training:
  batch_size: 1
  min_epochs: 20
  max_epochs: 20
  lr: 5.0e-05
  lr_warmup: 0.1
  weight_decay: 0.01
  max_grad_norm: 1.0
  accumulate_grad_batches: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
distribution:
  gpus:
[2021-12-29 20:27:07,884][numexpr.utils][INFO] - NumExpr defaulting to 6 threads.
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\train_joint.json': 100%|██████████| 3008/3008 [00:48<00:00, 61.77it/s]
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\dev_joint.json': 100%|██████████| 300/300 [00:04<00:00, 67.58it/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing JointMultiInstanceModel: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
This IS NOT expected if you are initializing JointMultiInstanceModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of JointMultiInstanceModel were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['coreference_resolution.coref_classifier.bias', 'entity_classification.entity_classifier.weight', 'coreference_resolution.coref_linear.bias', 'entity_classification.linear.bias', 'relation_classification.token_distance_embeddings.weight', 'mention_localization.linear.weight', 'entity_classification.linear.weight', 'mention_localization.linear.bias', 'relation_classification.entity_type_embeddings.weight', 'relation_classification.rel_classifier.bias', 'entity_classification.entity_classifier.bias', 'coreference_resolution.coref_linear.weight', 'relation_classification.rel_linear.weight', 'relation_classification.sentence_distance_embeddings.weight', 'mention_localization.mention_classifier.weight', 'relation_classification.rel_classifier.weight', 'coreference_resolution.coref_classifier.weight', 'relation_classification.pair_linear.weight', 'mention_localization.mention_classifier.bias', 'mention_localization.size_embeddings.weight', 'coreference_resolution.coref_ed_embeddings.weight', 'relation_classification.pair_linear.bias', 'relation_classification.rel_linear.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name  | Type                    | Params
0 | model | JointMultiInstanceModel | 113 M
113 M     Trainable params
0         Non-trainable params
113 M     Total params
455.954   Total estimated model params size (MB)
Epoch 0:   0%| | 3/3308 [00:20<6:22:32, 6.94s/it, loss=69.4, v_num=0_0]
Error executing job with overrides: []
Traceback (most recent call last):
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 433, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\core\lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\core\optimizer.py", line 214, in step
    self.optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\core\optimizer.py", line 134, in optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\optim\lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\transformers\optimization.py", line 321, in step
    loss = closure()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 649, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 742, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 125, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 114, in training_step
    outputs = self(batch)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 106, in forward
    max_rel_pairs=max_rel_pairs, inference=inference)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\models\joint_models.py", line 142, in forward
    return self._forward_train(*args, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\models\joint_models.py", line 198, in _forward_train
    rel_entity_types, max_pairs=max_rel_pairs)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\pythonproject\jerex-main\jerex\models\modules\relation_classification_multi_instance.py", line 49, in forward
    chunk_rel_sentence_distances, mention_reprs, chunk_h)
  File "F:\pythonproject\jerex-main\jerex\models\modules\relation_classification_multi_instance.py", line 73, in _create_mention_pair_representations
    rel_ctx = m + h
RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB (GPU 0; 4.00 GiB total capacity; 2.46 GiB already allocated; 168.61 MiB free; 2.79 GiB reserved in total by PyTorch)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "F:/pythonproject/jerex-main/jerex_train.py", line 20, in train
    model.train(cfg)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 341, in train
    trainer.fit(model, datamodule=data_module)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 499, in fit
    self.dispatch()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 670, in run_train
    self.train_loop.on_train_end()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 247, in save_checkpoint
    self._validate_monitor_key(trainer)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 495, in _validate_monitor_key
    raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='valid_f1') not found in the returned metrics: ['train_mention_loss', 'train_coref_loss', 'train_entity_loss', 'train_rel_loss', 'train_loss']. HINT: Did you call self.log('valid_f1', value) in the LightningModule?
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:   0%| | 3/3308 [00:21<6:39:12, 7.25s/it, loss=69.4, v_num=0_0]
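One note on the last exception for readers skimming this: it is only a follow-up error. The CUDA OOM above killed epoch 0 before the first validation run, so no 'valid_f1' metric was ever logged, and the checkpoint callback that monitors it then fails in on_train_end. A minimal illustration (not JEREX code) of the coupling the HINT refers to:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# ModelCheckpoint can only monitor keys the LightningModule actually logs,
# e.g. via self.log('valid_f1', value) in the validation loop. If training
# crashes before validation ever runs, that key never exists and
# _validate_monitor_key raises the MisconfigurationException seen above.
checkpoint_cb = ModelCheckpoint(monitor='valid_f1', mode='max')
trainer = pl.Trainer(callbacks=[checkpoint_cb])
```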