lavis-nlp / jerex

PyTorch code for JEREX: Joint Entity-Level Relation Extractor
MIT License

train problem #12

Closed niningliuhen2013 closed 2 years ago

niningliuhen2013 commented 2 years ago

I don't know why this happens. I didn't edit any code before running 'python ./jerex_train.py --config-path configs/docred_joint':

C:\Users\Administrator.conda\envs\torchGPU\python.exe F:/pythonproject/jerex-main/jerex_train.py --config-path configs/docred_joint
F:/pythonproject/jerex-main/jerex_train.py:24: UserWarning: 'train' is validated against ConfigStore schema with the same name. This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2. See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  train()
datasets:
  train_path: ./data/datasets/docred_joint/train_joint.json
  valid_path: ./data/datasets/docred_joint/dev_joint.json
  test_path: null
  types_path: ./data/datasets/docred_joint/types.json
model:
  model_type: joint_multi_instance
  encoder_path: bert-base-cased
  tokenizer_path: bert-base-cased
  mention_threshold: 0.85
  coref_threshold: 0.85
  rel_threshold: 0.6
  prop_drop: 0.1
  meta_embedding_size: 25
  size_embeddings_count: 30
  ed_embeddings_count: 300
  token_dist_embeddings_count: 700
  sentence_dist_embeddings_count: 50
  position_embeddings_count: 700
sampling:
  neg_mention_count: 200
  neg_coref_count: 200
  neg_relation_count: 200
  max_span_size: 10
  sampling_processes: 8
  neg_mention_overlap_ratio: 0.5
  lowercase: false
loss:
  mention_weight: 1.0
  coref_weight: 1.0
  entity_weight: 0.25
  relation_weight: 1.0
inference:
  valid_batch_size: 1
  test_batch_size: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
training:
  batch_size: 1
  min_epochs: 20
  max_epochs: 20
  lr: 5.0e-05
  lr_warmup: 0.1
  weight_decay: 0.01
  max_grad_norm: 1.0
  accumulate_grad_batches: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
distribution:
  gpus:

[2021-12-29 20:27:07,884][numexpr.utils][INFO] - NumExpr defaulting to 6 threads.
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\train_joint.json': 100%|██████████| 3008/3008 [00:48<00:00, 61.77it/s]
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\dev_joint.json': 100%|██████████| 300/300 [00:04<00:00, 67.58it/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing JointMultiInstanceModel: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:/pythonproject/jerex-main/jerex_train.py", line 20, in train
    model.train(cfg)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 341, in train
    trainer.fit(model, datamodule=data_module)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 499, in fit
    self.dispatch()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 670, in run_train
    self.train_loop.on_train_end()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 247, in save_checkpoint
    self._validate_monitor_key(trainer)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 495, in _validate_monitor_key
    raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='valid_f1') not found in the returned metrics: ['train_mention_loss', 'train_coref_loss', 'train_entity_loss', 'train_rel_loss', 'train_loss']. HINT: Did you call self.log('valid_f1', value) in the LightningModule?

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:   0%| | 3/3308 [00:21<6:39:12, 7.25s/it, loss=69.4, v_num=0_0]
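For context: the MisconfigurationException here is secondary. ModelCheckpoint(monitor='valid_f1') can only find that key after the LightningModule has logged it during a validation pass; training died in epoch 0 before the first validation run, so only the train_* losses exist and the checkpoint callback fails on shutdown, masking the original error (the "above exception"). A minimal, self-contained sketch of that mechanism, using a toy model and data rather than anything from JEREX:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, TensorDataset


class TinyModule(pl.LightningModule):
    """Toy stand-in for JEREX's LightningModule."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        val_loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # The monitored key must be logged during validation. If training
        # crashes before the first validation pass ever runs, 'valid_f1'
        # never exists and ModelCheckpoint raises the
        # MisconfigurationException seen in the traceback above.
        self.log('valid_f1', 1.0 - val_loss)  # dummy stand-in for a real F1

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
trainer = pl.Trainer(max_epochs=1,
                     callbacks=[ModelCheckpoint(monitor='valid_f1', mode='max')])
trainer.fit(TinyModule(), data, data)
```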

markus-eberts commented 2 years ago

Hi, it seems you are running out of GPU memory:

RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB (GPU 0; 4.00 GiB total capacity; 2.46 GiB already allocated; 168.61 MiB free; 2.79 GiB reserved in total by PyTorch)

Please have a look at issue #3.
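If a larger GPU is not available, the usual knobs are the keys that appear with null defaults in the config dump above: training.max_spans, training.max_coref_pairs and training.max_rel_pairs (plus their inference counterparts), which cap how many spans and pairs are processed at once. A sketch, assuming standard Hydra override syntax; the values are illustrative only and should be tuned for your hardware (see issue #3):

```
python ./jerex_train.py --config-path configs/docred_joint \
  training.max_spans=50 training.max_coref_pairs=50 training.max_rel_pairs=50 \
  inference.max_spans=50 inference.max_coref_pairs=50 inference.max_rel_pairs=50
```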

niningliuhen2013 commented 2 years ago

I later switched to the CPU, and the same error still occurs.

Connected to pydev debugger (build 193.6494.30)
F:/pythonproject/jerex-main/jerex_train.py:24: UserWarning: 'train' is validated against ConfigStore schema with the same name. This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2. See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  train()
F:\pythonproject\jerex-main\jerex\model.py:266: UserWarning: You set 'final_valid_evaluate=True' and specified a test path. The best model will be evaluated on the dataset specified in 'test_path'.
  warnings.warn("You set 'final_valid_evaluate=True' and specified a test path. "
datasets:
  train_path: ./data/datasets/docred_joint/train_joint.json
  valid_path: ./data/datasets/docred_joint/dev_joint.json
  test_path: ./data/datasets/docred_joint/test_joint.json
  types_path: ./data/datasets/docred_joint/types.json
model:
  model_type: joint_multi_instance
  encoder_path: bert-base-cased
  tokenizer_path: bert-base-cased
  mention_threshold: 0.85
  coref_threshold: 0.85
  rel_threshold: 0.6
  prop_drop: 0.1
  meta_embedding_size: 25
  size_embeddings_count: 30
  ed_embeddings_count: 300
  token_dist_embeddings_count: 700
  sentence_dist_embeddings_count: 50
  position_embeddings_count: 700
sampling:
  neg_mention_count: 200
  neg_coref_count: 200
  neg_relation_count: 200
  max_span_size: 10
  sampling_processes: 8
  neg_mention_overlap_ratio: 0.5
  lowercase: false
loss:
  mention_weight: 1.0
  coref_weight: 1.0
  entity_weight: 0.25
  relation_weight: 1.0
inference:
  valid_batch_size: 1
  test_batch_size: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
training:
  batch_size: 1
  min_epochs: 20
  max_epochs: 20
  lr: 5.0e-05
  lr_warmup: 0.1
  weight_decay: 0.01
  max_grad_norm: 1.0
  accumulate_grad_batches: 1
  max_spans: null
  max_coref_pairs: null
  max_rel_pairs: null
distribution:
  gpus: []
  accelerator: ''
  prepare_data_per_node: false
misc:
  store_predictions: true
  store_examples: true
  flush_logs_every_n_steps: 1000
  log_every_n_steps: 1000
  deterministic: false
  seed: null
  cache_path: null
  precision: 32
  profiler: null
  final_valid_evaluate: true

[2021-12-30 12:10:55,779][numexpr.utils][INFO] - NumExpr defaulting to 6 threads.
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\train_joint.json': 100%|██████████| 3008/3008 [02:53<00:00, 17.31it/s]
Parse dataset 'F:\pythonproject\jerex-main\data\datasets\docred_joint\dev_joint.json': 100%|██████████| 300/300 [00:16<00:00, 17.97it/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing JointMultiInstanceModel: ['cls.predictions.transform.LayerNorm.bias', 'bert.pooler.dense.bias', 'cls.predictions.decoder.weight', 'bert.pooler.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 378, in _run_hydra
    lambda: hydra.run(
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 214, in run_and_report
    raise ex
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 211, in run_and_report
    return func()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\utils.py", line 381, in <lambda>
    overrides=args.overrides,
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\_internal\hydra.py", line 111, in run
    _ = ret.return_value
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\core\utils.py", line 233, in return_value
    raise self._return_value
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\hydra\core\utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "F:/pythonproject/jerex-main/jerex_train.py", line 20, in train
    model.train(cfg)
  File "F:\pythonproject\jerex-main\jerex\model.py", line 341, in train
    trainer.fit(model, datamodule=data_module)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 499, in fit
    self.dispatch()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 670, in run_train
    self.train_loop.on_train_end()
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 247, in save_checkpoint
    self._validate_monitor_key(trainer)
  File "C:\Users\Administrator.conda\envs\torchGPU\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 495, in _validate_monitor_key
    raise MisconfigurationException(m)
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='valid_f1') not found in the returned metrics: ['train_mention_loss', 'train_coref_loss', 'train_entity_loss', 'train_rel_loss', 'train_loss']. HINT: Did you call self.log('valid_f1', value) in the LightningModule?

markus-eberts commented 2 years ago

This is not the same exception as before (-> 'The expanded size of the tensor (562) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 562]. Tensor sizes: [1, 512]'). I'm a bit puzzled by this one. Are you using the same library versions as specified in the README?
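As an aside, the quoted 562-vs-512 mismatch is the shape error transformers 4.x produces when a sequence longer than bert-base-cased's 512 positions reaches the embedding layer. A small repro outside JEREX, as an illustration of the error class only (JEREX itself extends the encoder's position embeddings, see position_embeddings_count: 700 in the config above, so where it arises inside JEREX may differ):

```python
from transformers import BertModel, BertTokenizerFast

# Illustration only: push >512 tokens through an unmodified bert-base-cased.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')

# 560 single-token words + [CLS] + [SEP] = 562 tokens, past the 512
# positions the checkpoint was trained with.
enc = tokenizer('word ' * 560, return_tensors='pt', truncation=False)
out = model(input_ids=enc['input_ids'])
# -> RuntimeError: The expanded size of the tensor (562) must match the
#    existing size (512) at non-singleton dimension 1. ...
# (Exact message depends on the transformers version.)
```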

niningliuhen2013 commented 2 years ago

I was indeed not using the same library versions. It runs successfully with the following versions:

torch==1.8.1
numpy==1.18.1
pytorch-lightning==1.2.7
scikit-learn==1.0.2
tqdm==4.43.0
hydra-core==1.0.6
transformers[sentencepiece]==4.5.1
Jinja2==2.11.3

niningliuhen2013 commented 2 years ago

Hi, I have another question. How many epochs were run to get the results in the paper? I am confused about how to reproduce the results in Table 3: if only the 'relation_classification_multi_instance' model needs to be trained on the original DocRED dataset to obtain them, then what role do 'mention_localization', 'coreference_resolution' and 'entity_classification' play for 'relation_classification_multi_instance'? I am also confused about how to obtain the results in Table 4, because they are not the same as the results in Table 1. Looking forward to your reply, thanks!

markus-eberts commented 2 years ago

Hi, you need to use 'mention_localization', 'coreference_resolution', 'entity_classification' and 'relation_classification_multi_instance' to get the results of Table 4 (right column, separate models). To compute the left column of Table 4 (joint model), you need to train the joint multi-instance model and evaluate it using the four models/tasks mentioned above (this way, the jointly trained model is evaluated on the separate tasks, given ground-truth annotations from the previous steps, as explained in the paper). For Table 3, you should use the 'relation_classification_multi_instance' model on the original split and submit the results via CodaLab.
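For anyone reproducing this: the task-specific model is selected via the model.model_type key visible in the config dumps above. Assuming standard Hydra overrides, the four separate evaluations would be launched roughly along these lines (hypothetical invocations; the repository ships dedicated config files, so check the README for the exact paths and values):

```
python ./jerex_train.py --config-path configs/docred_joint model.model_type=mention_localization
python ./jerex_train.py --config-path configs/docred_joint model.model_type=coreference_resolution
python ./jerex_train.py --config-path configs/docred_joint model.model_type=entity_classification
python ./jerex_train.py --config-path configs/docred_joint model.model_type=relation_classification_multi_instance
```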

niningliuhen2013 commented 2 years ago

Thank you so much, I got it!