lavis-nlp / spert

PyTorch code for SpERT: Span-based Entity and Relation Transformer
MIT License

How to preprocess data? #27

Closed ChloeJKim closed 3 years ago

ChloeJKim commented 4 years ago

Hi @markus-eberts

Thank you for publishing your work! I really enjoyed reading your paper and want to use your model to predict on my own dataset. Could you provide the script that pre-processes the raw ADE data into the files you mention ("ade_split_1_train.json" / "ade_split_1_test.json") under "data/datasets/ade", which are the input to the ADE model?

It would be great if you could share the script you used to turn the original ADE dataset into the input data for the ADE model.

Thank you!

markus-eberts commented 4 years ago

Hi,

thanks :)! I just added the conversion scripts in commit 0032c8fc46d1fd6966ee32db37c56e2884bd0d78. You can use the 'convert_ade.py' script to convert the ADE dataset to our input format. The 'ade_split_*' files are random splits for 10-fold cross validation.
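For readers wondering what "random splits for 10-fold cross validation" can look like in practice, here is a minimal sketch. It is not the procedure used to create the repository's split files (the thread does not describe it); the function name `k_fold_splits`, the seed, and the toy documents are all assumptions made for illustration.

```python
import random


def k_fold_splits(documents, k=10, seed=0):
    """Yield (train, test) lists for k random, non-overlapping test folds.

    Hypothetical helper: not taken from the SpERT repository.
    """
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)           # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]      # k roughly equal folds
    for held_out in folds:
        held_out_set = set(held_out)
        train = [documents[i] for i in indices if i not in held_out_set]
        test = [documents[i] for i in held_out]
        yield train, test


# Toy data standing in for converted documents.
docs = [{"orig_id": i} for i in range(25)]
splits = list(k_fold_splits(docs, k=10))
```

Each document ends up in exactly one test fold, so the ten test sets together cover the whole dataset once.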

ChloeJKim commented 4 years ago

Thank you for pushing the convert_ade.py script! To better understand the conversion from the raw data to the ade_split files, could you also provide the raw ADE data? I want to know exactly what the input looks like so I can adapt my dataset accordingly.

I would really appreciate your answer :)

markus-eberts commented 4 years ago

See here or here for the raw ADE dataset ('DRUG-AE.rel' in 'ADE-Corpus-V2.zip').
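To give a rough idea of what such a conversion involves, below is a sketch that turns one pipe-separated line into a token-level document with entities and relations. Several things here are assumptions rather than facts from this thread: the field order of 'DRUG-AE.rel' (id, sentence, adverse effect, two offsets, drug, two offsets), the output keys (`tokens`, `entities`, `relations`, `orig_id`), the type names, the exclusive `end` index, and the head/tail direction of the relation. The sample line is invented. Check 'convert_ade.py' and the corpus README for the authoritative details.

```python
import re


def tokenize(text):
    """Split into word and punctuation tokens (a simplistic stand-in tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_span(tokens, mention_tokens):
    """Return (start, end) token indices of the first match, end exclusive."""
    n = len(mention_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == mention_tokens:
            return start, start + n
    raise ValueError(f"mention {mention_tokens!r} not found in sentence")


def convert_line(line):
    """Convert one assumed 'id|sentence|effect|b|e|drug|b|e' line to a dict."""
    fields = line.rstrip("\n").split("|")
    doc_id, sentence, effect, drug = fields[0], fields[1], fields[2], fields[5]
    tokens = tokenize(sentence)
    # The character offsets in the line are ignored; mentions are located by text.
    effect_start, effect_end = find_span(tokens, tokenize(effect))
    drug_start, drug_end = find_span(tokens, tokenize(drug))
    return {
        "tokens": tokens,
        "entities": [
            {"type": "Adverse-Effect", "start": effect_start, "end": effect_end},
            {"type": "Drug", "start": drug_start, "end": drug_end},
        ],
        # head/tail are indices into "entities"; the direction is an assumption.
        "relations": [{"type": "Adverse-Effect", "head": 0, "tail": 1}],
        "orig_id": doc_id,
    }


# Invented example line in the assumed layout (not taken from the corpus).
doc = convert_line("12345|Aspirin caused mild nausea.|mild nausea|15|26|Aspirin|0|7")
```

A custom dataset would need the same shape: one such dictionary per sentence, collected in a JSON list.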

ChloeJKim commented 4 years ago

Thank you for your explanation! @markus-eberts One more quick question: how long did it take you to train and evaluate the ADE model?

Thanks! Chloe

ChloeJKim commented 4 years ago

I wanted to try training the model on conll04 and encountered this error (see below). Can you please tell me what I can do to fix this? Thanks in advance.

(spert) kimc26@nl002:~/spert % python ./spert.py train --config configs/example_train.conf

Config: {'label': 'conll04_train', 'model_type': 'spert', 'model_path': 'bert-base-cased', 'tokenizer_path': 'bert-base-cased', 'train_path': 'data/datasets/conll04/conll04_train.json', 'valid_path': 'data/datasets/conll04/conll04_dev.json', 'types_path': 'data/datasets/conll04/conll04_types.json', 'train_batch_size': '2', 'eval_batch_size': '1', 'neg_entity_count': '100', 'neg_relation_count': '100', 'epochs': '20', 'lr': '5e-5', 'lr_warmup': '0.1', 'weight_decay': '0.01', 'max_grad_norm': '1.0', 'rel_filter_threshold': '0.4', 'size_embedding': '25', 'prop_drop': '0.1', 'max_span_size': '10', 'store_predictions': 'true', 'store_examples': 'true', 'sampling_processes': '4', 'sampling_limit': '100', 'max_pairs': '1000', 'final_eval': 'true', 'log_path': 'data/log/', 'save_path': 'data/save/'}
Repeat 1 times

Iteration 0

2020-09-21 14:59:56,999 [MainThread ] [INFO ] https://s3.amazonaws.com/models. huggingface.co/bert/bert-base-cased-vocab.txt not found in cache or force_downlo ad set to True, downloading to /local/tmp/tmpa4reps1l 100%|████████████████████████████████| 213450/213450 [00:00<00:00, 905436.81B/s] 2020-09-21 14:59:57,600 [MainThread ] [INFO ] copying /local/tmp/tmpa4reps1l t o cache at /gstore/home/kimc26/.cache/torch/transformers/5e8a2b4893d13790ed4150c a1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f2774 6c6b526f6352861b1980eb80b1 2020-09-21 14:59:57,603 [MainThread ] [INFO ] creating metadata file for /gsto re/home/kimc26/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c 4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b 1980eb80b1 2020-09-21 14:59:57,605 [MainThread ] [INFO ] removing temp file /local/tmp/tm pa4reps1l 2020-09-21 14:59:57,605 [MainThread ] [INFO ] loading file https://s3.amazonaw s.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /gstore /home/kimc26/.cache/torch/transformers/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4d df62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b19 80eb80b1 2020-09-21 14:59:57,668 [MainThread ] [INFO ] Datasets: data/datasets/conll04/ conll04_train.json, data/datasets/conll04/conll04_dev.json 2020-09-21 14:59:57,669 [MainThread ] [INFO ] Model type: spert Parse dataset 'train': 100%|█████████████████| 922/922 [00:02<00:00, 316.34it/s] Parse dataset 'valid': 100%|█████████████████| 231/231 [00:00<00:00, 464.66it/s] 2020-09-21 15:00:01,153 [MainThread ] [INFO ] Relation type count: 6 2020-09-21 15:00:01,153 [MainThread ] [INFO ] Entity type count: 5 2020-09-21 15:00:01,153 [MainThread ] [INFO ] Entities: 2020-09-21 15:00:01,153 [MainThread ] [INFO ] No Entity=0 2020-09-21 15:00:01,153 [MainThread ] [INFO ] Location=1 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Organization=2 2020-09-21 15:00:01,154 
[MainThread ] [INFO ] People=3 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Other=4 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Relations: 2020-09-21 15:00:01,154 [MainThread ] [INFO ] No Relation=0 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Work for=1 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Kill=2 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Organization based in=3 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Live in=4 2020-09-21 15:00:01,154 [MainThread ] [INFO ] Located in=5 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Dataset: train 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Document count: 922 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Relation count: 1283 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Entity count: 3377 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Dataset: valid 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Document count: 231 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Relation count: 343 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Entity count: 893 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Context size: 150 2020-09-21 15:00:01,155 [MainThread ] [INFO ] Updates per epoch: 461 2020-09-21 15:00:01,156 [MainThread ] [INFO ] Updates total: 9220 2020-09-21 15:00:01,515 [MainThread ] [INFO ] https://s3.amazonaws.com/models. 
huggingface.co/bert/bert-base-cased-config.json not found in cache or force_down load set to True, downloading to /local/tmp/tmp3vr_rndt 100%|██████████████████████████████████████| 433/433 [00:00<00:00, 193741.59B/s] 2020-09-21 15:00:01,869 [MainThread ] [INFO ] copying /local/tmp/tmp3vr_rndt t o cache at /gstore/home/kimc26/.cache/torch/transformers/b945b69218e98b3e2c95acf 911789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca 38b877837e52618a67d7002391 2020-09-21 15:00:01,870 [MainThread ] [INFO ] creating metadata file for /gsto re/home/kimc26/.cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec4 3c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38b877837e52618a 67d7002391 2020-09-21 15:00:01,871 [MainThread ] [INFO ] removing temp file /local/tmp/tm p3vr_rndt 2020-09-21 15:00:01,873 [MainThread ] [INFO ] loading configuration file https ://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /gstore/home/kimc26/.cache/torch/transformers/b945b69218e98b3e2c95acf91 1789741307dec43c698d35fad11c1ae28bda352.9da767be51e1327499df13488672789394e2ca38 b877837e52618a67d7002391 2020-09-21 15:00:01,874 [MainThread ] [INFO ] Model config { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 12, "num_hidden_layers": 12, "num_labels": 2, "output_attentions": false, "output_hidden_states": false, "output_past": true, "pad_token_id": 0, "pruned_heads": {}, "torchscript": false, "type_vocab_size": 2, "use_bfloat16": false, "vocab_size": 28996 }

2020-09-21 15:00:02,222 [MainThread ] [INFO ] https://s3.amazonaws.com/models. huggingface.co/bert/bert-base-cased-pytorch_model.bin not found in cache or forc e_download set to True, downloading to /local/tmp/tmpkjbim48c 100%|████████████████████████| 435779157/435779157 [00:08<00:00, 48749401.33B/s] 2020-09-21 15:00:11,651 [MainThread ] [INFO ] copying /local/tmp/tmpkjbim48c t o cache at /gstore/home/kimc26/.cache/torch/transformers/35d8b9d36faaf46728a0192 d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d742 9f12d543d80bfc3ad70de55ac2 2020-09-21 15:00:12,267 [MainThread ] [INFO ] creating metadata file for /gsto re/home/kimc26/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490c d6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3a d70de55ac2 2020-09-21 15:00:12,269 [MainThread ] [INFO ] removing temp file /local/tmp/tm pkjbim48c 2020-09-21 15:00:12,410 [MainThread ] [INFO ] loading weights file https://s3. amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /gstore/home/kimc26/.cache/torch/transformers/35d8b9d36faaf46728a0192d8 2bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f 12d543d80bfc3ad70de55ac2 2020-09-21 15:00:18,350 [MainThread ] [INFO ] Weights of SpERT not initialized from pretrained model: ['rel_classifier.weight', 'rel_classifier.bias', 'entity _classifier.weight', 'entity_classifier.bias', 'size_embeddings.weight'] 2020-09-21 15:00:18,350 [MainThread ] [INFO ] Weights from pretrained model no t used in SpERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weigh t', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'c ls.seq_relationship.weight', 'cls.seqrelationship.bias', 'cls.predictions.trans form.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] 2020-09-21 15:00:18,360 [MainThread ] [INFO ] Train epoch: 0 Train epoch 0: 0%| | 0/461 [00:00<?, ?it/s] 
/gstore/home/kimc26/.conda/envs/spert/lib/python3.8/site-packages/transformers/optimization.py:146: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1595629411241/work/torch/csrc/utils/python_arg_parser.cpp:766.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
Train epoch 0: 0%| | 0/461 [02:43<?, ?it/s]
Process ForkProcess-1:
Traceback (most recent call last):
  File "/gstore/home/kimc26/.conda/envs/spert/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/gstore/home/kimc26/.conda/envs/spert/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "./spert.py", line 11, in __train
    trainer.train(train_path=run_args.train_path, valid_path=run_args.valid_path,
  File "/gstore/home/kimc26/spert/spert/spert_trainer.py", line 116, in train
    self._train_epoch(model, compute_loss, optimizer, train_dataset, updates_epoch, epoch)
  File "/gstore/home/kimc26/spert/spert/spert_trainer.py", line 197, in _train_epoch
    batch_loss = compute_loss.compute(entity_logits=entity_logits, rel_logits=rel_logits,
  File "/gstore/home/kimc26/spert/spert/loss.py", line 49, in compute
    self._optimizer.step()
  File "/gstore/home/kimc26/.conda/envs/spert/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/gstore/home/kimc26/.conda/envs/spert/lib/python3.8/site-packages/transformers/optimization.py", line 147, in step
    exp_avg_sq.mul_(beta2).addcmul_(1.0 - beta2, grad, grad)
  File "/gstore/home/kimc26/.conda/envs/spert/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5311) is killed by signal: Killed.
(spert) kimc26@nl002:~/spert %

markus-eberts commented 4 years ago

Looks like you are running out of memory. How much memory does your computer have and are you training on CPU or GPU?

ChloeJKim commented 4 years ago

Using the GPU fixed my problem! I really appreciate your fast reply @markus-eberts. If you don't mind, could you also provide a link to the conll04 & scierc raw datasets?

Thank you :)

markus-eberts commented 4 years ago

There you go: CoNLL04 SciERC (processed dataset)

ChloeJKim commented 4 years ago

@markus-eberts Thank you for providing the dataset.

Can I ask how you did the random splits for 10-fold cross-validation on the ADE dataset, and also why you used 10-fold cross-validation on ADE while the other datasets (conll04, scierc) are simply split into train, test, dev, and train_dev?

I want to re-train the ADE model with my own dataset and was wondering whether I should base the ADE config on the CoNLL04 example_train.conf.


Like this (below): should I keep the parameters and change the train and valid paths to the random splits, or just use train and dev .json files?

[1]
label = ade_train
model_type = spert
model_path = bert-base-cased
tokenizer_path = bert-base-cased
train_path = data/datasets/ade/ade_train.json #change this?
valid_path = data/datasets/ade/ade_dev.json #change this?
types_path = data/datasets/ade/ade_types.json
train_batch_size = 2
eval_batch_size = 1
neg_entity_count = 100
neg_relation_count = 100
epochs = 20
lr = 5e-5
lr_warmup = 0.1
weight_decay = 0.01
max_grad_norm = 1.0
rel_filter_threshold = 0.4
size_embedding = 25
prop_drop = 0.1
max_span_size = 10
store_predictions = true
store_examples = true
sampling_processes = 4
sampling_limit = 100
max_pairs = 1000
final_eval = true
log_path = data/log/
save_path = data/save/
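If one did want to train on each random split in turn, one possible approach (an assumption, since the actual answer was given outside this thread) is to keep every hyperparameter and override only `label`, `train_path`, and `valid_path` per fold. The sketch below does that on the config text; the helper name `with_overrides`, the shortened base config, and the use of the split's test file as `valid_path` are illustrative choices, and the file names follow the "ade_split_1_train.json" / "ade_split_1_test.json" pattern mentioned earlier in the thread.

```python
def with_overrides(config_text, overrides):
    """Return config_text with 'key = value' lines replaced for keys in overrides.

    Hypothetical helper; lines without '=' (such as section headers) pass through.
    """
    out = []
    for line in config_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in overrides:
            out.append(f"{key} = {overrides[key]}")
        else:
            out.append(line)
    return "\n".join(out)


# Shortened stand-in for the full config shown above.
base_config = """[1]
label = ade_train
train_path = data/datasets/ade/ade_train.json
valid_path = data/datasets/ade/ade_dev.json
types_path = data/datasets/ade/ade_types.json
epochs = 20"""

fold = 1  # which split to use; the numbering range is an assumption
fold_config = with_overrides(base_config, {
    "label": f"ade_split_{fold}",
    "train_path": f"data/datasets/ade/ade_split_{fold}_train.json",
    "valid_path": f"data/datasets/ade/ade_split_{fold}_test.json",
})
```

Writing `fold_config` to a file per fold would give one config for each split while leaving the remaining parameters untouched.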

Thank you!

markus-eberts commented 3 years ago

Clarified via e-mail.