veronica320 / Zeroshot-Event-Extraction

Repository for ACL2021 paper: <Zero-shot Event Extraction via Transfer Learning: Challenges and Insights>.

Potential bug that causes IndexError in inference process #3

Open btyu opened 2 years ago

btyu commented 2 years ago

Hi! Thank you for your excellent work and the well-organized codebase.

I am trying to run the inference pipeline, but it raises an IndexError from source/utils/srl.py. I also hit another IndexError in the SRL step and worked around it by changing the SRL code. I am not sure whether the two errors are related, or whether I obtained the SRL output the right way. Details follow.

IndexError in Event Extraction

I ran source/predict_evaluate.py on the test set and got the following IndexError:

109 Jalal Jamil, a 45-year-old jewellery store owner, said the situation just keeps getting worse.
Traceback (most recent call last):
  File "source/predict_evaluate.py", line 71, in <module>
    pred_events = model.predict(instance)
  File "/data/Git_base/Zeroshot-Event-Extraction/source/model.py", line 128, in predict
    self.srl_consts)
  File "/data/Git_base/Zeroshot-Event-Extraction/source/utils/srl.py", line 152, in get_srl_results
    text_piece = ' '.join([verb_srl_tokens[i] for i, tag in enumerate(res['tags']) if
  File "/data/Git_base/Zeroshot-Event-Extraction/source/utils/srl.py", line 153, in <listcomp>
    overlap(tag, srl_consts_for_trg)])
IndexError: list index out of range

And this is the error line: https://github.com/veronica320/Zeroshot-Event-Extraction/blob/24bb003a31827f41367daf5cffe0b4521d741da3/source/utils/srl.py#L153

Is this a bug? For your convenience, I have attached the corresponding files, which contain only the problematic sample: sample.zip. I am not sure whether the problem is related to another IndexError I hit in SRL, which I describe in the next section.

IndexError in SRL

I use the SRL code you referred to; these are the commands I ran to process the samples with nominal_sense_srl and verb_sense_srl, respectively.

allennlp predict nom-sense-srl/model.tar.gz ../Zeroshot-Event-Extraction/data/ACE_converted/test.event.json --output-file ../Zeroshot-Event-Extraction/data/SRL_output/nomSRL_test.jsonl --predictor "all-nombank-sense-srl" --include-package nominal_sense_srl

allennlp predict verb-sense-srl/model.tar.gz ../Zeroshot-Event-Extraction/data/ACE_converted/test.event.json --output-file ../Zeroshot-Event-Extraction/data/SRL_output/verbSRL_test.jsonl --predictor "sense-semantic-role-labeling" --include-package verb_sense_srl

I am not sure whether the above commands are correct, so please let me know if they are not. nominal_sense_srl works well, but verb_sense_srl fails with the following IndexError:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/zsee/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/opt/anaconda3/envs/zsee/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/opt/anaconda3/envs/zsee/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/opt/anaconda3/envs/zsee/lib/python3.7/site-packages/allennlp/commands/predict.py", line 227, in _predict
    manager.run()
  File "/opt/anaconda3/envs/zsee/lib/python3.7/site-packages/allennlp/commands/predict.py", line 206, in run
    for model_input_json, result in zip(batch_json, self._predict_json(batch_json)):
  File "/opt/anaconda3/envs/zsee/lib/python3.7/site-packages/allennlp/commands/predict.py", line 151, in _predict_json
    results = [self._predictor.predict_json(batch_data[0])]
  File "./verb_sense_srl/predictor.py", line 257, in predict_json
    instances = self._sentence_to_srl_instances(inputs)
  File "./verb_sense_srl/predictor.py", line 141, in _sentence_to_srl_instances
    return self.tokens_to_instances(tokens)
  File "./verb_sense_srl/predictor.py", line 112, in tokens_to_instances
    instance = self._dataset_reader.text_to_instance(tokens, verb_labels)
  File "./verb_sense_srl/reader.py", line 342, in text_to_instance
    verb = tokens[verb_index].text
IndexError: list index out of range
2022-04-15 16:01:51,462 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmp1ajdnkx4

And this is the error line: https://github.com/CogComp/SRL-English/blob/33278a6590b9dd6652a7ed55cc313b42c3fb3f2a/verb_sense_srl/reader.py#L342

I noticed that the failing sample index is 378, which is not the sample that causes the IndexError in EE, so the first IndexError is unlikely to be related to this one. Altogether, three samples in the test set cause this error. I worked around it by setting verb_index and verb to None when the IndexError is raised, but I am not sure what side effects that has.

Could you please check the possible bugs above? Thank you!

btyu commented 2 years ago

Looking forward to your reply if you get time! Thank you!

veronica320 commented 2 years ago

Hi, thanks for your interest and so sorry for the long wait! I was occupied with something else in the past few weeks. I looked at the error, and I think both errors are because of a bug in CogComp/SRL-English.

Specifically, the first error is actually at this line (about the iteration, not the overlap() function):

text_piece = ' '.join([verb_srl_tokens[i] for i, tag in enumerate(res['tags']) if overlap(tag, srl_consts_for_trg)])

Here it's basically concatenating tokens in the input sentence if their SRL tag is in some pre-specified set (in the srl_consts field in config.json). So it expects that the number of tokens (verb_srl_tokens) and the number of SRL tags (res['tags']) should be the same. But they aren't the same for this example, if printed out:

print(len(verb_srl_tokens))
print(len(res["tags"]))

Output:

19
21

Both variables come from the SRL output. verb_srl_tokens is the "words" field:

"words": ["Jalal", "Jamil", ",", "a", "45-year", "-", "old", "jewellery", "store", "owner", ",", "said", "the", "situation", "just", "keeps", "getting", "worse", "."]

and res["tags"] is the "tags" field in the first element of the "verbs" list ("verb": "situation"):

"tags": ["B-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "O"]

Our code expects that these two lists should be of the same length, but here it's not the case in the SRL output. That's what caused the error.
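For anyone who wants the pipeline to skip such samples instead of crashing, here is a minimal sketch of a length guard around that join. It is not the repository's code: safe_text_piece is a hypothetical helper, and the overlap() below is only a stand-in that assumes the real function checks whether a tag's role label is in the configured srl_consts set.

```python
from typing import List, Optional, Sequence


def overlap(tag: str, consts: Sequence[str]) -> bool:
    # Hypothetical stand-in for the repo's overlap(): strips the BIO prefix
    # ("B-ARG1" -> "ARG1") and checks membership in the configured set.
    return tag.split("-", 1)[-1] in consts


def safe_text_piece(
    tokens: List[str], tags: List[str], consts: Sequence[str]
) -> Optional[str]:
    # Hypothetical helper: return None when the SRL output is inconsistent,
    # so the caller can skip this frame instead of hitting an IndexError.
    if len(tokens) != len(tags):
        return None
    return " ".join(tok for tok, tag in zip(tokens, tags) if overlap(tag, consts))
```

Returning None rather than truncating to the shorter list is a deliberate choice in this sketch: when the lengths differ, the tags are probably misaligned with the tokens, so any text piece built from them could be wrong.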

We apologize for the inconvenience, but it seems that CogComp/SRL-English has undergone some changes since we published our paper. Since it's not maintained by us, could you please raise this issue (basically, different lengths of "tags" and "words" in the output) in that repo directly?
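When reporting this upstream, it may help to list every affected sample. The following is a rough sketch, assuming each line of the SRL output file is one JSON object with a "words" list and a "verbs" list whose elements carry "verb" and "tags" fields, as in the example above. find_length_mismatches is a hypothetical helper name, not part of either repository.

```python
import json
from typing import Iterable, Iterator, Tuple


def find_length_mismatches(
    lines: Iterable[str],
) -> Iterator[Tuple[int, str, int, int]]:
    # Yields (0-based line index, predicate, number of words, number of tags)
    # for every frame whose tag list is not the same length as "words".
    for idx, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        n_words = len(record.get("words", []))
        for frame in record.get("verbs", []):
            n_tags = len(frame.get("tags", []))
            if n_tags != n_words:
                yield idx, frame.get("verb", ""), n_words, n_tags
```

To use it, open the SRL output file (for example the verbSRL_test.jsonl or nomSRL_test.jsonl produced by the commands earlier in this thread) and iterate over find_length_mismatches(f), printing each tuple.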

Thanks for your understanding and please let me know if you have other questions.

evelinamorim commented 1 year ago

I changed the SRL code. However, I am not sure the change is correct, so I reported the issue and the modification I made here.

veronica320 commented 1 year ago

Hi @evelinamorim, thanks for your interest! While waiting for the response from the SRL authors, you are welcome to look at the updated notes in our README on how to resolve a known inconsistency from the SRL system (please refer to "UPDATE (10/28/2022)" under "Getting the SRL output"), and see if this helps with your issue.