mandarjoshi90 / coref

BERT for Coreference Resolution
Apache License 2.0

Questions on Table 2 of BERT paper #19

Open HaixiaChai opened 5 years ago

HaixiaChai commented 5 years ago
  1. Table 2 shows results for many systems on GAP. Could I ask whether these are on the GAP dev set or the test set?
  2. I couldn't reproduce the c2f_coref result, and I am not sure what is wrong with my files or parameters. Did you also use gap_to_jsonlines.py and to_gap_tsv.py for the c2f_coref system? Do you pass a tokenizer to gap_to_jsonlines.py or not? And what doc_key do you set for each sample in the JSON, since it is required to be one of the genres?

Thank you in advance.

mandarjoshi90 commented 5 years ago

Sorry about the late response. Here's the pipeline. $gap_file_prefix points to the path of the GAP file without the .tsv extension. $vocab_file refers to the cased BERT vocab file.

#!/bin/bash
gap_file_prefix=$1   # path to the GAP file, without the .tsv extension
vocab_file=$2        # cased BERT vocab file

# Convert the GAP tsv into the jsonlines format expected by the model.
python gap_to_jsonlines.py $gap_file_prefix.tsv $vocab_file

# Run coreference prediction with the bert_base experiment config.
GPU=0 python predict.py bert_base $gap_file_prefix.jsonlines $gap_file_prefix.output.jsonlines

# Convert predictions back to GAP tsv format and score with the official scorer.
python to_gap_tsv.py $gap_file_prefix.output.jsonlines
python2 ../gap-coreference/gap_scorer.py --gold_tsv $gap_file_prefix.tsv --system_tsv $gap_file_prefix.output.tsv
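For reference, a hypothetical invocation of the script above (the script name, file prefix, and vocab path are placeholders, assuming a standard cased BERT-base checkpoint):

# Score the GAP test split, assuming the pipeline above is saved as run_gap.sh
# and the cased BERT vocab file has been downloaded locally.
bash run_gap.sh gap-test cased_L-12_H-768_A-12/vocab.txt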
  1. Table 2 is on test.
  2. The results seem to be off by 0.3 or so for BERT base; not sure what changed. The genre has very little effect (up to 0.1 IIRC) on the number. I got to 82.4 with the default genre (bc). (A quick way to check the genre prefix in the generated jsonlines is sketched below.)
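As a side note on the doc_key question: in this codebase (as in the original e2e-coref), the genre appears to be read from the leading characters of doc_key, so you can check which genre your GAP examples will be scored under by inspecting the generated jsonlines. A minimal sketch, with a hypothetical script name:

import json
import sys

# Usage: python check_genre.py gap-test.jsonlines
# Prints each document's doc_key together with its two-character prefix,
# which is assumed here to be what the genre embedding keys on.
with open(sys.argv[1]) as f:
    for line in f:
        doc = json.loads(line)
        doc_key = doc["doc_key"]
        print(doc_key, "-> genre:", doc_key[:2])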
HaixiaChai commented 5 years ago
  1. I found that all four numbers for e2e-coref in the first row are exactly the same as the results in the last row of Table 4 of "Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns". But that paper says those results are on the GAP development set, and it seems very unlikely that dev and test results would be identical. So could you please confirm whether the results in Table 2 really are on the GAP test set?
  2. Thank you for your pipeline and the bert_base result. I also got an Overall score of 82.4, so that is fine. However, my question is about the c2f_coref model. The pipeline could be the same, but the code should be slightly different to adapt it to c2f_coref. Can you reproduce the four numbers for the c2f-coref model?

Thanks a lot.

mandarjoshi90 commented 5 years ago
  1. I did not run the e2e-coref model. It looks like we copied from the wrong table for that row; I will amend the paper. We definitely evaluated on the test set for BERT.
  2. I don't have that handy right now, and I'm traveling until mid-November. IIRC, the only change should be to make sure that each element of the sentences field is a natural language sentence (as opposed to a paragraph, as with BERT), because c2f-coref contextualizes each sentence independently with LSTMs. (A rough sketch of this resplitting is included at the end of this comment.)

If that doesn't work, I'll take a look after I'm back. Thanks for your patience.
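To illustrate the change described in point 2, here is a minimal sketch of regrouping paragraph-level token lists into sentence-level lists. It assumes the jsonlines already contain word-level tokens, that cluster spans are indexed over the flattened token sequence (so regrouping does not change any indices), and that the field names match this repo's usual jsonlines format; the sentence splitter is a crude placeholder rather than the repo's actual preprocessing, and the script name is hypothetical.

import json
import sys

SENT_END = {".", "!", "?"}

def resplit(tokens):
    # Crude splitter: start a new sentence after ., ! or ?.
    # A real pipeline would use a proper sentence segmenter.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENT_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def regroup(flat, lengths):
    # Regroup a flat list into chunks of the given lengths.
    out, i = [], 0
    for n in lengths:
        out.append(flat[i:i + n])
        i += n
    return out

# Usage: python resplit_sentences.py input.jsonlines output.jsonlines
in_path, out_path = sys.argv[1], sys.argv[2]
with open(in_path) as f_in, open(out_path, "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        # Flatten paragraph-level "sentences" into one token sequence; cluster
        # spans are indexed over this flattened sequence, so they stay valid.
        flat_tokens = [t for sent in doc["sentences"] for t in sent]
        new_sentences = resplit(flat_tokens)
        doc["sentences"] = new_sentences
        if "speakers" in doc:
            # Keep the speakers field parallel to the new sentence grouping.
            flat_speakers = [s for sent in doc["speakers"] for s in sent]
            doc["speakers"] = regroup(flat_speakers, [len(s) for s in new_sentences])
        f_out.write(json.dumps(doc) + "\n")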

HaixiaChai commented 5 years ago
  1. Since gap_to_jsonlines.py also works with the tokenizer set to None, I used it that way. The Overall F1 score I got is 68.5, not the 73.5 reported in your paper. If you could rerun it and check which code you used, I would really appreciate it.
Hafsa-Masroor commented 4 years ago

@HaixiaChai Could you please share the detailed steps to test and evaluate this model on the GAP dataset? (I want to know what changes were made to the environment setup, commands, data, etc.) I am new to this research area and want to reproduce the results with both the GAP and OntoNotes datasets. Your help would be much appreciated.

Thanks!