Closed wgantt closed 2 years ago
It's possible that there's some misconfiguration in either the training setup or the evaluation code. Could you share the log file (and predictions file)? If it's inconvenient to do in a GitHub reply, or if you're worried about sharing OntoNotes documents online, feel free to email me instead.
Sure thing, here you go: coref_results.zip.
Thanks. I'm not seeing anything strikingly different between your config, the one in the repo, and the one I used to train the checkpoint. The only difference is the encoder_learning_rate: in my checkpoint, I had 1e-05, but this value should be ignored because the encoder shouldn't be attached to the computation graph. I also don't see anything immediately wrong with the predictions file.
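For reference, "not attached to the computation graph" means the encoder's parameters are frozen so no gradients flow into them, which is why the encoder learning rate is irrelevant. A minimal sketch of that pattern in PyTorch (the modules below are hypothetical stand-ins, not this repo's actual classes):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a frozen "encoder" and a trainable head.
encoder = nn.Linear(4, 4)
head = nn.Linear(4, 1)

# Freeze the encoder: its parameters receive no gradients,
# so any encoder_learning_rate setting is effectively ignored.
for p in encoder.parameters():
    p.requires_grad = False

x = torch.randn(2, 4)
loss = head(encoder(x)).sum()
loss.backward()

print(all(p.grad is None for p in encoder.parameters()))      # frozen encoder: no grads
print(all(p.grad is not None for p in head.parameters()))     # head: has grads
```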
(For future reference, to save time: dev F1 should hit around 77-78 after the first epoch, with loss around 10-15. If it doesn't, something is probably wrong.)
I'll try training again based on this repo and let you know if I get the same thing you're getting. There's a chance somewhere in the refactoring/code release last year, a bug was introduced.
Edit: I can confirm that I'm getting similarly low numbers as you are with the default config after one epoch of training. I'll look into this more tomorrow.
Thanks a lot, Patrick! I really appreciate it.
I think the encoder was not converted properly and is essentially randomly initialized. Since the encoder was overwritten at inference time, inference/loading looked okay. Interestingly, this means that around 50 F1 is roughly how well a coref model can be trained with a randomly initialized, frozen encoder, which seems surprisingly high to me.
The easy fix: download pytorch_model.bin from https://huggingface.co/shtoshni/spanbert_coreference_large/blob/main/pytorch_model.bin and replace the existing pytorch_model.bin with it. The md5sum hash doesn't match mine exactly, but I trust that it was converted correctly (and it's what I've been using since).
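If you want to sanity-check the downloaded file, comparing hashes is a quick way to do it. A minimal sketch (the hash you compare against is whatever you expect; nothing below is the real checksum):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks
    so large checkpoints don't need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (substitute the hash you expect):
# print(md5_of_file("pytorch_model.bin"))
```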
The DIY fix (which is what I did and then forgot about): go to the transformers library code (site-packages/transformers/modeling_bert.py) and add something like the following at around L113 (see link for exact location):

```python
if "bert" not in name:
    continue
```
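For context, that snippet skips TensorFlow checkpoint variables whose names don't belong to the BERT encoder during the TF-to-PyTorch weight conversion loop. A self-contained illustration of the filtering (the variable names below are made-up examples, not the actual checkpoint contents):

```python
# Hypothetical variable names as they might appear in a TF checkpoint.
tf_variable_names = [
    "bert/encoder/layer_0/attention/self/query/kernel",
    "bert/embeddings/word_embeddings",
    "cls/predictions/output_bias",  # pretraining head, not the encoder
    "global_step",                  # optimizer bookkeeping, not weights
]

loaded = []
for name in tf_variable_names:
    # The fix: skip any variable that isn't a BERT encoder weight,
    # so stray variables can't interfere with loading the encoder.
    if "bert" not in name:
        continue
    loaded.append(name)

print(loaded)
```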
Training and evaluating on just the first 100 examples results in 71.7, and training for one full epoch (with eval on the full dev set) should get something in the 78s.
If this works for you, let me know so I can update the instructions in the README.
Thanks @pitrack. I'll give this a try and report back.
I can confirm that my results using the updated checkpoint match yours (both for the full dataset and for the first 100 examples). I haven't tried the DIY fix, but I'm thinking it would be better just to tell people to update the checkpoint anyway. Bit less janky, to my mind.
EDIT: feel free to close this issue once you update the README.
Thanks for helping out!
Thanks for the assistance!
Hi again @pitrack.
I recently tried to train your model using the default configuration supplied here, on segments of 512 tokens. The configuration there certainly seems to match the best settings described in the paper (and I am loading weights from SpanBERT), but the results I've obtained are substantially below what's reported: final MUC, CEAF-E, and B^3 dev F1 (for segments of length 512) are about 68.6, 48.7, and 44.8, respectively. I'm wondering if either:
Thanks!