Closed wgantt closed 2 years ago
It's possible that there's some misconfiguration in either the training setup or the evaluation code. Could you share the log file (and predictions file)? If it's inconvenient to do in a GitHub reply, or if you're worried about sharing OntoNotes documents online, feel free to email me instead.
Sure thing, here you go: coref_results.zip.
Thanks. I'm not seeing anything strikingly different between your config, the one in the repo, and the one I used to train the checkpoint. The only difference is the encoder_learning_rate: in my checkpoint, I had 1e-05, but this value should be ignored because the encoder shouldn't be attached to the computation graph. I also don't see anything immediately wrong with the predictions file.
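For reference, "not attached to the computation graph" means the encoder's parameters are frozen so no gradients flow into them, which is why the encoder learning rate is irrelevant. A minimal sketch of that pattern in PyTorch (the modules below are hypothetical stand-ins, not this repo's actual classes):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a frozen "encoder" and a trainable head.
encoder = nn.Linear(4, 4)
head = nn.Linear(4, 1)

# Freeze the encoder: its parameters receive no gradients,
# so any encoder_learning_rate setting is effectively ignored.
for p in encoder.parameters():
    p.requires_grad = False

x = torch.randn(2, 4)
loss = head(encoder(x)).sum()
loss.backward()

print(all(p.grad is None for p in encoder.parameters()))      # frozen encoder: no grads
print(all(p.grad is not None for p in head.parameters()))     # head: has grads
```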
(For future reference, to save time: dev F1 should hit around 77-78 after the first epoch, with loss around 10-15. If it doesn't, something is probably wrong.)
I'll try training again based on this repo and let you know if I get the same thing you're getting. There's a chance somewhere in the refactoring/code release last year, a bug was introduced.
Edit: I can confirm that I'm getting similarly low numbers as you are with the default config after one epoch of training. I'll look into this more tomorrow.
Thanks a lot, Patrick! I really appreciate it.
I think the encoder was not converted properly and is essentially randomly initialized. Since the encoder was overwritten at inference time, inference/loading looked okay. Interestingly, this means that around 50 F1 is roughly how well a coref model can be trained with a randomly initialized, frozen encoder, which seems surprisingly high to me.
The easy fix: download pytorch_model.bin from https://huggingface.co/shtoshni/spanbert_coreference_large/blob/main/pytorch_model.bin and replace the existing pytorch_model.bin with it. The md5sum hash doesn't match mine exactly, but I trust that it was converted correctly (and it's what I've been using since).
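If you want to sanity-check the downloaded file, comparing hashes is a quick way to do it. A minimal sketch (the hash you compare against is whatever you expect; nothing below is the real checksum):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks
    so large checkpoints don't need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage (substitute the hash you expect):
# print(md5_of_file("pytorch_model.bin"))
```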
The DIY fix (which is what I did and then forgot about): go to the transformers library code (site-packages/transformers/modeling_bert.py) and add something like the following at around L113 (see link for exact location):

```python
if "bert" not in name:
    continue
```
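For context, that snippet skips TensorFlow checkpoint variables whose names don't belong to the BERT encoder during the TF-to-PyTorch weight conversion loop. A self-contained illustration of the filtering (the variable names below are made-up examples, not the actual checkpoint contents):

```python
# Hypothetical variable names as they might appear in a TF checkpoint.
tf_variable_names = [
    "bert/encoder/layer_0/attention/self/query/kernel",
    "bert/embeddings/word_embeddings",
    "cls/predictions/output_bias",  # pretraining head, not the encoder
    "global_step",                  # optimizer bookkeeping, not weights
]

loaded = []
for name in tf_variable_names:
    # The fix: skip any variable that isn't a BERT encoder weight,
    # so stray variables can't interfere with loading the encoder.
    if "bert" not in name:
        continue
    loaded.append(name)

print(loaded)
```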
Training and evaluating on just the first 100 examples results in 71.7, and training for one full epoch (with eval on the full dev set) should get something in the 78s.
If this works for you, let me know so I can update the instructions in the README.
Thanks @pitrack. I'll give this a try and report back.
I can confirm that my results using the updated checkpoint match yours (both for the full dataset and for the first 100 examples). I haven't tried the DIY fix, but I'm thinking it would be better just to tell people to update the checkpoint anyway. Bit less janky, to my mind.
EDIT: feel free to close this issue once you update the README.
Thanks for helping out!
Thanks for the assistance!
Hi again @pitrack.
I recently tried to train your model using the default configuration supplied here, on segments of 512 tokens. The configuration there certainly seems to match the best settings described in the paper (and I am loading weights from SpanBERT), but the results I've obtained are substantially below what's reported: final MUC, CEAF-E, and B^3 dev F1 (for segments of length 512) are about 68.6, 48.7, and 44.8, respectively. I'm wondering if either:
Thanks!