mandarjoshi90 / coref

BERT for Coreference Resolution
Apache License 2.0

can't run evaluate.py successfully #18

Closed: HaixiaChai closed this issue 4 years ago

HaixiaChai commented 5 years ago

Hello,

After creating the test jsonlines files and changing eval_path in experiments.conf to point at one of them, I can only run bert_base with test.english.128.jsonlines (F1 is 73.38). The other runs (I only tested bert_base and bert_large) all fail with:

tensorflow.python.framework.errors_impl.OutOfRangeError: Read less bytes than requested [[{{node checkpoint_initializer_337}}]] [[{{node checkpoint_initializer_3}}]]
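For reference, this is roughly what my setup looks like (the config key and path are from my local copy, so they may not match experiments.conf exactly):

```
# sketch of my experiments.conf override -- exact key names may differ
bert_base = ${best} {
  eval_path = test.english.128.jsonlines   # segment length chosen to match the model
}
```

and I evaluate with something like `GPU=0 python evaluate.py bert_base`.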

Could you help me check what's wrong, and whether the current repo can be run successfully?

Many thanks~

mandarjoshi90 commented 5 years ago

Hi Haixia, from what I understand, you were able to run the base model but not the large one. Is that correct? What kind of GPU do you have? From the README:

Finetuning a BERT/SpanBERT large model on OntoNotes requires access to a 32GB GPU. You might be able to train the large model with a smaller max_seq_length, max_training_sentences, ffnn_size, and model_heads = false on a 16GB machine; this will almost certainly result in relatively poorer performance as measured on OntoNotes.

Running/testing a large pretrained model is still possible on a 16GB GPU. You should be able to finetune the base models on smaller GPUs.

Basically, you can't run the large model on 12GB GPUs. But if you use SpanBERT base, you should get better numbers than BERT-large, so I'd recommend using that.
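If you do want to try the large model on a 16GB card, the overrides would look roughly like this (an illustrative sketch only; the values aren't tuned and the key names should be checked against experiments.conf):

```
# hypothetical reduced-memory config, following the README's suggestion
bert_large_16gb = ${bert_large} {
  max_seq_length = 256
  max_training_sentences = 3
  ffnn_size = 1000
  model_heads = false
}
```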

HaixiaChai commented 5 years ago

Hello Mandar,

Thank you for your reply. With the same environment, I have now evaluated bert_base, spanbert_base, and spanbert_large successfully, but I still get the same error as above for bert_large.

I then inspected the downloaded model and found that it has no checkpoint file, while the other three models do. The error message also mentions the checkpoint, so I guess the failure is caused by the missing checkpoint file.

Could you check whether that is the case, or whether something else is causing it, please?

mandarjoshi90 commented 5 years ago

It seems that the model file got corrupted somehow. I've replaced the model with a newer but equivalent weight file. Could you please download bert_large again and check?
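Once you have it, you can sanity-check that the checkpoint is complete with something like this (rough TF 1.x sketch, not verified; adjust the directory name to wherever you unpacked the model):

```python
# Rough sanity check that a downloaded checkpoint is readable end to end.
import tensorflow as tf

ckpt = tf.train.latest_checkpoint("bert_large")
if ckpt is None:
    # tf.train.latest_checkpoint needs the `checkpoint` index file,
    # which is the file you found missing from the old download.
    print("No `checkpoint` file in the model directory")
else:
    reader = tf.train.load_checkpoint(ckpt)
    # Reading every tensor touches the .data shards, so a truncated download
    # should fail here with the same "Read less bytes than requested" error.
    for name in reader.get_variable_to_shape_map():
        reader.get_tensor(name)
    print("Checkpoint looks complete:", ckpt)
```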

HaixiaChai commented 5 years ago

I downloaded the bert_large model again and found that two more files are missing: bert_config.json and vocab.txt. Could you add them, please? Thanks a lot.

mandarjoshi90 commented 5 years ago

Done. I also changed the directory name to bert_large.

HaixiaChai commented 5 years ago

Thanks. I can run bert_large successfully now, but the avg F1 is two points lower than the number you reported. Could you kindly check whether the uploaded model is the best version? For the other three models, my results are similar to yours.

mandarjoshi90 commented 5 years ago

Is this OntoNotes? Can you share the exact results along with the logs, please? I can't run large models on my local machine; I only have access to the logs, which indicate the same dev set performance on OntoNotes (77.35 F1).

HaixiaChai commented 5 years ago

OK. I am evaluating the bert_large model on the OntoNotes test set. The avg F1 score I got is 74.92, while you reported 76.9.

mandarjoshi90 commented 5 years ago

I'll try to find out what's going on from my coauthors, who have access to larger machines. Please give me a few days. Meanwhile, could you check the dev numbers? It should be 77.35 according to the logs.

HaixiaChai commented 5 years ago

No problem, take your time. I have run the bert_large model on the dev dataset, and the F1 score is 75.13.

mandarjoshi90 commented 4 years ago

It's quite odd that your F1 is two points lower. My coauthors were able to reproduce the results on the FB machines, albeit with an older version of the code. Are you seeing a 2-point drop for SpanBERT large as well?

HaixiaChai commented 4 years ago

No, the F1 of the SpanBERT large model on the test set is similar to yours, so no problem there. So, what's the difference between the older version and the current online version? Could you evaluate the bert_large model you uploaded a few days ago to check whether its F1 score is correct?

HaixiaChai commented 4 years ago

Sorry, I just found that I used the 128 jsonlines file when it should have been 384. Sorry for the trouble, and thank you for your help.
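For anyone who hits the same thing later: the segment length in eval_path has to match the model, so for bert_large it should be roughly (key names again from my local config, so double-check against experiments.conf):

```
bert_large = ${best} {
  eval_path = test.english.384.jsonlines   # 384-token segments for bert_large
}
```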