I have the same question. Also, why does evaluating your 'roberta-teacher.pt' on the dev set produce {'dev_F1': 3.8303985525712245, 'dev_F1_ign': 3.3577260304359418, 'dev_P': 1.9533883368502445, 'dev_R': 97.9633401221996}, 'roberta-kd-pretrained.pt' produce {'dev_F1': 4.342814133327975, 'dev_F1_ign': 3.810714189155657, 'dev_P': 2.2212172405029227, 'dev_R': 96.83095723014257}, and 'roberta-continue-trained-best.pt' produce {'dev_F1': 6.532345827453666, 'dev_F1_ign': 5.736407664182513, 'dev_P': 3.38170794261903, 'dev_R': 95.60081466395111}?
Thank you for your interest. I will look into each of your questions individually. @WatsonWangZh Your results seem far off; may I know how you produced them?
I just changed the load_path argument to each of the named checkpoints and evaluated with the eval_roberta.sh script.
@WatsonWangZh The procedure is correct. However, the scores are strange; the recall is very high in every case. I re-pulled this repository and downloaded the models from Google Drive. The results I got are {'dev_F1': 58.13124661036646, 'dev_F1_ign': 55.552728278750585, 'dev_P': 55.61485434734267, 'dev_R': 60.88614785360708} for 'roberta-kd-pretrained.pt', {'dev_F1': 64.07267598622806, 'dev_F1_ign': 62.04467547520846, 'dev_P': 65.53801765105227, 'dev_R': 62.671427412156135} for 'roberta-teacher.pt', and {'dev_F1': 67.14393668217794, 'dev_F1_ign': 65.09684178475986, 'dev_P': 65.84243369734789, 'dev_R': 68.4979306986935} for 'roberta-continue-trained-best.pt'. I am not sure why your evaluation results are so far off. Your numbers show very high recall and low precision, which indicates that the threshold class has low logits for all examples. Did you change any parameters related to the threshold class, such as num_labels?
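For reference, here is a minimal sketch (my own illustration, not the exact code in this repo) of how a threshold (TH) class is typically used at prediction time: a relation is only output when its logit exceeds the TH logit of that entity pair. If the TH logit is low for every pair, almost every relation clears the threshold, which produces exactly the high-recall / low-precision pattern above.

```python
# Sketch of ATLOP-style adaptive thresholding (illustration only, not this
# repo's exact code): column 0 is the threshold (TH) class, and a relation
# is predicted only when its logit beats the TH logit.
import torch

def predict(logits: torch.Tensor) -> torch.Tensor:
    """logits: [num_pairs, num_classes]; column 0 is the TH class."""
    th_logit = logits[:, 0].unsqueeze(1)   # per-pair threshold logit
    preds = (logits > th_logit).float()    # keep relations that beat the threshold
    preds[:, 0] = 0.0                      # never output the TH class itself
    return preds                           # all zeros -> the pair is NA

# A pair whose TH logit is very low gets flooded with predictions; a pair
# whose TH logit is high gets none. Misconfiguring num_labels or the label
# mapping can push the TH logit down for every pair.
logits = torch.tensor([[-5.0, 1.2, 0.3, -0.7],   # low TH  -> 3 relations predicted
                       [ 2.0, 1.2, 0.3, -0.7]])  # high TH -> NA (no relation predicted)
print(predict(logits))
```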
Yes, you're right. Thanks for your reply.
@locurryve I ran into the same problem when reproducing. I wonder whether the random seed causes small differences in the result, especially during knowledge distillation. I think the authors should try different random seeds and report the average result.
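Something like the sketch below is what I mean: collect the dev F1 from runs with several different seeds (the values here are placeholders, not real results) and report the mean and standard deviation instead of a single run.

```python
# Sketch: average dev F1 over several random seeds instead of a single run.
# The seeds and scores below are placeholders, not real results.
from statistics import mean, stdev

dev_f1_by_seed = {66: 0.0, 42: 0.0, 7: 0.0}  # fill in the dev_F1 of each run
scores = list(dev_f1_by_seed.values())
print(f"dev_F1 = {mean(scores):.2f} +/- {stdev(scores):.2f} over {len(scores)} seeds")
```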
Maybe so. I have only tried step 1, training a teacher model, and have not run all the training steps. In my experiments, different random seeds do give somewhat different results, but the gap between the original results (without knowledge distillation) and my averaged step-1 reproductions remains.
Hmm, the student model may, in a sense, perform better than the teacher model on this dataset. I have run into this situation before, and I think the student may not perform stably.
Did you evaluate the teacher model after step 1? Is the result the same as the original result without knowledge distillation reported in the paper?
I only evaluated the final student model.
Okk, thank you~
@locurryve Hi, I tried re-running the experiments and they seem to work fine. Actually, when this repo was released, some parameters had been tuned towards another work in progress at the time. One key finding is that the DocRED evaluation data is highly incompletely annotated: roughly 60+% of the true triples are not reflected in the evaluation sets. Hence precision and recall tend to fluctuate between epochs. My empirical observation is that evaluating on Re-DocRED is more stable: https://arxiv.org/abs/2205.12696.
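To make the effect concrete, here is a toy example (made-up triples, not DocRED data) of why incomplete annotation drags measured precision down and makes the scores fluctuate: a prediction that is factually correct but missing from the annotated dev set is still scored as a false positive.

```python
# Toy illustration of scoring under incomplete annotation (made-up triples).
annotated_gold = {("A", "capital_of", "B"), ("C", "member_of", "D")}
truly_correct  = annotated_gold | {("E", "located_in", "F")}  # true but unannotated

predictions = {("A", "capital_of", "B"), ("E", "located_in", "F")}

tp = len(predictions & annotated_gold)
measured_precision = tp / len(predictions)     # 1/2 = 0.50, penalised for a correct triple
measured_recall    = tp / len(annotated_gold)  # 1/2 = 0.50
actual_precision   = len(predictions & truly_correct) / len(predictions)  # 2/2 = 1.00
print(measured_precision, measured_recall, actual_precision)
```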
I see, so that's how things stand. Thanks a lot!
Hi, thanks for your clear and great work.
My problem is that I couldn't reproduce the results (without distant supervision) reported in the paper. I wonder whether some hyperparameter in the current version of the code is not optimal for reproducing the original results. Do I need to modify any hyperparameters?
In my reproduction, I simply cloned the code and ran batch_roberta.sh directly on a single NVIDIA V100 GPU with 32 GB of memory, without modifying any hyperparameters. The specific settings are: train_batch_size 4, gradient_accumulation_steps 1, num_labels 4, learning_rate 3e-5, classifier_lr 1e-4, max_grad_norm 1.0, warmup_ratio 0.06, and num_train_epochs 30, all the same as in the original script. My reproduced results with RoBERTa-large are around 63.2 dev F1 and 61.2 dev Ign F1, which are lower by a clear margin than the original results in the paper.
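For completeness, here are the settings above restated as a plain dict, just for readability (this is not the repo's actual config format):

```python
# Unmodified batch_roberta.sh hyperparameters as I ran them, restated as a
# plain Python dict for readability (not the repo's actual config object).
hparams = {
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "num_labels": 4,
    "learning_rate": 3e-5,
    "classifier_lr": 1e-4,
    "max_grad_norm": 1.0,
    "warmup_ratio": 0.06,
    "num_train_epochs": 30,
}
```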
By the way, it seems that some hyperparameters are hard-coded in the model, and I'm not sure whether they also work for bert-base. Could you please share the hyperparameters for bert-base and roberta-large respectively, and give some advice on reproducing the results?
Thanks a lot and sorry to bother!