wangbo9719 / StAR_KGC

84 stars 10 forks source link

Reproducibility #3

Closed martinsvat closed 2 years ago

martinsvat commented 3 years ago

Hi guys, good work, but I struggle a bit with reproducing your results. It's nothing serious, but it would be better to have a clone-and-use approach. So far I have encountered these small obstacles:

Please, can you provide a fix for the missing similarity_score_mtx.npy file? I could simply remove the commented line, but there is no mention of how to use get_ensembled_data.py.

best Martin

wangbo9719 commented 3 years ago

Thanks for reporting these bugs. Sorry for the late response.

What you said is correct; sorry for the bad running experience. I have fixed these issues in this version. Now, similarity_score_mtx.npy can be obtained by uncommenting the call to the get_similarity() function in get_ensembled_data.py.

Thanks again for your report.

martinsvat commented 3 years ago

Thanks for the fix! However, I'm still having trouble getting the same results as those in your paper. Namely, to reproduce WN18RR, I run the commands as stated in the README.

Now, taking the average of these, e.g. Hits@1, does not match the paper: (tail-Hits@1 + head-Hits@1) / 2 = 0.24712827058072752, whereas the paper reports 0.459 (Table 4).
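For clarity, this is the kind of averaging I mean (a minimal sketch; the metric names and numbers here are illustrative placeholders, not the actual output files):

```python
# Minimal sketch of averaging head- and tail-direction metrics.
# Metric names and values are illustrative, not the repo's real format.

def average_directions(head_metrics, tail_metrics):
    """Average per-direction link-prediction metrics key by key."""
    return {k: (head_metrics[k] + tail_metrics[k]) / 2 for k in head_metrics}

head = {"hits@1": 0.20}  # hypothetical head-prediction Hits@1
tail = {"hits@1": 0.28}  # hypothetical tail-prediction Hits@1
print(average_directions(head, tail))  # averaged Hits@1, ~0.24
```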

Please, can you provide more information (e.g. the hyperparameter setup for the RotatE model) so I can get the exact results from the paper? Or did I use the commands incorrectly somewhere? (For example, should I have executed the last one, ensemble/run.py, twice with different modes: the first time with train and the second time with --init?) I'd like to use the StAR model, but I need correct results as a starting point.

best Martin

wangbo9719 commented 3 years ago

Your commands seem correct. I just used the official RotatE hyperparameters to train the model on WN18RR.

Unfortunately, the trained model whose results are reported in the paper was lost. I will reproduce the results soon and share them with you.

By the way, how about your obtained results of StAR and RotatE on WN18RR?

martinsvat commented 3 years ago

> By the way, how about your obtained results of StAR and RotatE on WN18RR?

Final lines from train.log:

Valid MRR at step 79999: 0.478470
Valid MR at step 79999: 3284.908372
Valid HITS@1 at step 79999: 0.432597
Valid HITS@3 at step 79999: 0.493243
Valid HITS@10 at step 79999: 0.571523
Evaluating on Test Dataset...
...
Test MRR at step 79999: 0.476083
Test MR at step 79999: 3369.924059
Test HITS@1 at step 79999: 0.428207
Test HITS@3 at step 79999: 0.494416
Test HITS@10 at step 79999: 0.571315

wangbo9719 commented 3 years ago

And the results of StAR?

martinsvat commented 3 years ago

Sorry, and thanks for the help. Here is the content of WN18RR_roberta-large/link_prediction_metrics.txt:

Hits left @1: 0.20261646458200383
Hits right @1: 0.2782386726228462
###Hits @1: 0.240427568602425
Hits left @3: 0.45213784301212506
Hits right @3: 0.5188257817485641
###Hits @3: 0.4854818123803446
Hits left @10: 0.6668793873643906
Hits right @10: 0.7479259731971921
###Hits @10: 0.7074026802807913
Mean rank left: 57.20835992342055
Mean rank right: 53.99298021697511
###Mean rank: 55.60067007019783
Mean reciprocal rank left: 0.3616734820860267
Mean reciprocal rank right: 0.4341342479524534
###Mean reciprocal rank: 0.39790386501924

These numbers seem quite similar to those in Table 4, and RotatE's results are also quite close to Table 4.

wangbo9719 commented 3 years ago

Got it. I will try to find out the reason and tell you later.

martinsvat commented 3 years ago

Hi, any success reproducing the results?

Meanwhile, I have another question regarding the ensemble model. It is learned twice, once for the tail prediction task and once for the head prediction task, right? So, if one has a slightly different task, namely predicting the validity of a triple (e1, r2, e3), one has to feed the query (e1, r2, e3) to both the head-learned and the tail-learned model and average their outputs, right?
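In other words, something like this (a sketch only; score_head and score_tail are my placeholder names for the two separately learned models, not names from the repo):

```python
# Hypothetical sketch: scoring a single triple (e1, r2, e3) by averaging
# the outputs of the head-learned and tail-learned ensemble models.
# score_head / score_tail are placeholder callables, not repo functions.

def triple_score(e1, r2, e3, score_head, score_tail):
    return (score_head(e1, r2, e3) + score_tail(e1, r2, e3)) / 2

# Toy stand-ins for the two learned models:
score = triple_score("e1", "r2", "e3",
                     score_head=lambda h, r, t: 0.8,
                     score_tail=lambda h, r, t: 0.6)
print(score)  # ≈ 0.7
```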

thx

wangbo9719 commented 3 years ago

Sorry for the very late response.

There were some bugs in the code and commands before. Thanks for reporting them. I have updated this repo. To reproduce the ensemble results, please follow the new version and rerun the last command in section 5.1:

CUDA_VISIBLE_DEVICES=3 python ./codes/run.py \
    --cuda --init ./models/RotatE_wn18rr_0 \
    --test_batch_size 16 \
    --star_info_path /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
    --get_scores --get_model_dataset 

By the way, the performance of the ensemble model may not be stable enough. For the command in section 5.2, you can just use 'add' for --feature_method and run do_prediction only to get a suboptimal result, which corresponds to StAR (Ensemble) in Table 4 of the paper.
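Roughly speaking, the idea behind an 'add'-style combination is to sum the candidate scores of the two models before ranking. A minimal sketch of that kind of combination (the min-max normalization and the numbers are illustrative only, not the exact implementation in this repo):

```python
# Illustrative 'add'-style ensemble: normalize each model's candidate
# scores for a query, sum them, and rank. Scores below are made up.

def min_max(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

star_scores   = [0.1, 0.9, 0.4]  # made-up StAR scores for 3 candidates
rotate_scores = [2.0, 1.0, 3.0]  # made-up RotatE scores, same candidates

combined = [a + b for a, b in zip(min_max(star_scores),
                                  min_max(rotate_scores))]
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # candidate 2 ranks first under the added scores
```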

For your second question, I think what you described is one way to handle the triple classification task. Alternatively, you can modify the code to adapt it to that task; you can refer to the code of KG-BERT, which implements triple classification.

wangbo9719 commented 3 years ago

Sorry, I fixed a small bug just now. If you followed the previous version, the generated files were saved under the wrong paths and names. You can move the files to the correct directory.