nju-websoft / OpenEA

A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs, VLDB 2020
GNU General Public License v3.0

Predict methods for models #16

Closed sven-h closed 3 years ago

sven-h commented 3 years ago

Hi,

thank you for your great library. I would like to use it for KG matching. As far as I can see, you evaluate the models on a test set but do not provide the full extracted alignment (or am I wrong?). I think it would be a good idea to have two kinds of predict methods: 1) given a correspondence between e1 and e2, return the distance/confidence between them; 2) given a value k (for top-k retrieval), extract all correspondences (for example via greedy search) and write them to a file where each line contains e1, e2, and the distance/confidence. One could then use and further experiment with these correspondences, as well as evaluate them 'offline'. What do you think?
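For illustration, a rough sketch of what the two proposed predict methods could look like. The function names, signatures, and the plain numpy cosine similarity are all hypothetical; this is not part of OpenEA's current API:

```python
import numpy as np

def predict_pair(embeds1, embeds2, idx1, idx2):
    """Method 1 (sketch): cosine similarity between entity idx1 of KG1
    and entity idx2 of KG2."""
    v1, v2 = embeds1[idx1], embeds2[idx2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def predict_top_k(embeds1, embeds2, k, out_path=None):
    """Method 2 (sketch): for every entity in KG1, retrieve its k most
    similar KG2 entities and optionally write (e1, e2, score) lines to a file."""
    # Normalize rows so a dot product equals cosine similarity.
    a = embeds1 / np.linalg.norm(embeds1, axis=1, keepdims=True)
    b = embeds2 / np.linalg.norm(embeds2, axis=1, keepdims=True)
    sim = a @ b.T
    rows = []
    for e1 in range(sim.shape[0]):
        for e2 in np.argsort(-sim[e1])[:k]:
            rows.append((e1, int(e2), float(sim[e1, e2])))
    if out_path is not None:
        with open(out_path, "w") as f:
            for e1, e2, s in rows:
                f.write(f"{e1}\t{e2}\t{s:.6f}\n")
    return rows
```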

I have two more questions:

1) Regarding the greedy search: if the numbers of entities in the two graphs are not equal (contrary to your datasets), then always picking the first KG as the side to search from determines how many correspondences are extracted. The question is whether to choose the KG with fewer or more entities. 2) During testing, you look up the embeddings of only those entities appearing in the test set (see line 117 in the basic model, as well as in the other models), and based on these embeddings you compute the greedy alignment and the evaluation measures (e.g. hits@k, see line 131). Is this correct, or am I overlooking something? If it is correct, you would only rank elements that appear in the test set, which would be a kind of leakage, because in a prediction step you would rank all elements and choose the best k. If the model ranks elements not in the test set very high, these would not appear in the evaluation. If the computation of the nearest neighbours is very costly, maybe libraries for nearest-neighbour search would help (faiss, annoy, and a benchmark)? Could you elaborate a bit more on this?
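To make the first question concrete, here is a minimal sketch (a hypothetical helper in plain numpy, not OpenEA code) of nearest-neighbour alignment from one KG to the other. Every source entity receives exactly one correspondence, so the choice of source KG directly fixes how many pairs come out:

```python
import numpy as np

def align_from(source_emb, target_emb):
    """Nearest-neighbour alignment from source to target by inner product.
    Each source entity gets exactly one correspondence, so the number of
    extracted pairs equals the size of the KG chosen as the source side."""
    sim = source_emb @ target_emb.T
    return [(i, int(np.argmax(sim[i])), float(sim[i].max()))
            for i in range(sim.shape[0])]
```

Running this in both directions on unbalanced KGs yields |KG1| pairs one way and |KG2| pairs the other, which is exactly the ambiguity the question is about.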

Stay healthy.

Best regards, Sven

sunzequn commented 3 years ago

Hi Sven,

Sorry for my late reply, and thank you for your interest in our work.

As for the proposed two kinds of prediction methods, I will consider adding these two methods in OpenEA.

On your first question regarding the greedy search: the alignment direction indeed affects the final performance, but the effect depends on both the data distribution and your embedding model, so you can try both alignment directions. I will add functions in OpenEA to support entity alignment of unbalanced KGs.

On your second question regarding the test method: we already know the counterparts of the entities in the training and validation data, so only the test entities in the source KG need to be considered when finding their counterparts among the test entities of the target KG. There is no leakage in the training stage. In the case you mentioned, where some elements may rank highly but are not in the test set, those elements are entities in the training data. The objective of training is to align these training entities and to distinguish aligned from unaligned entities. In the test stage, we do not need to consider entities in the training data.

As for computation complexity, I will add an alignment inference method using Faiss.
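For reference, a minimal sketch of what such an inference step could look like. It assumes faiss's `IndexFlatIP` (an exact inner-product index) and falls back to an exact numpy search when faiss is not installed; this is illustrative only, not the eventual OpenEA implementation:

```python
import numpy as np

def nn_inference(embeds1, embeds2, k):
    """Return, for each KG1 entity, the indices of its k most similar
    KG2 entities by inner product."""
    q = np.ascontiguousarray(embeds1, dtype=np.float32)
    base = np.ascontiguousarray(embeds2, dtype=np.float32)
    try:
        import faiss  # exact inner-product index; an IVF index could replace it at scale
        index = faiss.IndexFlatIP(base.shape[1])
        index.add(base)
        _, idx = index.search(q, k)
        return idx
    except ImportError:
        # Exact brute-force fallback with the same semantics.
        sim = q @ base.T
        return np.argsort(-sim, axis=1)[:, :k]
```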

Thanks for your suggestions!

Best, Zequn

sven-h commented 3 years ago

Hi Zequn,

thank you for your answer.

"As for the proposed two kinds of prediction methods, I will consider adding these two methods in OpenEA."

That would be really great, thank you very much. Do you know when you will have time to implement it? If you can give me some pointers on how you plan to implement it, I might be able to help a little, so that the implementation is ready sooner rather than later.

"In the case you mentioned where some elements may have high ranks but not in the test set, these elements are entities in training data."

Okay, I understand that all entities in your dataset are aligned, and each belongs to either the train, validation, or test set. For such an evaluation, you probably "retrain" your approaches on the union of the train and validation sets, to be sure that all entities outside the test set have been seen by the model. So your evaluation works only for such datasets, which is perfectly fine for me.

I just want to try out the approaches in a more general setting where there are more entities (which may not be aligned at all). I will then do the evaluation on my own, based on the results of the predict methods discussed above.

"As for computation complexity, I will add an alignment inference method using Faiss."

You do not necessarily have to add it; it was just a hint/question about whether this would speed up the computation.

Best regards, Sven

sunzequn commented 3 years ago

Hi Sven,

You can implement the two prediction methods based on the existing functions '_greedy_alignment(embed1, embed2, top_k, nums_threads, metric, normalize, cslsk, accurate)' and '_find_alignment(sim_mat, simth, k)'.

I will add these methods this weekend.

sven-h commented 3 years ago

Hi Zequn,

that would be really great. Thank you once again. If the methods are added, I will test them and give you feedback.

Best regards, Sven

sunzequn commented 3 years ago

Hi Sven,

I have added the method for top-k retrieval. It seems that you can also use it for your first function.

Best, Zequn

sven-h commented 3 years ago

I have added the other method as well, for the prediction of given entities, in another pull request.