pykeen / pykeen

🤖 A Python library for learning and evaluating knowledge graph embeddings
https://pykeen.readthedocs.io/en/stable/
MIT License

Request for more documentation on sklearn based evaluation. #483

Closed: rohithteja closed this issue 3 years ago

rohithteja commented 3 years ago

Hello. Firstly, great work and a very good compilation of knowledge graph embedding methods. The implementation of rank-based evaluation is clear, but there is very little documentation on sklearn-based evaluation. I would like to know more details on how the sklearn-based evaluation is done, and perhaps see some comparison of the two evaluation methods.

Thank you.

cthoyt commented 3 years ago

Hi @rohithteja. It's definitely the case that the scikit-learn evaluation needs some love. If you're interested in using that, it would be great if you could help us out. Some of the things we're not really sure about:

  1. Is there a reason we shouldn't always report both rank-based metrics and sklearn metrics? The evaluation takes place with an evaluator class that lets us group all of the methods that calculate metrics in a similar way. However, we also have a different results class for each kind. I'm now less sure this makes sense than when we originally implemented it.
  2. There are a few ways of aggregating ranks such that they can be passed to sklearn-style metrics. Of course, you have to keep in mind that the KGEMs only use positive triples, so to make these metrics meaningful, there are a couple of hoops to jump through. I admit I'm not super knowledgeable about this; @lvermue did the heavy lifting in the implementation and might have more to say.

But if you want to use sklearn as the evaluator, it should be easy enough to just say that's what you want in the pipeline:

from pykeen.pipeline import pipeline

result = pipeline(
    model='RotatE',
    dataset='Kinships',
    evaluator='sklearn',
)
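
Once that runs, the computed metrics should be available on the returned result object. Here's a minimal sketch of how to inspect them; the attribute and metric names are from memory and may differ between versions, so check the API docs:

# inspect the metrics produced by the sklearn evaluator
# (attribute/metric names assumed; verify against the current API docs)
print(result.metric_results)
print(result.metric_results.get_metric('roc_auc_score'))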

There were a few warnings related to filtered evaluation that popped up when I did this, so we probably want to look into those as well.

rohithteja commented 3 years ago

Thanks for the reply. For my initial experiments on some datasets, I used both rank-based and sklearn-based evaluation. I got different embeddings with each evaluation method. When I used the generated embeddings for a node classification task, the embeddings obtained with the sklearn evaluation performed better. This is why I am interested in learning more about the sklearn evaluation.

I tried debugging the code to understand the implementation. Could you explain more about the "y_true" variable used in the code for the sklearn evaluation?

def finalize(self) -> SklearnMetricResults:
    all_keys = list(self.all_scores.keys())
    y_score = np.concatenate([self.all_scores[k] for k in all_keys], axis=0).flatten()
    y_true = np.concatenate([self.all_positives[k] for k in all_keys], axis=0).flatten()

I am thinking it is just the mask values of the test triples in the graph (1 if a triple is present in the test set and 0 when it is absent). If so, why aren't we considering the corrupted triples for evaluation (as you mentioned, we are only using positive triples for evaluation)? I am still learning this subject, so feel free to correct me; I would appreciate a clear explanation from someone who worked on it.

mberr commented 3 years ago

Thanks for the reply. For my initial experiments on some datasets, I used both rank-based and sklearn-based evaluation. I got different embeddings with each evaluation method. When I used the generated embeddings for a node classification task, the embeddings obtained with the sklearn evaluation performed better. This is why I am interested in learning more about the sklearn evaluation. (by @rohithteja)

I guess here you are referring to using the metrics for model selection / early stopping? The rank-based evaluator (as well as the sklearn one) produces multiple different metrics, all of which measure different aspects of performance. You may want to experiment with different metrics. A metric commonly used in link prediction repositories is the mean reciprocal rank (MRR), which is also called inverse_harmonic_mean_rank in our repository since we generalized the rank aggregation metrics in #381.
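
For intuition, the MRR is just the mean of the reciprocal ranks, which is the same as the inverse of the harmonic mean of the ranks (hence the name). A quick illustrative sketch with NumPy, not PyKEEN's actual implementation:

import numpy as np

# toy ranks of the true entities among all candidates (1 = best)
ranks = np.array([1, 3, 2, 10, 1], dtype=float)

mrr = np.mean(1.0 / ranks)                        # mean reciprocal rank
harmonic_mean_rank = len(ranks) / np.sum(1.0 / ranks)
assert np.isclose(mrr, 1.0 / harmonic_mean_rank)  # hence "inverse harmonic mean rank"
print(mrr)  # ~0.587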

Is there a reason we shouldn't always report both rank-based metrics and sklearn metrics? The evaluation takes place with an evaluator class that lets us group all of the methods that calculate metrics in a similar way. However, we also have a different results class for each kind. I'm now less sure this makes sense than when we originally implemented it. (by @cthoyt)

There likely is a difference in performance: the sklearn metrics require the "raw" scores, i.e., arrays of shape (num_entities,) for each triple and prediction side, and they use scikit-learn's CPU implementation. For rank-based evaluation, we compute the ranks on the GPU and only keep one rank for each triple and side. Thus, the rank-based evaluator should be faster and consume less memory.
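
To make that concrete, here is an illustrative sketch (not the actual PyKEEN code) of what each evaluator keeps for a single evaluation triple and side:

import numpy as np

num_entities = 10_000
scores = np.random.rand(num_entities)  # scores over all candidate entities for one side
true_idx = 42                          # index of the true entity

# rank-based evaluation: the whole score vector is reduced to a single rank
# (one common convention: count how many candidates score at least as high)
rank = int((scores >= scores[true_idx]).sum())

# sklearn-based evaluation: the full (num_entities,) score vector has to be kept
# (for every triple and side) so that all of them can later be concatenated and
# passed to the CPU-based sklearn metric functions
kept_scores = scores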

Could you explain more about the "y_true" variable used in the code for the sklearn evaluation? [...] I am thinking it is just the mask values of the test triples in the graph (1 if a triple is present in the test set and 0 when it is absent). If so, why aren't we considering the corrupted triples for evaluation (as you mentioned, we are only using positive triples for evaluation)? I am still learning this subject, so feel free to correct me; I would appreciate a clear explanation from someone who worked on it.

Our evaluation currently always uses 1-n scoring (ignoring entity-restricted evaluation for now), i.e., for each evaluation triple (h, r, t) we compute scores for (e, r, t) and (h, r, e) for all entities e. Thus, no sampling by corruption is applied during evaluation; rather, all entities are considered. This ensures more reliable results and is consistent with the evaluation protocol of most benchmark datasets (for very large graphs, e.g., from OGB, sampling is sometimes used again).

For the sklearn metrics, we keep exactly these score arrays and construct a one-hot label array just as you described.
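
For illustration, the label construction and the metric computation for a single triple and side look roughly like this simplified sketch (not the actual PyKEEN code):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

num_entities = 5
# scores for (h, r, e) over all candidate tail entities e
y_score = np.array([0.1, 2.3, -0.5, 0.7, 1.1])
# binary labels: 1 only at the index of the true tail entity
y_true = np.zeros(num_entities)
y_true[1] = 1.0

print(roc_auc_score(y_true, y_score))            # 1.0: the true entity scores highest
print(average_precision_score(y_true, y_score))  # 1.0 here as well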

rohithteja commented 3 years ago

I guess here you are referring to using the metrics for model selection / early stopping? The rank-based evaluator (as well as the sklearn one) produces multiple different metrics, all of which measure different aspects of performance. You may want to experiment with different metrics. A metric commonly used in link prediction repositories is the mean reciprocal rank (MRR), which is also called inverse_harmonic_mean_rank in our repository since we generalized the rank aggregation metrics in #381.

Yes, I used the metrics to select the best model, and I understand now.

Our evaluation currently always uses 1-n scoring (ignoring entity-restricted evaluation for now), i.e., for each evaluation triple (h, r, t) we compute scores for (e, r, t) and (h, r, e) for all entities e. Thus, no sampling by corruption is applied during evaluation; rather, all entities are considered. This ensures more reliable results and is consistent with the evaluation protocol of most benchmark datasets (for very large graphs, e.g., from OGB, sampling is sometimes used again).

For the sklearn metrics, we keep exactly these score arrays and construct a one-hot label array just as you described.

Thanks a lot for the clarification. I might be wrong about this, and just to make it clear: from your answer ("compute scores for (e, r, t) and (h, r, e) for all entities"), are any of the (e, r, t) or (h, r, e) triples negative examples (i.e., not present in the graph), since all entities are considered when computing the scores?

mberr commented 3 years ago

Thanks a lot for the clarification. I might be wrong about this, and just to make it clear: from your answer ("compute scores for (e, r, t) and (h, r, e) for all entities"), are any of the (e, r, t) or (h, r, e) triples negative examples (i.e., not present in the graph), since all entities are considered when computing the scores?

Yes, kind of. Some (likely most) of the scores are for triples which are not part of the KG, neither in the training nor the evaluation part, and are thus not known to be true. However, they may still be true and simply unknown to us.

For the sklearn evaluation, we put them into the negative class. Thus, we investigate how well the model can separate the positive class from the unknown (= negative) class.

rohithteja commented 3 years ago

Thank you for the explanation. How the evaluation works is clear to me now, so I am closing this issue.