Open samuelbroscheit opened 5 years ago
Clarification: for probabilistic models, add computation of cross entropy loss to evaluation
On second thought, I'm not sure how useful this is (you'd win by always predicting 1). Samuel, please clarify or close.
How would you win under BCE for labels [0,0,1] with prediction [1,1,1] vs. prediction [0,0,1]?
I was referring to the cross entropy of the test triples (all label 1) and their predicted confidence. Anyway, it's not clear to me what "loss" should be added for the evaluation.
(Recall: the evaluation is currently triple based, not sp/po based.)
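For concreteness, a tiny sketch of both points (assuming PyTorch, which this repo uses; the tensors are illustrative, not from the codebase):

import torch
import torch.nn.functional as F

labels = torch.tensor([0., 0., 1.])      # mixed labels, as in the example above
always_one = torch.tensor([1., 1., 1.])
correct = torch.tensor([0., 0., 1.])

# With negatives present, always predicting 1 is heavily penalized
# (PyTorch clamps log(0), so the loss is finite but large).
print(F.binary_cross_entropy(always_one, labels))  # ~66.67
print(F.binary_cross_entropy(correct, labels))     # 0.0

# On test triples alone (all labels 1), always predicting 1 trivially
# gives loss 0 -- the degenerate case described above.
all_pos = torch.tensor([1., 1., 1.])
print(F.binary_cross_entropy(always_one, all_pos))  # 0.0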
I think you usually want the same loss that is used during training.
Yes, but during evaluation we batch triples, not sp/po pairs. If this is needed, we need a different evaluation job (and another one for negative sampling, I guess).
Or just get the collate func from the Trainer and join it with the eval collate func? Shouldn't be that difficult.
def get_collate_func(eval_collate_func, trainer_collate_func):
    # Join the eval collate with the trainer's collate so that a single
    # batch yields both results (eval_collate_func stands in for the
    # existing eval collate logic, the "doing stuff" part).
    def my_collate_func(batch):
        my_result = eval_collate_func(batch)
        trainer_result = trainer_collate_func(batch)
        return my_result, trainer_result
    return my_collate_func
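A hypothetical usage of the sketch above (the collate functions and toy triples are stand-ins, not the actual kge API):

import torch
from torch.utils.data import DataLoader

def eval_collate(batch):                  # stand-in for the eval collate
    return torch.tensor(batch)

def trainer_collate(batch):               # stand-in for the trainer's collate
    return torch.tensor(batch), None      # e.g., triples plus labels

triples = [[0, 1, 2], [3, 1, 4]]          # toy (s, p, o) ids
loader = DataLoader(triples, batch_size=2,
                    collate_fn=get_collate_func(eval_collate, trainer_collate))
eval_batch, (train_batch, labels) = next(iter(loader))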
I always implement validation loss computation automatically, but I am not going to fight for it if you are not convinced.
It's not so easy because (i) the evaluation code has to distinguish two types of batches (triple based vs. sp/po based) and (ii) we would be scoring the same thing multiple times.
The best approach may be to rewrite the evaluation to not use triples but the sp/po approach (i.e., the 1-to-N collate as is), and then compute MRR/HITS on top of this. Besides enabling the computation of the loss, it should also be faster than our current approach, since it computes fewer scores: e.g., if (s,p,o_1) and (s,p,o_2) both occur in the validation data, we currently compute the sp? scores twice.
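A minimal sketch of that grouping step (illustrative code, not the kge implementation): collect the unique sp- and po-queries so each one is scored only once; the grouped answers double as the 1-to-N labels needed for a loss.

from collections import defaultdict

def group_queries(triples):
    # Map each unique query to its set of true answers, so (s, p, ?)
    # is scored once even if several (s, p, o_i) triples exist.
    sp, po = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        sp[(s, p)].add(o)   # answers for the query (s, p, ?)
        po[(p, o)].add(s)   # answers for the query (?, p, o)
    return sp, po

# Toy example: two validation triples sharing the same sp-query.
sp, po = group_queries([(0, 1, 2), (0, 1, 3)])
assert sp[(0, 1)] == {2, 3}   # scored once, with two positive labels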
To me this is similar to including test data during model selection: people do it, but perhaps they shouldn't. While there is nothing wrong with selecting models based on the standard metrics, perhaps selecting based on loss is better. I'd personally like to see the difference, so if it is not too much of a change, we could keep the option. Then the only remaining decision is the default behavior.
No, it's different: it's perfectly fine to use the loss on validation data. The problem is that we cannot easily compute it right now without changing the way we implemented validation. I was suggesting to change our implementation to (1) support loss computation and (2) make it faster.
Validation is currently by far the slowest part of training (when multiple jobs run in parallel so that the GPU is saturated, most of the time is spent in validation). Addressing this issue should also help here. I'll add a priority tag because of this.
https://github.com/rufex2001/kge/blob/9d83e43f5085e4a0d30d70536e4c1772389907cd/kge/job/entity_ranking.py#L72