Open samuelbroscheit opened 5 years ago
Clarification: for probabilistic models, add computation of cross entropy loss to evaluation
On second thought, I'm not sure how useful this is (you'd win by always predicting 1). Samuel, please clarify or close.
How would you win under BCE for labels [0,0,1] with prediction [1,1,1] vs. prediction [0,0,1]?
I was referring to the cross entropy of the test triples (all label 1) and their predicted confidence. Anyway, it's not clear to me what "loss" should be added for the evaluation.
(Recall: the evaluation is currently triple based, not sp/po based.)
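For concreteness, a tiny sketch of both points (assuming PyTorch, which this repo uses; the tensors are illustrative, not from the codebase):

import torch
import torch.nn.functional as F

labels = torch.tensor([0., 0., 1.])      # mixed labels, as in the example above
always_one = torch.tensor([1., 1., 1.])
correct = torch.tensor([0., 0., 1.])

# With negatives present, always predicting 1 is heavily penalized
# (PyTorch clamps log(0), so the loss is finite but large).
print(F.binary_cross_entropy(always_one, labels))  # ~66.67
print(F.binary_cross_entropy(correct, labels))     # 0.0

# On test triples alone (all labels 1), always predicting 1 trivially
# gives loss 0 -- the degenerate case described above.
all_pos = torch.tensor([1., 1., 1.])
print(F.binary_cross_entropy(always_one, all_pos))  # 0.0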
I think you usually want the same loss that is used during training.
Yes, but during evaluation we batch triples, not sp/po pairs. If this is needed, we need a different evaluation job (and another one for negative sampling, I guess).
Or just get the collate func from the Trainer and join it with the eval collate func? Shouldn't be that difficult.
def get_collate_func(eval_collate_func, trainer_collate_func):
    # Join the eval collate with the trainer's collate so that a single
    # batch yields both results (eval_collate_func stands in for the
    # existing eval collate logic, the "doing stuff" part).
    def my_collate_func(batch):
        my_result = eval_collate_func(batch)
        trainer_result = trainer_collate_func(batch)
        return my_result, trainer_result
    return my_collate_func
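A hypothetical usage of the sketch above (the collate functions and toy triples are stand-ins, not the actual kge API):

import torch
from torch.utils.data import DataLoader

def eval_collate(batch):                  # stand-in for the eval collate
    return torch.tensor(batch)

def trainer_collate(batch):               # stand-in for the trainer's collate
    return torch.tensor(batch), None      # e.g., triples plus labels

triples = [[0, 1, 2], [3, 1, 4]]          # toy (s, p, o) ids
loader = DataLoader(triples, batch_size=2,
                    collate_fn=get_collate_func(eval_collate, trainer_collate))
eval_batch, (train_batch, labels) = next(iter(loader))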
I always implement validation loss computation automatically, but I am not going to fight for it if you are not convinced.
It's not so easy because (i) the evaluation code has to distinguish two types of batches (triple based vs. sp/po based) and (ii) we would be scoring the same thing multiple times.
The best approach may be to rewrite the evaluation to not use triples but the sp/po approach (i.e., the 1-to-N collate as is), and then compute MRR/HITS on top of this. Besides enabling the computation of the loss, it should also be faster than our current approach, since it computes fewer scores: e.g., if (s,p,o_1) and (s,p,o_2) both occur in the validation data, we currently compute the sp? scores twice.
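A minimal sketch of that grouping step (illustrative code, not the kge implementation): collect the unique sp- and po-queries so each one is scored only once; the grouped answers double as the 1-to-N labels needed for a loss.

from collections import defaultdict

def group_queries(triples):
    # Map each unique query to its set of true answers, so (s, p, ?)
    # is scored once even if several (s, p, o_i) triples exist.
    sp, po = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        sp[(s, p)].add(o)   # answers for the query (s, p, ?)
        po[(p, o)].add(s)   # answers for the query (?, p, o)
    return sp, po

# Toy example: two validation triples sharing the same sp-query.
sp, po = group_queries([(0, 1, 2), (0, 1, 3)])
assert sp[(0, 1)] == {2, 3}   # scored once, with two positive labels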
To me this is similar to including test data during model selection: people do it, but perhaps they shouldn't. While there is nothing wrong with selecting models based on the standard metrics, perhaps selecting based on loss is better. I'd personally like to see the difference, so if it is not too much of a change, we could keep the option. Then the only remaining decision is the default behavior.
No, it's different: it's perfectly fine to use the loss on validation data. The problem is that we cannot easily compute it right now without changing the way we implemented validation. I was suggesting to change our implementation to (1) support loss computation and (2) make it faster.
Validation is currently by far the slowest part of training (when multiple jobs run in parallel so that the GPU is saturated, most of the time is spent in validation). Addressing this issue should also help here. I'll add a priority tag because of this.
https://github.com/rufex2001/kge/blob/9d83e43f5085e4a0d30d70536e4c1772389907cd/kge/job/entity_ranking.py#L72