scikit-adaptation / skada

Domain adaptation toolbox compatible with scikit-learn and pytorch
https://scikit-adaptation.github.io/
BSD 3-Clause "New" or "Revised" License

[MRG] PredictionEntropyScorer output negative scores #63

Closed YanisLalou closed 7 months ago

YanisLalou commented 8 months ago

Issue: PredictionEntropyScorer outputs negative scores.

Solution: in PredictionEntropyScorer we compute the equation from page 3 of the paper (https://arxiv.org/pdf/1711.10288.pdf):

$$E(X_T) = - \sum_{x_t \in T} \langle f(x_t; \theta), \log f(x_t; \theta) \rangle$$

In the code we forgot to use the greater_is_better flag, and we used a double minus sign in the formula. This PR also adds a test case checking that all computed scores are > 0.
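For reference, a minimal sketch of the corrected computation; the helper name and variables here are illustrative, not the actual skada API:

```python
import numpy as np

def prediction_entropy(proba, eps=1e-12):
    # proba: array of shape (n_samples, n_classes), rows summing to 1.
    # Per-sample entropy -<f(x), log f(x)>, non-negative since proba <= 1.
    proba = np.clip(proba, eps, 1.0)
    return -np.sum(proba * np.log(proba), axis=1)

proba = np.array([[0.9, 0.1], [0.5, 0.5]])
assert np.all(prediction_entropy(proba) > 0)  # the property the new test checks
```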

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (c50867e) 88.83% compared to head (2ed1c2c) 88.95%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
+ Coverage   88.83%   88.95%   +0.12%
==========================================
  Files          41       41
  Lines        2678     2708      +30
==========================================
+ Hits         2379     2409      +30
  Misses        299      299
```


YanisLalou commented 8 months ago

Actually there was another mistake. We now directly compute the mean of the per-sample entropies. However, something is still bothering me: each term of the output is of the form $-x \log(x)$ with $x$ a probability, so the score lies in $[0, -\frac{1}{e}\log\frac{1}{e}] = [0, \frac{1}{e}]$ and not in $[0, 1]$ as we could expect for a score.
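A quick numerical check of that per-term bound (purely illustrative):

```python
import numpy as np

x = np.linspace(1e-9, 1.0, 100_000)
term = -x * np.log(x)                   # one -x*log(x) term of the entropy
print(term.max())                       # ~0.3679
print(-(1 / np.e) * np.log(1 / np.e))   # = 1/e, the maximum, attained at x = 1/e
```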

tgnassou commented 8 months ago

I reread the paper. For the entropy scorer, we want to minimize the entropy, so greater_is_better should equal False here. And the entropy is not a score, which is why it is not bounded in [0, 1].
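For context on the scikit-learn convention (standard sklearn behavior, not skada code): make_scorer with greater_is_better=False negates the metric so that "higher is better" holds uniformly during model selection, which is exactly why loss-like scorers such as neg_mean_squared_error report negative values:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# With greater_is_better=False, make_scorer negates the metric so that
# "higher is better" holds during model selection; this is why loss-like
# sklearn scorers (neg_mean_squared_error, ...) show negative numbers.
scorer = make_scorer(mean_squared_error, greater_is_better=False)

X, y = np.zeros((4, 1)), np.array([0.0, 1.0, 2.0, 3.0])
model = DummyRegressor().fit(X, y)   # predicts the mean, 1.5
print(mean_squared_error(y, model.predict(X)))  # 1.25, a positive loss
print(scorer(model, X, y))                      # -1.25, same value negated
```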

tgnassou commented 8 months ago

Maybe we can change the name

YanisLalou commented 8 months ago

Yes, my bad, you're right: greater_is_better equals False here. So yes, we should change the name, OR we could scale the entropy to be between [0, 1] and call it a score (we kind of already do that by computing a mean instead of a sum). Tbh I prefer the latter, since it would be easier to plot the results next to other scores bounded between [0, 1] like accuracy, recall, precision...
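One possible way to do that scaling, as a sketch only (not what this PR implements): divide each sample's entropy by its maximum value, $\log K$ for $K$ classes, so the mean lands in $[0, 1]$:

```python
import numpy as np

def normalized_prediction_entropy(proba, eps=1e-12):
    # Per-sample entropy divided by its maximum, log(K) for K classes,
    # so every value (and hence the mean) lies in [0, 1].
    proba = np.clip(proba, eps, 1.0)
    entropy = -np.sum(proba * np.log(proba), axis=1)
    return entropy / np.log(proba.shape[1])

proba = np.array([[0.25, 0.25, 0.25, 0.25],   # uniform    -> 1.0
                  [0.97, 0.01, 0.01, 0.01]])  # confident  -> ~0.12
print(normalized_prediction_entropy(proba).mean())
```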

tgnassou commented 8 months ago

I think the best thing is to use a sum instead of the mean and implement what is done in the paper. I think the mean was a mistake. We can plot it like a loss.

YanisLalou commented 8 months ago

Not a fan of removing the mean. If we do that, the output will be proportional to the number of samples, so it will be hard to compare the effectiveness of methods across different datasets. (Plus, if we're treating this class as a loss: in PyTorch, losses compute mean(L) instead of sum(L) by default.)

YanisLalou commented 8 months ago

Would love to have the thoughts of @kachayev @rflamary

tgnassou commented 8 months ago

Yeah, maybe a reduction parameter, with options None or mean, to be able to match the paper if wanted.

rflamary commented 8 months ago

Good idea to handle both (state in the doc that one of them corresponds to the paper).
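A sketch of what such a parameter could look like (names, options, and defaults here are illustrative, not the final skada API):

```python
import numpy as np

def prediction_entropy_score(proba, reduction="mean", eps=1e-12):
    proba = np.clip(proba, eps, 1.0)
    entropy = -np.sum(proba * np.log(proba), axis=1)  # per-sample, >= 0
    if reduction == "sum":    # the formula exactly as written in the paper
        return entropy.sum()
    if reduction == "mean":   # comparable across datasets of different sizes
        return entropy.mean()
    if reduction is None:     # no reduction: one value per sample
        return entropy
    raise ValueError(f"Unknown reduction: {reduction!r}")
```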

kachayev commented 8 months ago

That's an interesting question...

Usually, I'm a 'theory absolutist' when it comes to questions like this one. If we call it 'entropy', it should perform a 'sum'; otherwise it's no longer entropy.

In PyTorch, the 'reduction' parameter for losses is typically (as it should be) a 'batch-related' setting: it only tells the system what to do with the fact that the loss is typically a per-item thing, which means that for a batch we have an array of those, while gradient descent requires a scalar. Thus it provides 'sum' or 'mean' options for how to collapse the array of values into a scalar. This doesn't change the nature of the loss itself, just the mini-batch GD. Also, from the gradient perspective, it's just a matter of a multiplicative scalar.
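For instance, with a standard PyTorch loss (illustrative only):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 3)            # batch of 8 samples, 3 classes
target = torch.randint(0, 3, (8,))

# Same per-item loss; 'reduction' only decides how the batch collapses.
per_item = F.cross_entropy(logits, target, reduction="none")  # shape (8,)
as_sum = F.cross_entropy(logits, target, reduction="sum")     # scalar
as_mean = F.cross_entropy(logits, target, reduction="mean")   # scalar (default)

assert torch.isclose(as_sum, per_item.sum())
assert torch.isclose(as_mean, per_item.mean())
```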

You are right that with a larger number of samples the max entropy of the system increases. As it should... The hypothesis that 'entropy divided by the number of samples' makes for a better comparison between different methods is interesting, though by no means obvious or intuitive.

kachayev commented 8 months ago

Also,

> $\in [0, 1]$ as we could expect for a score

Where is it stated that sklearn scores have to be in a particular range? I'm not sure I've seen this requirement.

YanisLalou commented 8 months ago

> Also,
>
> > $\in [0, 1]$ as we could expect for a score
>
> Where is it stated that sklearn scores have to be in a particular range? I'm not sure I've seen this requirement.

It's not a requirement, but to the best of my recollection, every class whose name ends with 'score' or 'scorer' is upper-bounded by 1; otherwise it is referred to as a 'loss', 'error', and so on. I'm not 100% sure of that, but I had a quick look at https://scikit-learn.org/stable/modules/model_evaluation.html and it seems consistent with what I'm saying.