theislab / scib

Benchmarking analysis of data integration tools
MIT License
294 stars 63 forks source link

Overflow in adjusted rank index #238

Closed alitinet closed 3 years ago

alitinet commented 3 years ago

Hi all,

I encountered the issue of integer overflow several times now when calculating ARI for big datasets (>150k cells). The error message is

/home/icb/litinetskaya/miniconda3/envs/multigrate/lib/python3.7/site-packages/sklearn/metrics/cluster/_supervised.py:389: RuntimeWarning: overflow encountered in long_scalars
  return 2. * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) +
/home/icb/litinetskaya/miniconda3/envs/multigrate/lib/python3.7/site-packages/sklearn/metrics/cluster/_supervised.py:390: RuntimeWarning: overflow encountered in long_scalars
  (tp + fp) * (fp + tn))

and then the resulting score is a negative number, sometimes a pretty big negative number.

Casting the output of the confusion matrix to floats solves the issue, i.e. instead of returning https://github.com/theislab/scib/blob/94b3103d445e7f7937d99c4ff2b448a9e47f94ff/scIB/metrics.py#L309 one might do the following instead (adjusted from ARI source code) :

from sklearn.metrics import pair_confusion_matrix

(tn, fp), (fn, tp) = pair_confusion_matrix(group1, group2)
(tn, fp), (fn, tp) = (float(tn), float(fp)), (float(fn), float(tp))
if fn == 0 and fp == 0:
    return 1.0

return 2. * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) +
                                       (tp + fp) * (fp + tn))
Hrovatin commented 3 years ago

I get the same problem of negative values.

LuckyMD commented 3 years ago

I love how you're fixing sklearn code @alitinet! Do you want to submit a PR here or to sklearn? It seems like this may be better fixed upstream, no?

LuckyMD commented 3 years ago

In general the ARI can be negative (not something we initially considered in the rescaling though). This happens if the clustering is worse than via random permutations. Typically that negative number should be very near 0 though, as we're not specifically looking for alternative clusterings here... unless the integration method produces an output that has switched batch and cell type or sth like that.

mumichae commented 3 years ago

Negative values are possible with ARI per definition. However, together with the runtime warning, the resulting ARI is likely wrong. I was getting ARI larger than 1 (which should never happen).

Someone has already reported this issue to scikit-learn a couple of months ago and it seems to have been fixed but it's not in any available package version from what I could see.

I reimplemented the metric (didn't see your fix when I did) and get proper values now. Still need to decide how to integrate it though, as I'm in the middle of refactoring the metrics module.

mumichae commented 3 years ago

This issue should be fixed with #263