Add two distance metrics, three-way comparison and bootstrapping

wxicu commented 1 month ago

PR Checklist

[x] Referenced issue is linked #603
[x] If you've fixed a bug or added code that should be tested, add tests!
[ ] Documentation in docs is updated

Description of changes Add distance metrics: mean_var_distn and mahalanobis

Technical details

Additional context

There is a question whether to store/return/accept the expensive inverse of the covariance matrix for mahalanobis metrix in the draft implementations on the hackathon branch. Do you think it make sense to save the invert in a multidimensional array? I didnt come up with a more efficient solution.

wxicu commented 3 weeks ago

It takes about 5 seconds to calculate KDE of a group pair for the metric mean_var_disn. There are 32 groups in the test dataset so I subset to only 5 of them for speeding up.

wxicu commented 3 weeks ago

I am honestly not sure whether it makes to use the aggregation functions for all of them^^

I think I might have gotten it wrong. It says "pseudobulk" but I think that how you're using them in the distance now is something different? Do I get it wrong?

To be honest I copied "aggregation part" from the hackathon branch where aggregation function accepts np.mean/variance/median etc. From my understanding: currently the majority of the distance metrics calculate the distance between 2 pseudobulk vectors. For those metrics, the aggregation function just provides different modes to aggregate the counts then just mean expression. For metrics which not really calculate the distance between 2 pseudobulk, , e.g. mean_var_distribution, we dont use any aggregation funtcion there.

However the problem is that I am not sure whether it makes sense to generate pseudobulk vector with variance. So for now I would like to keep only mean, median and add sum for aggregation functions in distance metrics. For mediod, I dont think it is really an aggregation method, but a clustering method? So i also removed it. What do you think?

Zethson commented 3 weeks ago

@wxicu all of this sounds reasonable to me.

wxicu commented 2 weeks ago

But.....I didn't come up with more descriptive names for the parameters

scverse / pertpy

Add two distance metrics, three-way comparison and bootstrapping #608