Closed wxicu closed 5 days ago
It takes about 5 seconds to calculate KDE of a group pair for the metric mean_var_disn
. There are 32 groups in the test dataset so I subset to only 5 of them for speeding up.
I am honestly not sure whether it makes to use the aggregation functions for all of them^^
I think I might have gotten it wrong. It says "pseudobulk" but I think that how you're using them in the distance now is something different? Do I get it wrong?
To be honest I copied "aggregation part" from the hackathon branch where aggregation function accepts np.mean/variance/median etc.
From my understanding: currently the majority of the distance metrics calculate the distance between 2 pseudobulk vectors. For those metrics, the aggregation function just provides different modes to aggregate the counts then just mean expression. For metrics which not really calculate the distance between 2 pseudobulk, , e.g. mean_var_distribution
, we dont use any aggregation funtcion there.
However the problem is that I am not sure whether it makes sense to generate pseudobulk vector with variance. So for now I would like to keep only mean, median and add sum for aggregation functions in distance metrics. For mediod, I dont think it is really an aggregation method, but a clustering method? So i also removed it. What do you think?
@wxicu all of this sounds reasonable to me.
But.....I didn't come up with more descriptive names for the parameters
PR Checklist
docs
is updatedDescription of changes Add distance metrics: mean_var_distn and mahalanobis
Technical details
Additional context
There is a question whether to store/return/accept the expensive inverse of the covariance matrix for mahalanobis metrix in the draft implementations on the hackathon branch. Do you think it make sense to save the invert in a multidimensional array? I didnt come up with a more efficient solution.