knncolle / BiocNeighbors

Clone of the Bioconductor repository for the BiocNeighbors package.
https://bioconductor.org/packages/devel/bioc/html/BiocNeighbors.html
6 stars 10 forks source link

Valid Distance Keywrods Not Specified #22

Open DarioS opened 10 months ago

DarioS commented 10 months ago

distance: A string specifying the distance metric to use.

O.K. but what are permitted values? Perhaps KmknnParam should have distance = c('A', 'B', 'C') and inside use match.arg. I am keen to try Louvain clustering with a Pearson correlation distance, for instance.

LTLA commented 10 months ago

Probably should say "See ?buildKmknn for details", as it just forwards its arguments to that. Happy to take a PR for the doc fix (might consider doing it for all constructors). I couldn't be bothered repeating the whole vector of options in the KmknnParam constructor, then I would have to update things in two places if I ever added a new distance metric.

It seems that all methods are restrained to distance = c("Euclidean", "Manhattan", "Cosine") (see ?bndistance). I suppose we could also have method-specific distances, for those that support them, e.g., Hamming for Annoy. I haven't come across a demand for that yet, though again, PRs are welcome if someone can be bothered.

Note that my "Cosine" is really just L2-normalized euclidean because the typical "1 - cosine similarity" isn't a real distance metric. I mention this because, as you may already know, "1 - correlation" isn't a distance metric either. Instead I typically center and scaled the input values for each sample, and then use the typical Euclidean distance, which turns out to be a monotonic function of the Pearson correlation (sqrt((1 - correlation)/2), or something like that, see here).