lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.45k stars 808 forks source link

Feature request: Sequence similarity metric #419

Open prihoda opened 4 years ago

prihoda commented 4 years ago

Hi all, I was thinking that it would be useful to provide a sequence similarity metric to enable comparing protein or DNA sequences. The metric could compute all-by-all similarities or even employ some more sophisticated techniques to produce a sparse similarity matrix or the neighbourhood graph directly.

Is this something you would consider, or do you think this is out of scope?

The current way of solving it is to use some other tool to generate a distance matrix or a neighbourhood graph matrix and then embed it using UMAP precomputed or exact metric. But this is non-trivial and there are no relevant tutorials that would demonstrate this (at least not that I know of).

lmcinnes commented 4 years ago

If you provide some pseudocode of how to compute the metric I could probably implement it.

prihoda commented 4 years ago

Sounds great. First we need to think about the input sequences, which are strings of variable length. So to keep it aligned with the other metrics, do you think it would make sense to expect the sequences as a two-dimensional matrix, but with just a single column of strings (numpy object dtype)?

The first metric that comes to mind is the Levenshtein distance metric, which counts the number of insertions, deletions and substitutions needed to turn the first sequence into the second sequence. This produces an integer score that increases with the length of the sequence, which is not desirable since longer sequence pairs are favored. This can be solved by using a normalized metric (length_of_both_strings - Levenshtein_distance) / length_of_both_strings. Length of both strings is the maximum distance since you can always delete all characters of the first sequence and insert all characters of the second sequence. This is already implemented by the python-Levenshtein module: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html which is written in C. Would it be possible to use it as an extra dependency (e.g. using umap-learn[seq]?

The standard algorithm for biological sequences is Needleman–Wunsch which finds the best global alignment of two sequences: https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm. Different insertions, deletions and substitution pairs (=mutations) are penalized differently based on biological information in the form of a substitution matrix. There are different commonly used substitution matrices for DNA and proteins, one of each could be chosen to start with. The algorithm produces an integer similarity score which corresponds to the best alignment. But again, this score increases with the length of the sequence and it can even be negative for dissimilar sequences. So there are no straightforward normalization methods (that I know of). So the two alternative solutions here are to use the fraction of identical pairs in the alignment (% identity) or the fraction of pairs with a positive substitution score (% similarity). So these could be called needlemanwunsch_dna_identity, needlemanwunsch_dna_similarity, needlemanwunsch_protein_identity, needlemanwunsch_protein_similarity.

My suggestion would be to start with the levenshtein metric, since it doesn't take any extra arguments and it can be applied to different domains (language, biology, chemistry). Then the NW metrics could be added later.

What do you think? Let me know how I can help.

prihoda commented 4 years ago

Also it's worth to note that both of these algorithms have length1*length2 complexity, which can become very demanding when filling in the all-by-all matrix. But it can still be very useful for small datasets. There are also methods which create a sparse distance matrix (with all distances under given threshold) such as USEARCH, but those are much more complex.

prihoda commented 4 years ago

So I was thinking about USEARCH and I gave a shot at creating a package that connects USEARCH and UMAP: https://github.com/prihoda/usum

It enables creating a sequence similarity embedding using a fast method that creates a sparse distance matrix . The USEARCH tool needs to be downloaded manually, so it's not perfect. Therefore I still think a sequence similarity metric is relevant for UMAP as well.

BTW, while I have you on the line, can you check out this issue on the USEARCH tool? https://github.com/prihoda/usum/issues/2

gokceneraslan commented 3 years ago

There is a numba jitted levenshtein here https://gist.github.com/tuxedocat/fb024dfa36648797084d#file-levd-py-L26 by @tuxedocat. It can be added to the distances in umap/pynndescent to calculate the distances between strings of unequal length, but the input of the UMAP will be a list of strings which is not possible now due to the check_array() calls which checks if the input is a 2D array.

gokceneraslan commented 3 years ago

Regarding https://github.com/prihoda/usum/issues/2, @lmcinnes is there an easy way to use the output of an arbitrary kNN algorithm in the form of (knn_indices, knn_distances) or a sparse matrix (where only the nearest neighbors have distances) in UMAP as a "precomputed" distance?

lmcinnes commented 3 years ago

I don't believe there is a straightforward approach that would do the job, no. Sorry. I think it wouldn't be too hard to add such a possibility as extra input parameters to the fit call, or special handling of sparse precomputed distance matrices.

To get around the 2d array check you can rewrite the distance to check the string distance between the first element of each array and have a 2D array where each row is a length 1 array with a string. I'm not sure if that won't fall afoul of other processing elsewhere however (there may be a desire to convert things to floats buried in the code, for instance). This is something I would eventually like to fix -- allowing more interesting distances such as string based distances, normalized compression distance, etc. and thus allowing a lot more variety of input, but there are a lot of fiddly details to make everything play nice in what is now a rather large codebase. Hopefully this might be a 0.6 release feature.