matsengrp / sumrep

Summary statistics for repertoires
16 stars 6 forks source link

Levenstein distance calculation #2

Closed matsen closed 5 years ago

matsen commented 7 years ago

The goal here is to get the distribution of Levenstein distances between pairs of CDR3s.

This is the distribution obtained by uniformly sampling two sequences from a de-replicated repertoire and calculating their L distance. If we wanted a "weighted" version of this distribution, we would be sampling from the repertoire before dereplication. Note that this is rather different than dereplicating the CDR3s and then calculating pairwise distances.

Notes from meeting

krdav commented 7 years ago

jellyfish is awesome!!! Super speedy. Needs unicode input so this is how I import is for hamming distance in python 2/3.

import jellyfish
try:
    def hamming_distance(s1, s2):
        if s1 == s2:
            return 0
        else:
            return jellyfish.hamming_distance(s1, s2)
    assert(hamming_distance('ABC', 'ABCD') == 1)
except:
    def hamming_distance(s1, s2):
        if s1 == s2:
            return 0
        else:
            return jellyfish.hamming_distance(unicode(s1), unicode(s2))
    assert(hamming_distance('ABC', 'ABCD') == 1)

Notice the quick return is string are identical. This is a small constant overhead, but it often turns out to be super useful and time saving if just a few sequences are identical.

williamdlees commented 7 years ago

In the past I've used the python-Levenshtein package. pip finds and installs this quite readily, but it's not well supported. Documentation is here: http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html. This morning I compared the performance with that of Jellyfish, using 1,000 CDR3 sequences I had to hand, with an average length of 19nt. Jellyfish was a little faster, taking about 0.9us per LD calculation compared to about 1us in python-levenshtein. So I would use Jellyfish from now on.

My modification to python-levenshtein to support a cutoff distance is here: https://github.com/williamdlees/python-Levenshtein. I wrote this to help with the clustering of sequences at small levenshtein distances, and it makes a really big difference in such a situation. It's a straightforward modification. If you intend to compare full V-region sequences, it might help.

All the best

William

On 03/08/2017 22:39, Erick Matsen wrote:

The goal here is to get the distribution of Levenstein distances between pairs of CDR3s.

This is the distribution obtained by uniformly sampling two sequences from a de-replicated repertoire and calculating their L distance. If we wanted a "weighted" version of this distribution, we would be sampling from the repertoire before dereplication. Note that this is rather different than dereplicating the CDR3s and then calculating pairwise distances.

Notes from meeting
  • @laserson https://github.com/laserson noted that Levenstein is in https://github.com/jamesturk/jellyfish
  • I think that @williamdlees https://github.com/williamdlees stated that he had an implementation that would terminate early if distances were above a given threshold.
  • @mikessh https://github.com/mikessh stated that he thought that the most interesting information would come from looking at CDR3s that had some threshold level of similarity (say, at least 80%). Otherwise this ends up being another version of overall diversity comparison. This is a good point, but we might as well see how it looks on some data sets.
  • @mikessh https://github.com/mikessh has also linked to https://github.com/rokroskar/imnet . This is nice in that it runs on Spark, but see the next point
  • We certainly don't need to compute every pairwise Levenstein distance to get their distribution. I'd think that sampling 10K distances would give us a perfectly sufficient picture of what the distribution looks like.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/matsengrp/sumrep/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AK4Ufb6pmuzR4iXNrxPRYiNuoxATkYmWks5sUj4bgaJpZM4OtBeo.

BrandenOlson commented 7 years ago

This is excellent, @williamdlees. Thanks!

BrandenOlson commented 5 years ago

I did a quick empirical analysis a while back and it seemed like using stringdist::stringdist in R worked as well if not better than calling the above from R. With the subsampling procedure implemented for pairwise distances, I think this issue is resolved (assuming the NN subsampling is a separate issue).