mkirsche / Jasmine

Jasmine: SV Merging Across Samples
MIT License
175 stars 16 forks source link

SV not correctly merged #47

Open wjwei-handsome opened 1 year ago

wjwei-handsome commented 1 year ago

Hello author: jasmine is a great software, but I encountered some problems: I have SVs of 33 samples, and I have merged the SVs of these samples using the following parameters:

jasmine file_list=test.vcflist out_file=out.vcf max_dist=200 kd_tree_norm=2 min_seq_id=0.9 min_support=1 max_dup_length=100k min_overlap=0.5 --output_genotypes --nonlinear_dist

But I found these two strange INS in the results.

image

In both INSs, the insertion sequences of ALT is very similar (whether observed or aligned by NW-align). However, it is confusing that the two SVS are not merged into one.

wjwei-handsome commented 1 year ago

I note that, by default, jasmine uses jaccard similarity to evaluate the similarity between two sequences, from the [code repository] (https://github.com/mkirsche/Jasmine/blob/7da94a29934a44d6d8c45d5d97e84fa53b1e5918/src/Variant.java#L113-L120).

I simulated two sequences of 50bp length, there are 5 SNPs between them:

CACCCTCATGCATGCAACGAGTATGCATCGATGCAGCGAGCAGCATCT
CACCCGCATGCATGCCACGAGTATTCATCGATGAAGCGAGCATCATCT

I use python to rewrite the jaccar similarity evaluation method and the kmerCount method reference to the code repository.

def countKmers(sequence: str, k: int) -> dict:
    """
    Gets the frequency of each k-mer in a string, skipping over non-base characters
    """
    KmerCount = {}

    baseCount = 0
    kmer = 0
    for i in range(len(sequence)):
        c = sequence[i]

        base = -1
        if c == 'A' or c == 'a':
            base = 0
        elif c == 'C' or c == 'c':
            base = 1
        elif c == 'G' or c == 'g':
            base = 2
        elif c == 'T' or c == 't':
            base = 3

        if base != -1:
            allButTwoHighest = ((1 << (2 * (k - 1))) - 1) & kmer
            kmer = (allButTwoHighest << 2) | base
            baseCount += 1

            if baseCount >= k:
                if kmer in KmerCount:
                    KmerCount[kmer] += 1
                else:
                    KmerCount[kmer] = 1
    return KmerCount

def jaccardSimilarity(seq1, seq2, k):
    sKmerFreq = countKmers(seq1, k)
    tKmerFreq = countKmers(seq2, k)

    intersection = 0
    union = 0

    for sKmer,sFreq in sKmerFreq.items():
        tFreq = tKmerFreq.get(sKmer, 0)
        intersection += min(sFreq, tFreq)
        union += max(sFreq, tFreq)

    for tKmer,tFreq in tKmerFreq.items():
        if tKmer not in sKmerFreq:
            union += tFreq

    return 1.0 * intersection / union

The jaccard similarity of the two sequences was calculated to 0.778.

I also used python to implement the edit distance similarity calculation method in the code repository.

def editDistanceSimilarity(seq1, seq2):
    m = len(seq1)
    n = len(seq2)
    dp = [[0 for x in range(n + 1)] for x in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j    # Min. operations = j
            elif j == 0:
                dp[i][j] = i    # Min. operations = i
            elif seq1[i-1] == seq2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i][j-1],        # Insert
                                   dp[i-1][j],        # Remove
                                   dp[i-1][j-1])    # Replace
    return 1.0 * (m + n - dp[m][n]) / (m + n)

and the similarity of the edit distance between the two is 0.994. So, Is it more accurate to use edit distance similarity to estimate sequence similarity for DNA sequences ? Although its algorithm complexity is much higher than jaccard, there are obvious disadvantages in speed and performance.

And as I write this comment, Jasmine reported java.lang.OutOfMemoryError. 😵‍💫