mcmero / SVclone

A computational method for inferring the cancer cell fraction of tumour structural variation from whole-genome sequencing data.
BSD 3-Clause "New" or "Revised" License
40 stars 10 forks source link

Not collecting all span evidence available #31

Closed surajraj99 closed 7 months ago

surajraj99 commented 9 months ago

Hi all,

I'm trying to run svclone on the output of an in-house variant calling algorithm which doesn't have the most precise breakends. So for example, in the following DUP, the breakends defined by the SV caller are at 19158010 (pos1) and 19167044 (pos2). When I run svclone on this I get that there is 1 spanning call. Even when I set the "wobble" threshold to 1000, I still get only 1 spanning call.

On the other hand, if I redefine the breakpoints to 19158400 (pos1) and 19167140 (pos2), which is a lot more precise, svclone gives me 14 spanning calls and thus 14 support.

Most of the variant calls from my caller are not precise, they have 500 - 1000 bp discrepancies on both ends. I would like svclone to take account of that. Was the "wobble" threshold in the config not the right parameter to change for this? If not, how should I account for this?

Image of variant below in IGV as well as sv input file for sv clone.

Screenshot 2024-01-16 at 6 23 22 PM

Screenshot 2024-01-16 at 6 31 14 PM

mcmero commented 9 months ago

The threshold (wobble) parameter was intended for small variation in soft-clip read boundaries. The issue with the spanning read count in this case is that your spanning reads will fail to point towards the break (as the breakpoint will be behind the respective read). It does make sense however that the spanning read count could take into account the threshold. I have added a commit to the dev branch (https://github.com/mcmero/SVclone/commit/53c89af206aeae44f2f4fb38ba07baff8e870ef9) that handles this. Can you test it out and see if it solves your issue?

surajraj99 commented 9 months ago

The issue still remains unfortunately. With the original breakends, I still get only 1 supporting call when I set the wobble threshold to 1000. Here are some of the output files too for reference.

test_svclone_svin.txt test_svclone_svinfo.txt

surajraj99 commented 9 months ago

There are two things I observed. I edited the code in count.py on my side temporarily to this:

`

def is_supporting_spanning_pair(read,mate,bp1,bp2,inserts,max_ins,threshold): pos1 = (bp1['start'] + bp1['end']) / 2 pos2 = (bp2['start'] + bp2['end']) / 2

#ensure this isn't just a regular old spanning pair
if read['chrom']==mate['chrom']:
    if read['ref_start']<mate['ref_start']:
        if mate['ref_start']-read['ref_end'] < max_ins: return False
    else:
        if read['ref_start']-mate['ref_end'] < max_ins: return False

#check read orientation
#spanning reads should always point towards the break
if not points_towards_break(read,pos1,threshold) or not points_towards_break(mate,pos2,threshold):
    return False

ins_dist1 = get_bp_dist(read,pos1)
ins_dist2 = get_bp_dist(mate,pos2)

return True # skipping the rest of this function

if is_supporting_split_read_lenient(read,pos1,threshold):
    if not is_below_sc_threshold(mate,threshold):
        #only allow one soft-clip
        return False
    if abs(ins_dist1)+abs(ins_dist2) < max_ins:
        return True
elif is_supporting_split_read_lenient(mate,pos2,threshold):
    if not is_below_sc_threshold(read,threshold):
        return False
    if abs(ins_dist1)+abs(ins_dist2) < max_ins:
        return True
else:
    if ins_dist1>=-threshold and ins_dist2>=-threshold and abs(ins_dist1)+abs(ins_dist2) < max_ins:
        return True

return False

def get_spanning_counts(reproc,rc,bp1,bp2,inserts,min_ins,max_ins,threshold): pos1 = (bp1['start'] + bp1['end']) / 2 pos2 = (bp2['start'] + bp2['end']) / 2

reproc = np.sort(reproc,axis=0,order=['query_name','ref_start'])
reproc = np.unique(reproc) #remove dups
anomalous = np.empty([0,len(dtypes.read_dtype)],dtype=dtypes.read_dtype)
span_bp1 = np.empty([0,len(dtypes.read_dtype)],dtype=dtypes.read_dtype)
span_bp2 = np.empty([0,len(dtypes.read_dtype)],dtype=dtypes.read_dtype)

counts_0 = 0
counts_1 = 0
counts_2 = 0
counts_3 = 0
for idx,x in enumerate(reproc):
    if idx+1 >= len(reproc):
        break
    if reproc[idx+1]['query_name'] != reproc[idx]['query_name']:
        #not paired
        continue
    mate = np.array(reproc[idx+1],copy=True)
    r1 = np.array(x,copy=True)
    r2 = np.array(mate,copy=True)
    #if read corresponds to bp2 and mate to bp1, switch their order
    if (bp1['chrom']!=bp2['chrom'] and r1['chrom']==bp2['chrom']) or \
        (pos1 > pos2 and bp1['chrom']==bp2['chrom']):
        r1 = mate
        r2 = np.array(x,copy=True)
    counts_0 += 1
    if is_supporting_spanning_pair(r1,r2,bp1,bp2,inserts,max_ins,threshold):
        counts_1 += 1
        if bp1['dir'] in ['+','-'] and bp2['dir'] in ['-','+']:
            counts_2 += 1
            if validate_spanning_orientation(bp1,bp2,r1,r2):
                counts_3 += 1
                span_bp1 = np.append(span_bp1,r1)
                span_bp2 = np.append(span_bp2,r2)
                rc['spanning'] = rc['spanning']+1
            else:
                anomalous = np.append(anomalous,r1)
                anomalous = np.append(anomalous,r2)
    else:
        anomalous = np.append(anomalous,r1)
        anomalous = np.append(anomalous,r2)
print(counts_0, counts_1,counts_2,counts_3)
return rc,span_bp1,span_bp2,anomalous

`

When I set the threshold to 6, the statement "print(counts_0, counts_1,counts_2,counts_3)" gives (139, 4, 4, 1).

When I set the threshold to 150, the statement gives (253, 11, 11, 1)

When I set the threshold to 1000, the statement gives (625, 14, 14, 1).

Does this give some insights on how I can change the code to fit our use?

mcmero commented 9 months ago

Thanks for the info. This would suggest that the validate_spanning_orientation function doesn't take into account the threshold. Would it be possible for you to share a bam file with the reads from this SV so that I can test this?

surajraj99 commented 8 months ago

Sorry for getting back late on this, the bam file I'm using is from seqc2 consortium, here: https://www.ncbi.nlm.nih.gov/sra/SRX4728475[accn

mcmero commented 7 months ago

Sorry for the late reply. I've had another look at this and have added some code to the e2b8566 dev commit that would likely handle spanning reads from "wobbly" breaks. However, I would recommend recalibrating your SV calls instead to be as accurate as possible. SVclone was designed to handle only small deviations in breakpoint locations, and large deviations will likely lead to inaccurate results or bugs.