tseemann / samclip

Filter SAM file for soft and hard clipped alignments
GNU General Public License v3.0
46 stars 10 forks source link

Clipping issue at contig ends #10

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

Hi Torsten,

I have been using your samclip tool to filter some of my bamfiles and its been incredible useful and to my knowledge the only tool that can do something like this. I am mapping my reads to short sequence capture contigs and thus softclipping the reads is essential, but having that softclipping in the middle of the reads can give bad alignments.

I noticed that samclip allows the softclipping at the end of a contig, regardless on the "side" of the softclipping. E.g. see the attached screenshot. Is it possible to allow it only to keep clipped reads on the "outside" part of the read?

And is it possible to make it work on both the start and end of the reference? For me it appears to only work when the reads are clipped at the end of the reference.

I took a look at the script but im afraid my perl knowledge is not enough to be able update the script accordingly :). If you have any tips on how I can get samclip to work for both those situations as well that would be great.

Thank you!

Kevin

image

tseemann commented 4 years ago

So it seems to keep clipped reads that overhang the END of a contig but discard those overhanging the START of a contig

I think i have a bug where i check for START==0 but actually SAM uses 1-based coordinates!

# this is discarded (incorrectly)  5' end
[samclip] CHROM=contig02:1-1610 POS=1..250 CIGAR=14S236M HL=0 SL=14 SR=0 HR=0 max=5)

# this is kept (correctly)   3' end
[samclip] CHROM=contig02:1-1610 POS=1424..1673 CIGAR=187M63S HL=0 SL=0 SR=63 HR=0 max=5)
tseemann commented 4 years ago

I have released 0.3.0

kvpmulder commented 4 years ago

Hi Torsten,

The first issue can be solved by changing line 117 to this

unless ($start == 1 && $SR < 1 or $end >= $contiglen && $SL < 1) {

This will only allow clipped reads at the reference ends that have clippings at the "correct" side. I guess that 1 can also be changed to $max to be more in line with the rest of the script.

Thanks,

Kevin

tseemann commented 4 years ago

the double-negative logic i am using it doing my head in :)

if start==1 or end=END (the aligned section, not counting clips, always in fwd direction) then i keep the alignment no matter what

if it's a normal in the middle of a contig (not abutting with a contig end) then i remove it if it has any clipping beyond max.

OK. i see what you are saying now. writing it out helped. If it is start=1 but still clipped on the right, then we still want to get rid of it, but if it was on the left we don't. and if end=END and its clipped on the left, we dont want it, but clip on the right is ok.

i think i need to use $max to maintain consistencty. but i also have SR and HR i need to contend with too. cue thinking music :musical_note:

tseemann commented 4 years ago

FYI - @kvpmulder if i ever write this up as a paper, you will be a co-author now.

tseemann commented 4 years ago

@kvpmulder what about this revised logic flow?

    my $L = $HL + $SL;
    my $R = $HR + $SR;
    $L = 0 if $start <= 1+$L;
    $R = 0 if $end >= $contiglen-$R;
    if ($L > $max or $R > $max) {
      $removed++;
      next;
    }
kvpmulder commented 4 years ago

Hi Torsten,

I checked it on my data and this produces the same correct results and indeed does not have the double negative in the code. I am personally not opposed to the double negative, but I think your solution is easier to follow :).

Kevin

tseemann commented 4 years ago

@kvpmulder really appreciate your input here! I will release 0.4