sanger-pathogens / Bio-Tradis

A set of tools to analyse the output from TraDIS analyses
https://sanger-pathogens.github.io/Bio-Tradis/
Other
21 stars 29 forks source link

Possible bug handling reads with clipping when converting mapped reads to insertion plot #135

Open emgemg opened 5 months ago

emgemg commented 5 months ago

I recently installed biotradis using conda

Name Version Build Channel biotradis 1.4.5 hdfd78af_5 bioconda

I noticed that I had an issue with reads with clipping (which I think you've seen before) where reads with soft clipping at the 5' end take the first mapped nucleotide as the insertion site, resulting in an artificially inflated "unique insertions" count. I can see the fix in the Cigar.pm script,

Screenshot 2024-01-25 at 3 32 41 pm

but found that this fix wasn't in the conda-installed hdfd78af_5 build. I tried a few conda envs with alternate versions with no success, but possibly some user error on my part somewhere, so in any case I downloaded Bio-Tradis-master directly from github to test again.

This time with the correct Cigar.pm, I have a different error. When I view the plot file in artemis, it looks like all insertion sites have been artificially shifted by a consistent factor, while the bam file viewed in IGV is unchanged Screenshot 2024-01-25 at 3 40 17 pm I think (but I'm not familiar enough with perl to be sure I've understood the code correctly) the issue is I have reads mapping across the annotation origin, such that their cigar is something like "77H32M". Because these are the first reads encountered, they seem to offset the rest of the data by "77". I wondered if the issue is here

$current_coordinate += $number;

In any case, I wondered if there might be an option to exclude all reads with clipping (hard and soft)? I know it's very conservative, but even when I don't have the issue with annotation-origin-spanning reads, I found that the following

                                $current_coordinate -= $number if($results{start} == 0);
                                $results{start} = $current_coordinate if($results{start} == 0);
                $current_coordinate += $number;
                                $results{end} = $current_coordinate -1 if($results{end} < $current_coordinate);

reported false insertion sites for "chimeric reads". As in, if a read had the cigar 40S71M, the insertion coordinate would be adjusted by 40 to give a tn-insertion event 40bp upstream, yet when I look at the individual read it's not a real transposon-junction, and I'm still getting artificially inflated insertion reporting. (This particular library is intentionally not very diverse, so doesn't need much sequencing coverage, but as a consequence is more susceptible to "noise")

I hope this makes sense, I'd be happy to try and explain some more. Maybe this is just specific to my data(?) but posting here just in case. Thanks

lbarquist commented 5 months ago

Hi,

So, there are some issues with handling soft-clipping, and this has been exacerbated by the switch to BWA. I would try using the original default mapper, smalt (--smalt), and setting --smalt_y 1; this should filter out anything that doesn't have an exact match in the genome and will hopefully resolve the problem you're having. It may be worth doing some adapter/quality trimming if you're getting low mapping numbers.

-Lars