sanger-pathogens / Bio-Tradis

A set of tools to analyse the output from TraDIS analyses
https://sanger-pathogens.github.io/Bio-Tradis/
Other
21 stars 29 forks source link

Possible issue with calculation of read start and ends in module Cigar.pm #120

Closed irilenia closed 3 years ago

irilenia commented 4 years ago

Hi, We attempted to use your pipeline to process the sequencing results of libraries that were produced using the Himar1 transposon. We were getting some strange results and whilst looking closer at the problem, it became obvious that a number of insertion sites were called at places that appeared to be wrong. The forward strand did no seem to be affected but many reads on the reverse strand were associated with insertion sites that were some short distance from the actual insertion site. After a bit of digging into the code I believe that the problem is in Cigar.pm.

To explain the issue as best as I could, I created a couple of slides which I have attached here as snapshots. In the first slide, you see at the top a snapshot from Artemis showing two larger peaks where the TA insertion is expected. The reads below are on the positive strand on the right and negative strand on the left. If the reads are counted, we find that the positive-strand reads add up correctly to the red peak but the negative strand reads do not add up to the blue peak (the picture has been cropped so you can't see all the reads but I can send you the original). However, reverse reads that match 100% to the reference do add up to the blue peak so something is not right with the reads that do not fully match the reference.

Screen Shot 2020-07-15 at 23 36 27

I think the problem is in the calculation within Cigar.pm. I have again annotated the code using information for this specific read I have highlighted above in red:

Screen Shot 2020-07-15 at 23 38 40

If I followed the calculation correctly (I admit to not have run this myself - I'm simply following the code in my head), then the start of this read is set to 1531 and the end to 1640. The program correctly uses the end of the read to call the insertion site (as the read is on the reverse strand) but 1640 is calculated wrongly, in my opinion. The blue peak at 1640 corresponds to this call and I'm unable to find any other read that could be responsible for it.

Given the evidence above, I think there is a problem with calculation of read starts and ends when soft-clipping happens and the cigar string starts with action "S" rather than action "M".

I saw that someone else had raised a similar issue before (although they didn't look into why it was happening - issue #86; now closed) and they were advised that this could be a result of polymerase slippage. There may well be cases like this as well in the data but I think there is an additional issue with how the read start/ends are calculated. In fact a reply to that issue explained that the problem goes away when the smalt-y parameter is set to 1 (which would mean that soft-clipped reads are not considered). If the sites were real sites of insertion (but produced by slippage), I don't see how this fix would have worked, although I admit I perhaps do not understand well the technology here.

I have little experience with Tn-seq but it seems to me that the problem wouldn't affect generally results from saturation libraries or libraries where insertion happens at random places in the genome. The insertion sites called are generally very close to the actual real site and so as long as they fall within a gene's boundaries, they would have been added to the correct gene anyway. However, for cases like ours, the calculation is problematic.

I'm sorry for the really long message but I wanted to explain as best as I can the situation. It is entirely possible that I'm wrong but looking at the code I believe there is a problem, just not one that would affect most people's calculations or that would even be visible in a situation where many insertion sites exist across the whole genome.

Thank you very much in advance, Irilenia

lbarquist commented 4 years ago

Hi Irilenia,

Thanks for the detailed report, I agree this looks like an issue.

Could you try inserting the following code in the S/D/N elsif (line 46 in Cigar.pm) and tell me if it fixes your issue with the reverse strand reads (I guess it's more widespread than just this one example)?:

$current_coordinate -= $number if($results{start} == 0);
$results{start} = $current_coordinate if($results{start} == 0);
$current_coordinate += $number;

This should reproduce your calculation for the (true) read start. It would also change the start position for some reads on the forward strand, but I guess soft clipping here is less likely than at the end.

-Lars

irilenia commented 3 years ago

Screen Shot 2020-07-23 at 17 04 23

Thank you for fixing this. I can confirm that this fixes the problem of the small peaks caused by soft-clipped reads. See snapshot of the same region I had shown before for confirmation.

A NOTE FOR HIMAR1 USERS: The plot also suggests that two insertion sites are called (not sure if this is fixed further down but I doubt it based on the stats we see at the end) whereas they really are the same insertion site (with reads being generated by the inverted repeat in the transposon). Thus, care should be taken when using this software to interpret Himar1-produced libraries.