Saturation analysis for parameters in STAR

kane9530 commented 1 year ago

Hi Oliver and team,

Thank you for working to develop this useful tool. I have a question regarding the results of a saturation analysis test on my dataset with the STAR aligner, as recommended in your TEtranscripts paper.

Specifically, I notice that whilst increasing the outFilterMultimapNmax and winAnchorMultimapNmax parameters reduces the % of reads discarded for mapping to too many loci (panel B), it also leads to a pronounced increase in the % of unmapped reads (panel D). On the x-axis of the figure below, the first number is the value of the outFilterMultimapNmax parameter and the second is the value of the winAnchorMultimapNmax parameter.

[Figure: saturationAnalysis]

My questions are:

1. Is this a known / expected behaviour? What is the reason for the increase in the % of unmapped reads?
2. I plan to analyse the BAM files aligned with the 100_100 parameters in TEtranscripts, because for that setting the median % of reads mapped to too many loci is relatively low, the median % of unmapped reads is relatively low, and the median % of uniquely mapped reads is relatively high. Does this make sense?

I understand that this is more of a STAR-related question, but I thought I would seek your advice, since this approach was suggested in the paper and I am using TEtranscripts for repeat analysis. Thank you!

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software. We have looked into this before and discussed it with the author of STAR. This is their response:

[With winAnchorMultimapNmax 100] Each anchor can map to no more than 100 loci, but multiple anchors can map to more than 100 loci, allowing for alignments with >100 loci. However, there is no guarantee that all alignments of a >100 multi-mapping read will be found, since anchors mapping to >100 loci are dropped. Increasing winAnchorMultimapNmax allows STAR to use shorter seeds as anchors, which increases sensitivity for problematic alignments (with many mismatches/indels). I would recommend setting winAnchorMultimapNmax = 2 * outFilterMultimapNmax, but no less than 50 - in your case outFilterMultimapNmax=100, so it would be 200. You can play with setting it higher - this, in theory, should give you more sensitivity for the multi-mapping reads, but the effect will probably be negligible. On the other hand, it may reduce the mapping speed significantly.
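As a rough illustration of the rule of thumb quoted above (twice outFilterMultimapNmax, but no less than 50), here is a minimal Python sketch. The function names are hypothetical helpers for this thread, not part of STAR or TEtranscripts; only the two STAR option names come from the discussion.

```python
def recommended_win_anchor(out_filter_multimap_nmax, floor=50):
    """Rule of thumb from the quoted reply: 2 * outFilterMultimapNmax, but at least 50."""
    return max(2 * out_filter_multimap_nmax, floor)


def star_multimap_args(out_filter_multimap_nmax):
    """Build only the two multimapping-related STAR arguments (hypothetical helper).

    All other STAR options (genome index, input reads, output settings)
    are left to the caller.
    """
    win_anchor = recommended_win_anchor(out_filter_multimap_nmax)
    return [
        "--outFilterMultimapNmax", str(out_filter_multimap_nmax),
        "--winAnchorMultimapNmax", str(win_anchor),
    ]


if __name__ == "__main__":
    # Print the paired settings for a few candidate values.
    for n in (10, 50, 100):
        print(" ".join(star_multimap_args(n)))
```

With outFilterMultimapNmax=100 this gives winAnchorMultimapNmax=200, i.e. the 100_200 setting in the naming used for the figure.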

To address your questions: we suspect that the percentage of unmapped reads increases because, with more anchors and greater sensitivity (shorter seeds), a read needs a better alignment to be confidently assigned to a genomic location (whereas a longer seed might be more lenient). Again, this is a question that the STAR developers may be better placed to answer.

If you look at the 100_100 run, you will notice that its uniquely mapped percentage is close to that of the 10_10 run. This is quite unexpected, as we should be increasing the multimapping rate; in fact, we see a decrease in the multimapping rate. This suggests that the parameters provided are not really helping to recover multimappers, and might be erroneously causing reads to be "uniquely" aligned even when they are not. I would therefore be hesitant (also given our previous communication with the STAR author) to use that setting, and would probably recommend 100_200. The increase in reads "mapped to too many loci" might be due to the seeds finding more matches, but I think the amount of "lost" reads is minimal.
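To compare runs such as 10_10, 100_100 and 100_200 on the same metrics, one option is to pull the percentages out of each run's STAR Log.final.out. The sketch below is only an illustration: it assumes the usual "label | value%" layout of that file and the field labels as I recall them, so please check them against your own logs. The function names are hypothetical.

```python
def parse_star_percentages(log_text):
    """Collect every 'label | value%' line from the text of a STAR Log.final.out.

    Assumes each statistic is on its own line, with label and value
    separated by '|'. Lines without a percentage value are skipped.
    """
    stats = {}
    for line in log_text.splitlines():
        if "|" not in line:
            continue
        label, value = (part.strip() for part in line.split("|", 1))
        if not value.endswith("%"):
            continue
        try:
            stats[label] = float(value[:-1])
        except ValueError:
            continue
    return stats


def summarize_mapping(stats):
    """Reduce the parsed percentages to the four quantities discussed in this thread.

    The labels below are assumed to match STAR's output; adjust if yours differ.
    """
    unmapped = sum(
        value for label, value in stats.items()
        if label.startswith("% of reads unmapped")
    )
    return {
        "unique": stats.get("Uniquely mapped reads %", 0.0),
        "multi": stats.get("% of reads mapped to multiple loci", 0.0),
        "too_many": stats.get("% of reads mapped to too many loci", 0.0),
        "unmapped": unmapped,
    }
```

Calling `summarize_mapping(parse_star_percentages(open(path).read()))` for each run's log yields one small dictionary per parameter setting, which makes it easier to see whether a higher uniquely mapped percentage comes together with a lower multimapping percentage.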

Please let us know if that doesn't address your question.

Thanks.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

mhammell-laboratory / TEtranscripts

Saturation analysis for parameters in STAR #124