BAM filtering query - ok to have multiple primary alignments in BAM file?

single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells

https://cellsnp-lite.readthedocs.io

Apache License 2.0


BAM filtering query - ok to have multiple primary alignments in BAM file? #39

Open lucygarner opened 2 years ago

lucygarner commented 2 years ago

Hi,

In your paper, you say that you "discard those reads with low alignment quality, including MAPQ < 20, aligned length < 30 nt and FLAG with UNMAP, SECONDARY, QCFAIL (and DUP if UMI is not applicable)."

I am using STAR for mapping and have set the parameter outSAMprimaryFlag as AllBestScore rather than the default OneBestScore. This means that all alignments with the best score are labelled as primary, rather than just one alignment being labelled as primary (and the rest as secondary).

Will this cause problems for cellsnp-lite or is it ok if there are multiple primary alignments?

Best wishes, Lucy
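For reference, the read filter quoted from the paper can be sketched in a few lines of Python. This is only an illustration, not cellsnp-lite's actual code: the function names are hypothetical, and counting aligned length as the sum of `M`, `=` and `X` CIGAR operations is an assumption. The FLAG bit values themselves come from the SAM specification.

```python
import re

# FLAG bits as defined in the SAM specification.
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400


def aligned_length(cigar: str) -> int:
    """Number of read bases aligned to the reference (M, = and X operations).

    Assumption: this is how "aligned length" is measured; the tool may differ.
    """
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "M=X")


def keep_read(flag: int, mapq: int, cigar: str, has_umi: bool,
              min_mapq: int = 20, min_len: int = 30) -> bool:
    """Return True if a read passes the filter described in the paper."""
    exclude = UNMAP | SECONDARY | QCFAIL
    if not has_umi:
        # Without UMIs, duplicate-flagged reads are dropped as well.
        exclude |= DUP
    if flag & exclude:
        return False
    if mapq < min_mapq:
        return False
    return aligned_length(cigar) >= min_len


# An "extra" primary alignment carries no SECONDARY bit, so the FLAG test
# alone does not remove it; only its MAPQ or length can.
print(keep_read(flag=0, mapq=255, cigar="50M", has_umi=False))
print(keep_read(flag=0, mapq=3, cigar="50M", has_umi=False))
```

With `AllBestScore`, the additional best-scoring alignments are not flagged as SECONDARY, so whether they survive depends on the MAPQ and length thresholds. The STAR manual documents a default MAPQ of 255 for unique alignments and 3 or lower for multi-mappers, so it may be worth checking the MAPQ values of those records in your own BAM.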

hxj5 commented 2 years ago

Hi Lucy,

Thanks for the question. Compared with the "single-primary" strategy, the "multi-primary" strategy can inflate allelic counts or introduce new (extra) alleles, which may ultimately give an incorrect genotype for certain query SNPs, in both smart-seq and 10x data.

In my opinion, the effect of "multi-primary" on cellsnp-lite depends on the fraction of "extra" primary alignments in the BAM file. I suppose it is generally OK to use the results if that fraction is low and specific SNPs are not the focus. Otherwise, re-alignment with the "single-primary" strategy is recommended.

Best, Xianjie
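One rough way to estimate the fraction of "extra" primary alignments mentioned above is to count primary records per read in SAM text. The sketch below is an assumption about how to measure it, not something cellsnp-lite provides: the function name is hypothetical, and a record counts as primary when it is mapped and neither SECONDARY nor SUPPLEMENTARY.

```python
from collections import Counter

# FLAG bits as defined in the SAM specification.
UNMAP, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800
READ1, READ2 = 0x40, 0x80


def extra_primary_fraction(sam_lines):
    """Fraction of primary alignments beyond the first one for each read."""
    per_read = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & (UNMAP | SECONDARY | SUPPLEMENTARY):
            continue
        # Track mates separately so a proper pair is not counted as "extra".
        mate = 1 if flag & READ1 else 2 if flag & READ2 else 0
        per_read[(qname, mate)] += 1
    total = sum(per_read.values())
    if total == 0:
        return 0.0
    extra = total - len(per_read)
    return extra / total
```

The input can be any iterable of SAM lines, for example the text output of `samtools view` read line by line. A value near zero suggests the two labelling strategies would give cellsnp-lite nearly the same set of reads.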

lucygarner commented 2 years ago

Hi Xianjie,

I forgot to mention that this is bulk RNA-seq data where I am using the "multi-primary" strategy. Would you still recommend re-alignment using a "single-primary" strategy in this case, or do you think the effect will be reduced with bulk data? Also, I am only using cellsnp-lite for genotyping the donors to allow for genetic demultiplexing, so as long as I can distinguish between the donors, that is the main thing.

Best wishes, Lucy

hxj5 commented 2 years ago

Hi, the effect also exists in bulk data. In this case, incorrect allelic counts or genotypes would make demultiplexing less accurate. You could quickly check whether the downstream demultiplexing works well with the "multi-primary" alignments, or simply re-align using the "single-primary" strategy instead.
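As a proxy for the quick check suggested above, one could genotype the same donor from both alignments and compare the calls at shared sites. The sketch below assumes the genotypes are available as standard VCF text with a `GT` field; the function names are hypothetical, and genotype concordance is only an indirect indicator of how well demultiplexing will work.

```python
def load_genotypes(vcf_lines, sample_index=0):
    """Map (chrom, pos) -> unphased, sorted genotype string for one sample."""
    calls = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= 9 + sample_index:
            continue  # no sample column available
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        values = fields[9 + sample_index].split(":")
        gt = values[keys.index("GT")].replace("|", "/")
        if "." in gt:
            continue  # missing call
        calls[(fields[0], int(fields[1]))] = "/".join(sorted(gt.split("/")))
    return calls


def genotype_concordance(calls_a, calls_b):
    """Fraction of shared sites with identical genotypes (None if no overlap)."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return None
    same = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return same / len(shared)
```

High concordance between the "multi-primary" and "single-primary" runs would suggest that the choice matters little for distinguishing donors in that dataset; low concordance would be a reason to look more closely before relying on either.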

lucygarner commented 2 years ago

Hi @hxj5,

I had a follow-on question. Since the "single-primary" strategy for mapping can arbitrarily select one alignment as the primary alignment (out of multiple alignments with the same mapping score), could this not equally cause errors in genotyping? Since by chance an incorrectly aligned read may be labelled as the "primary" alignment and you would be ignoring the "secondary" alignment that could be the true alignment?

Best wishes, Lucy

hxj5 commented 2 years ago

Hi Lucy,

Yes, the arbitrary labelling in "single-primary" could cause errors in genotyping. The issue is difficult to mitigate, and its effect depends mainly on the quality of the upstream sequencing and alignment and on the nature of the region.

Best, Xianjie

lucygarner commented 2 years ago

Hi Xianjie,

Thank you - would you say these errors are likely to be fewer than with the "multi-primary" strategy (where reads with an equally high score are all labelled as primary), or is there not much difference? I am trying to decide whether changing to a "single-primary" approach, where the choice of primary is arbitrary, is actually better than my current approach (multiple primary alignments with the same score).

For bulk RNA-seq (no UMIs), do duplicate reads need to be marked with e.g. Picard so that they will be discarded? Or is duplicate marking performed internally?

Best wishes, Lucy

hxj5 commented 2 years ago

Hi Lucy,

In my opinion, "single-primary" is generally better for genotyping than "multi-primary", though both have some issues, as we discussed. "Single-primary" uses only one of the "best" alignments instead of possibly incorporating some alignments with low mapping quality, which I think should be good for genotyping.

For a specific dataset, if it is small, you could run both strategies through the downstream analysis and compare final performance to determine which is better. Otherwise, I would recommend "single-primary".

Duplicate marking should be good for genotyping; we saw some examples in smart-seq data. cellsnp-lite does not mark duplicates itself, so an upstream tool (e.g., Picard) is needed.

Best, Xianjie
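Since duplicate marking has to happen upstream, it can be useful to confirm that the BAM actually contains duplicate-flagged records before genotyping. The sketch below counts records with the SAM duplicate bit (0x400) in SAM text; the function name is hypothetical and this is only a sanity check, not part of cellsnp-lite.

```python
DUP = 0x400  # duplicate bit from the SAM specification


def duplicate_summary(sam_lines):
    """Return (total_records, duplicate_flagged_records) for SAM text lines."""
    total = 0
    flagged = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t")[1])
        total += 1
        if flag & DUP:
            flagged += 1
    return total, flagged
```

If a sizeable bulk RNA-seq BAM reports zero flagged records, duplicates have most likely not been marked, and a DUP-based filter would have nothing to exclude.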