How to determine SNP frequency in multi-copy genes (e.g., 23S rRNA).

slvrshot commented 1 year ago

Hello Martin,

In your paper it is stated that ARIBA "... 4) identifies SNP frequency in multicopy genes, which has been traditionally difficult to resolve due to the complexities of de novo assembly".

Do you mean that ARIBA can detect how many copies (using 23S rRNA, with 4 copies, as an example) are actually mutated? How is this reported, and how does ARIBA do it? I have seen others use custom scripts (usually involving breseq or bwatools) that map reads to a single copy of the gene and then determine the frequency of each base among the reads that align to a particular SNP position.

Thanks!

martinghunt commented 1 year ago

It can't tell how many copies there are. If there are differences between the copies then it can give an estimate of the ratio of the copies...

It looks at the pileup of the reads mapped to the assembled contig. Unless the copies are identical, you'll see "heterozygous" SNPs in the report file (https://github.com/sanger-pathogens/ariba/wiki/Task:-run#report-file). The smtls_nts column will have more than one nucleotide, and the corresponding depth of each nucleotide is in the smtls_nts_depth column. You'll probably also see that the flag (https://github.com/sanger-pathogens/ariba/wiki/Task:-flag) has variants_suggest_collapsed_repeat.
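As a rough illustration of the point above, here is a minimal Python sketch that picks out report rows where smtls_nts lists more than one nucleotide. It assumes the report is tab-separated with a header row and that multiple nucleotides are comma-separated, as in the "G,A" example later in this thread. The function name heterozygous_rows and the tiny in-memory report are hypothetical: the first data row uses the numbers quoted in this thread, the second is made up, and a real report has many more columns.

```python
import csv
import io


def heterozygous_rows(report_handle):
    """Yield rows whose smtls_nts column lists more than one nucleotide.

    Assumes a tab-separated report with a header line and comma-separated
    nucleotides in smtls_nts (an assumption, not checked against ARIBA).
    """
    reader = csv.DictReader(report_handle, delimiter="\t")
    for row in reader:
        nucleotides = row["smtls_nts"].split(",")
        if len(nucleotides) > 1:
            yield row


# Hypothetical, heavily trimmed report: only the columns discussed here.
example_report = io.StringIO(
    "ref_name\tsmtls_total_depth\tsmtls_nts\tsmtls_nts_depth\n"
    "23S\t722\tG,A\t703,2\n"   # numbers quoted in this thread
    "23S\t650\tT\t650\n"       # made-up row with a single nucleotide
)

hets = list(heterozygous_rows(example_report))
for row in hets:
    print(row["ref_name"], row["smtls_nts"], row["smtls_nts_depth"])
```

For an actual report, the same function could be given an open file handle instead of the in-memory example.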

Also, multiple copies can be flagged because their flanking sequences will be different. In this case, the flag will have scaffold_graph_bad (because it's ambiguous which contig ends can link together).

slvrshot commented 1 year ago

Martin, thanks for the explanation. So, to be clear, it can give an estimate of the ratio of copies with a particular mutation?

For example, I have the following. I am interested in 2045G, and the mutation is detected. Both G and A are reported, meaning that the copies are "heterozygous". The depth is 703 for G and 2 for A. (See below)

703/722 = 0.974; 2/722 = 0.0028

I would gather that, based on smtls_total_depth, we can estimate that all of the copies likely carry this particular mutation. Does this sound correct? Thanks again for developing and supporting this tool.
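The per-nucleotide fractions in the example above can be checked with a few lines of Python. This is only a sketch: the helper name allele_fractions is hypothetical, and it assumes the comma-separated "G,A" and "703,2" layout for smtls_nts and smtls_nts_depth, with 722 taken as smtls_total_depth. Note that 703 + 2 = 705, so in this example 17 of the 722 reads are not accounted for by the two listed nucleotides.

```python
def allele_fractions(nts, nts_depth, total_depth):
    """Return {nucleotide: depth / total_depth} for one report row.

    nts and nts_depth are comma-separated strings, e.g. "G,A" and "703,2"
    (assumed layout of smtls_nts and smtls_nts_depth).
    """
    bases = nts.split(",")
    depths = [int(d) for d in nts_depth.split(",")]
    if len(bases) != len(depths):
        raise ValueError("nucleotide and depth lists differ in length")
    return {base: depth / total_depth for base, depth in zip(bases, depths)}


fractions = allele_fractions("G,A", "703,2", 722)
for base, fraction in fractions.items():
    print(f"{base}: {fraction:.4f}")
# G: 0.9737
# A: 0.0028
```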

sanger-pathogens / ariba, issue #337