hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License

Getting nucleotide counts at each position #37

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

Currently only DP and AD are recorded, and AF is computed as AD/DP?

If REF is A, does AD count sum(C,G,T)?

Can you report NUM_A, NUM_C, NUM_T, NUM_G for each site?

hsinnan75 commented 4 years ago

DP = sum(A,C,G,T)
AD = ALT_depth

I'll check if there are tags for those attributes.

tseemann commented 4 years ago

So:

REF = A
ALT = T
DP = 100  # number of accepted reads covering this site
AD = 90   # number of T at this site
AF = 0.9  # AD/DP? or AD/num(REF=A)? This is what I am unsure about

hsinnan75 commented 4 years ago

In my definition, AF = AD/DP. I thought this was the formal definition. By the way, I could not find any existing flags for reporting the occurrences of every nucleotide. Do you have any suggestions?
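As a minimal sketch of the definitions given in this thread (DP = sum(A,C,G,T), AD = ALT depth, AF = AD/DP), the arithmetic can be written out in Python. The function name `site_stats` and the per-base counts are made up for illustration; this is not MapCaller's code.

```python
from typing import Dict, Tuple


def site_stats(counts: Dict[str, int], alt: str) -> Tuple[int, int, float]:
    """Return (DP, AD, AF) for one site, using the definitions from this thread.

    counts: hypothetical per-base read counts at the site, e.g. {"A": 8, "T": 90}
    alt:    the ALT base whose depth is reported as AD
    """
    dp = sum(counts.get(base, 0) for base in "ACGT")  # DP = sum(A,C,G,T)
    ad = counts.get(alt, 0)                           # AD = ALT depth
    af = ad / dp if dp > 0 else 0.0                   # AF = AD/DP (0.0 if uncovered)
    return dp, ad, af


# Made-up counts for a REF=A, ALT=T site.
counts = {"A": 8, "C": 2, "G": 0, "T": 90}
dp, ad, af = site_stats(counts, alt="T")
print(dp, ad, af)
```

Because DP sums all four bases, reads carrying a third base (the two C reads here) still count in the denominator of AF.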

tseemann commented 4 years ago

There are no existing VCF flags for reporting the nucleotide occurrences that I know of.

I was thinking maybe they should be a list instead, e.g. NTFREQ=1,3,0,75, i.e. the counts for A,C,G,T (alphabetical order). We might also want to know about any other IUPAC codes, like N or R (mostly N).
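To make the proposed layout concrete, here is a small Python sketch that builds an `NTFREQ=` string from the bases observed at one site. It assumes the A,C,G,T ordering suggested above; the helper names `format_ntfreq` and `count_other_codes` and the input string are hypothetical, and other IUPAC codes are counted separately rather than folded into the list.

```python
from collections import Counter

BASE_ORDER = ("A", "C", "G", "T")  # alphabetical order, as proposed above


def format_ntfreq(observed_bases: str) -> str:
    """Format per-base counts as NTFREQ=<A>,<C>,<G>,<T>."""
    counts = Counter(observed_bases.upper())
    return "NTFREQ=" + ",".join(str(counts[base]) for base in BASE_ORDER)


def count_other_codes(observed_bases: str) -> int:
    """Count symbols that are not A, C, G or T (e.g. N or R)."""
    return sum(1 for base in observed_bases.upper() if base not in BASE_ORDER)


# Hypothetical column of bases covering one site: 1 A, 3 C, 0 G, 75 T, plus 2 N.
column = "A" + "C" * 3 + "T" * 75 + "NN"
print(format_ntfreq(column))
print(count_other_codes(column))
```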

@andersgs suggested to me that in full count mode you could do this for an A=>C call

REF=A
ALT=C,T,G
DP=32
AC=31,0,1   # allele count in genotypes, for each ALT allele, in the same order as listed
GT=1        # ploidy 1 in this case (means 1st ALT allele, which is 'C' here)
AN=?        # AN : total number of alleles in called genotypes

andersgs commented 4 years ago

@tseemann that looks good to me. Just that typically the ALT field is not comma-separated but just a character string (e.g., CTG). Presumably, the ALT list could contain any IUPAC code. Although, other than N and the normal bases I don't see any of the others happening in the read data.

If you wanted to, you could use the EC field of the INFO column:

• EC : comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field (typically used in association analyses) (Integers)

But, then you are nesting data that one may not want to nest.
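For reference, the "full count mode" example earlier in this thread (REF=A, ALT=C,T,G, DP=32, AC=31,0,1, GT=1) can be reproduced with a short Python sketch. It follows that example in filling AC with per-base read counts in ALT order; the function name `full_count_fields` and the haploid GT rule (pick the best-supported allele) are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def full_count_fields(
    ref: str, alts: List[str], counts: Dict[str, int]
) -> Tuple[int, List[int], int]:
    """Return (DP, AC, GT) for a haploid site.

    DP: total of A, C, G and T counts
    AC: one count per ALT allele, in the same order as `alts`
    GT: 0 for REF, otherwise the 1-based index of the best-supported ALT
        (assumed rule; ties between REF and ALT go to REF)
    """
    dp = sum(counts.get(base, 0) for base in "ACGT")
    ac = [counts.get(alt, 0) for alt in alts]
    gt = 0
    best = counts.get(ref, 0)
    for index, alt_count in enumerate(ac, start=1):
        if alt_count > best:
            best, gt = alt_count, index
    return dp, ac, gt


# Counts chosen to match the example in this thread.
dp, ac, gt = full_count_fields("A", ["C", "T", "G"], {"A": 0, "C": 31, "G": 1, "T": 0})
print(f"DP={dp} AC={','.join(map(str, ac))} GT={gt}")
```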

hsinnan75 commented 4 years ago

I added an NTFREQ field to the VCF output. Every SNV and monomorphic entry now shows the number of occurrences of every base at that position. Please update to v0.9.9.18.
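A downstream script could read the new field along these lines. This is only a sketch: it assumes NTFREQ appears in the INFO column as four comma-separated integers in A,C,G,T order, as proposed earlier in the thread, which should be checked against the actual v0.9.9.18 output. The function name `parse_ntfreq` and the INFO string are made up.

```python
from typing import Dict


def parse_ntfreq(info: str) -> Dict[str, int]:
    """Extract NTFREQ from a semicolon-separated INFO string.

    Assumes four integers in A,C,G,T order (an assumption, not confirmed here).
    """
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "NTFREQ" and sep:
            parts = value.split(",")
            if len(parts) != 4:
                raise ValueError(f"expected 4 NTFREQ values, got {len(parts)}")
            return dict(zip("ACGT", (int(part) for part in parts)))
    raise KeyError("NTFREQ not present in INFO")


# Hypothetical INFO string, not copied from real MapCaller output.
info = "DP=79;AD=75;NTFREQ=1,3,0,75"
freqs = parse_ntfreq(info)
print(freqs)
print(sum(freqs.values()))
```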