I found this R package quite handy after writing my own custom functions which were otherwise verbose. I have begun to like using this package but I have queries related to identity and coverage columns, their relevance and how they are calculated. In the documentation you say:
Identity represents the proportion of sequences, among all sequences with a DNA base (A, C, G, T) or gap (-) that has the majority consensus base.
This statement is unclear to me since it seems to suggest that Identity is nothing but the proportion of nucleotide with maximum frequency which should be calculated as Identity=max(freq(A,T,G,C,-))/total sequences aligned which is other way of saying Identity=1-gaps column for nucleotides A,T,G,C and Identity=gaps for -
However, I manually calculated identity values and I realized that the Identity = (First Frequent Base or Gap) / (total sequences aligned - Second Frequent base or Gap) where :
#First Frequent Base or Gap
max_values <- apply(consensusMatrix(fasta), 2, max)
# Second Frequent base or Gap
second_max <- apply(consensusMatrix(fasta), 2, function(col) sort(col, decreasing = TRUE)[2])
Identity= max_values/(899 - second_max)
See the picture below. I am struggling to understand what is the relevance of this metric when we can simply use Identity=1-gaps column for nucleotides A,T,G,C and Identity=gaps for - to measure sequence conservation.
Also, it is unclear to me how the coverage is calculated and why some bases are being converted into IUPAC consensus characters for the following bases (encircled in blue). For generating profile I used:
For row 7 (7th position in alignment) where only G and - passes this threshold so why the ambiguous base R is assigned? according to IUPAC names here, I think R should be assigned in a place where there is ambiguity between A or G. But frequency of A in 7th position is minuscule and does not exceed the threshold.
Hi
I found this R package quite handy after writing my own custom functions which were otherwise verbose. I have begun to like using this package but I have queries related to
identity
andcoverage
columns, their relevance and how they are calculated. In the documentation you say:This statement is unclear to me since it seems to suggest that Identity is nothing but the proportion of nucleotide with maximum frequency which should be calculated as
Identity=max(freq(A,T,G,C,-))/total sequences aligned
which is other way of sayingIdentity=1-gaps
column for nucleotidesA,T,G,C
andIdentity=gaps
for-
However, I manually calculated identity values and I realized that the
Identity = (First Frequent Base or Gap) / (total sequences aligned - Second Frequent base or Gap)
where :See the picture below. I am struggling to understand what is the relevance of this metric when we can simply use
Identity=1-gaps
column for nucleotidesA,T,G,C
andIdentity=gaps
for-
to measure sequence conservation.Also, it is unclear to me how the coverage is calculated and why some bases are being converted into IUPAC consensus characters for the following bases (encircled in blue). For generating profile I used:
For row 7 (7th position in alignment) where only
G
and-
passes this threshold so why the ambiguous baseR
is assigned? according to IUPAC names here, I thinkR
should be assigned in a place where there is ambiguity betweenA or G
. But frequency of A in 7th position is minuscule and does not exceed the threshold.