sofpn / rprimer

3 stars 1 forks source link

Confusion regarding identity and coverage column and their relevance #3

Open Rohit-Satyam opened 1 year ago

Rohit-Satyam commented 1 year ago

Hi

I found this R package quite handy after writing my own custom functions which were otherwise verbose. I have begun to like using this package but I have queries related to identity and coverage columns, their relevance and how they are calculated. In the documentation you say:

Identity represents the proportion of sequences, among all sequences with a DNA base (A, C, G, T) or gap (-) that has the majority consensus base.

This statement is unclear to me since it seems to suggest that Identity is nothing but the proportion of nucleotide with maximum frequency which should be calculated as Identity=max(freq(A,T,G,C,-))/total sequences aligned which is other way of saying Identity=1-gaps column for nucleotides A,T,G,C and Identity=gaps for -

However, I manually calculated identity values and I realized that the Identity = (First Frequent Base or Gap) / (total sequences aligned - Second Frequent base or Gap) where :

#First Frequent Base or Gap
max_values <- apply(consensusMatrix(fasta), 2, max)

# Second Frequent base or Gap
second_max <- apply(consensusMatrix(fasta), 2, function(col) sort(col, decreasing = TRUE)[2])

Identity= max_values/(899 - second_max)

See the picture below. I am struggling to understand what is the relevance of this metric when we can simply use Identity=1-gaps column for nucleotides A,T,G,C and Identity=gaps for - to measure sequence conservation.

image

Also, it is unclear to me how the coverage is calculated and why some bases are being converted into IUPAC consensus characters for the following bases (encircled in blue). For generating profile I used:

myConsensusProfile <- consensusProfile(fasta, ambiguityThreshold = 0.05)

For row 7 (7th position in alignment) where only G and - passes this threshold so why the ambiguous base R is assigned? according to IUPAC names here, I think R should be assigned in a place where there is ambiguity between A or G. But frequency of A in 7th position is minuscule and does not exceed the threshold.

Rohit-Satyam commented 1 year ago

Hi @sofpn. Could you please clarify this since I am still trying to wrap my head around the query I raised?