Closed nh13 closed 6 years ago
@rvaser I can certainly compute depth
and errors
from the output of the MSA, but rather than re-parsing the MSA strings, I would like to have spoa
compute these on the fly.
Hello Nils,
would you mind if I change the API a bit? I would add an additional function generate_msa
which takes two more vectors (or one with std::pair) as input parameters and fills them with depth and error values. How does that sound? Also, do you need this feature propagated to spoa
executable or only for the library?
Best regards, Robert
Also, I am not thrilled with the request for filtering N
bases out of depth calculation becuase spoa
can be run with any alphabet and might be used for proteins.
@rvaser
I'll certainly update the API as you suggest, though instead of depths and errors, would it then make sense to return a matrix instead, where say each column is a different base in the consensus sequence and each row is the # of observed sequences with that base? The # of rows would be the alphabet size plus one, with the latter added for the -
character. Another suggestion would be to have a third array of values, namely the # of Ns observed. For my application, excluding no-calls (N
s) from the depth is important, so either way is fine by me.
I fancy the matrix option! I will get to it right away, you don't have to trouble yourself implementing it. With the matrix, do you even need the MSA output or is the consensus sequence enough?
@rvaser if I have the matrix, I don't need the MSA. And if I really wanted the MSA, which sometimes I do for manual debugging, I can just call generate_msa
. Thank-you for being open to these changes and a pleasure to work with!
I have implemented you request on branch feature_summary
. You get the matrix by calling
std::string consensus = graph->generate_consensus(dst, true);
dst
is a vector of uint32_t
s which stores a matrix with graph->num_codes() + 1
rows and consensus.size()
columns. Last row is reserved for indels -
. You can iterate through the matrix for all codes stored in graph with
for (uint32_t i = 0; i < graph->num_codes(); ++i) {
for (uint32_t j = 0; j < consensus.size(); ++j) {
dst[i * consensus.size() + j];
}
}
Note that -
is not included in alphabet. Access the last row with
dst[graph->num_codes() * consensus.size()]
Alternatively, you can iterate over characters by calling graph->coder(c)
to get the appropriate code.
If you call generate_consensus
without second argument or pass false
, dst.size()
will be equal to consensus.size()
and dst
will contain base depths which equal to all bases covering a particular position in consensus.
Please try out the code and report back if you need anything else. Afterwards, I will merge the branch to master
.
P.S. Wouldn't a transposed matrix be more cache friendly and easier to use in calculation of depth/error/quality?
@rvaser yes, the transpose would actually be better, my apologies, as I am so used to thinking about alignments where the alignment string goes across the columns
@rvaser I tested the code (without the transpose) and this works really well. For now, I don't think I want the transpose as I am printing the results in a human-readable format, though I understand your point.
Should I merge the new feature to master
as is?
Works for me. Thank you for so quickly implementing this!
No problem. :) Its on master
and I created a new release with this feature.
The depths and errors are returned as comma separated strings of values. For each base in the consensus sequence, the depth is the number of non-N and non-deletion (i.e. not 'N' and not '-') bases that cover/align that cover the consensus base across the non-consensus sequences. For each base in the consensus sequence, the number of errors are the bases from non-consensus sequences (same as the depth calculation) that mismatch or differ from the consensus sequence.
For example, consider a MSA with the following (consensus on the bottom):
The depths and errors would be
Three things to note: