Returned MSA optionally includes the depths and errors.

nh13 commented 6 years ago

The depths and errors are returned as comma separated strings of values. For each base in the consensus sequence, the depth is the number of non-N and non-deletion (i.e. not 'N' and not '-') bases that cover/align that cover the consensus base across the non-consensus sequences. For each base in the consensus sequence, the number of errors are the bases from non-consensus sequences (same as the depth calculation) that mismatch or differ from the consensus sequence.

For example, consider a MSA with the following (consensus on the bottom):

CCAAAAAA-AAATTTCT
CCAAAAAA-AAATTTTT
--AAAAAACAAATTTTT
CCAAAAAA-AAATTTTT

The depths and errors would be

2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0

Three things to note:

The first sequence has an error relative ot the consensus, and we see that counted in the second to last value in the error string.
No depth is counted from the third sequence in the first two bases of the consensus.
There is no depth or error value for a deletion in the consensus (i.e. when the consensus MSA string has a '-', in this case due to the third sequence having an extra 'C').

nh13 commented 6 years ago

@rvaser I can certainly compute depth and errors from the output of the MSA, but rather than re-parsing the MSA strings, I would like to have spoa compute these on the fly.

rvaser commented 6 years ago

Hello Nils, would you mind if I change the API a bit? I would add an additional function generate_msa which takes two more vectors (or one with std::pair) as input parameters and fills them with depth and error values. How does that sound? Also, do you need this feature propagated to spoa executable or only for the library?

Best regards, Robert

rvaser commented 6 years ago

Also, I am not thrilled with the request for filtering N bases out of depth calculation becuase spoa can be run with any alphabet and might be used for proteins.

nh13 commented 6 years ago

@rvaser I'll certainly update the API as you suggest, though instead of depths and errors, would it then make sense to return a matrix instead, where say each column is a different base in the consensus sequence and each row is the # of observed sequences with that base? The # of rows would be the alphabet size plus one, with the latter added for the - character. Another suggestion would be to have a third array of values, namely the # of Ns observed. For my application, excluding no-calls (Ns) from the depth is important, so either way is fine by me.

rvaser commented 6 years ago

I fancy the matrix option! I will get to it right away, you don't have to trouble yourself implementing it. With the matrix, do you even need the MSA output or is the consensus sequence enough?

nh13 commented 6 years ago

@rvaser if I have the matrix, I don't need the MSA. And if I really wanted the MSA, which sometimes I do for manual debugging, I can just call generate_msa. Thank-you for being open to these changes and a pleasure to work with!

rvaser commented 6 years ago

I have implemented you request on branch feature_summary. You get the matrix by calling

std::string consensus = graph->generate_consensus(dst, true);

dst is a vector of uint32_ts which stores a matrix with graph->num_codes() + 1 rows and consensus.size() columns. Last row is reserved for indels -. You can iterate through the matrix for all codes stored in graph with

for (uint32_t i = 0; i < graph->num_codes(); ++i) {
    for (uint32_t j = 0; j < consensus.size(); ++j) {
        dst[i * consensus.size() + j];
    }
}

Note that - is not included in alphabet. Access the last row with

dst[graph->num_codes() * consensus.size()]

Alternatively, you can iterate over characters by calling graph->coder(c) to get the appropriate code.

If you call generate_consensus without second argument or pass false, dst.size() will be equal to consensus.size() and dst will contain base depths which equal to all bases covering a particular position in consensus.

Please try out the code and report back if you need anything else. Afterwards, I will merge the branch to master.

P.S. Wouldn't a transposed matrix be more cache friendly and easier to use in calculation of depth/error/quality?

nh13 commented 6 years ago

@rvaser yes, the transpose would actually be better, my apologies, as I am so used to thinking about alignments where the alignment string goes across the columns

nh13 commented 6 years ago

@rvaser I tested the code (without the transpose) and this works really well. For now, I don't think I want the transpose as I am printing the results in a human-readable format, though I understand your point.

rvaser commented 6 years ago

Should I merge the new feature to master as is?

nh13 commented 6 years ago

Works for me. Thank you for so quickly implementing this!

rvaser commented 6 years ago

No problem. :) Its on master and I created a new release with this feature.

rvaser / spoa

Returned MSA optionally includes the depths and errors. #5