vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Alignment identity doesn't make any sense, ignores deletions #472

Open adamnovak opened 8 years ago

adamnovak commented 8 years ago

Alignment identity is calculated by identity(const Path& path), which divides the match count by path_to_length(path). The latter function returns the total to_length of all the Mappings, so the reported identity ignores deletions relative to the reference: a deletion edit has a from_length but no to_length, and it never reaches the denominator.

For example:

echo '{"node":[{"id": 1, "sequence": "ACACACACACACACTGCGCGCGCGCG"}]}' | vg view -Jv - | vg align -s ACACACACACACACGCGCGCGCGCG -j -
{"path": {"mapping": [{"edit": [{"from_length": 14, "to_length": 14}, {"from_length": 1}, {"from_length": 11, "to_length": 11}], "position": {"node_id": 1}, "rank": 1}]}, "sequence": "ACACACACACACACGCGCGCGCGCG", "score": 19, "identity": 1.0}

Note the {"from_length": 1} edit and the "identity": 1.0.

A better measure of "identity" is probably matches over total alignment columns.
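To make the difference concrete, here is a minimal Python sketch (not vg's C++ code) that recomputes both measures from the JSON output above. It assumes an edit is a match when from_length equals to_length and it carries no "sequence" field, and that each edit contributes max(from_length, to_length) alignment columns. The function names are made up for illustration.

```python
import json

# Alignment JSON copied from the example output above.
ALIGNMENT_JSON = (
    '{"path": {"mapping": [{"edit": [{"from_length": 14, "to_length": 14}, '
    '{"from_length": 1}, {"from_length": 11, "to_length": 11}], '
    '"position": {"node_id": 1}, "rank": 1}]}, '
    '"sequence": "ACACACACACACACGCGCGCGCGCG", "score": 19, "identity": 1.0}'
)


def iter_edits(alignment):
    """Yield (from_length, to_length, sequence) for every edit; absent fields default to 0 / ''."""
    for mapping in alignment["path"]["mapping"]:
        for edit in mapping.get("edit", []):
            yield (edit.get("from_length", 0),
                   edit.get("to_length", 0),
                   edit.get("sequence", ""))


def count_matches(alignment):
    """Assumption: a match has equal, non-zero lengths and no replacement sequence."""
    return sum(f for f, t, seq in iter_edits(alignment) if f == t and f > 0 and not seq)


def identity_over_read(alignment):
    """Matches divided by total to_length (the behaviour described in this issue)."""
    total = sum(t for _, t, _ in iter_edits(alignment))
    return count_matches(alignment) / total if total else 0.0


def identity_over_columns(alignment):
    """Matches divided by alignment columns, so deletions count against identity."""
    total = sum(max(f, t) for f, t, _ in iter_edits(alignment))
    return count_matches(alignment) / total if total else 0.0


aln = json.loads(ALIGNMENT_JSON)
print(identity_over_read(aln))                 # 1.0
print(round(identity_over_columns(aln), 4))    # 0.9615
```

With 25 matched bases and one deleted base, the column-based measure is 25/26 rather than 1.0.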

ekg commented 8 years ago

Deletions are complicated. Maybe we can calculate them when the xg index with positional paths is available?


edawson commented 8 years ago

A better measure of "identity" is probably matches over total alignment columns.

This seems more intuitive to me as well. Am I missing something really simple as to why we don't do this?

ekg commented 8 years ago

Yes. We need to load the entire graph in the region of the alignment to do this.


glennhickey commented 8 years ago

As far as I understand, what Adam and Eric are suggesting is just to incorporate from_lengths from deletion edits into the total_length count in identity(), which is doable without loading extra stuff. This way insertions and deletions get counted symmetrically against the identity of the mapping.
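A rough Python sketch of that symmetry argument, using a toy model where edits are plain (from_length, to_length) pairs and only matches, pure insertions and pure deletions occur. The function names are hypothetical and this is not the actual identity() code.

```python
# Toy model: an edit is a (from_length, to_length) pair.
# Match: from == to; insertion: from == 0; deletion: to == 0.

def matches(edits):
    """Number of matched bases (substitutions are not modelled here)."""
    return sum(f for f, t in edits if f == t)


def identity_read_only(edits):
    """Denominator is the summed to_length only, so deletions are invisible."""
    total = sum(t for _, t in edits)
    return matches(edits) / total if total else 0.0


def identity_with_deletions(edits):
    """Denominator also includes the from_length of deletion edits."""
    total = sum(t for _, t in edits) + sum(f for f, t in edits if t == 0)
    return matches(edits) / total if total else 0.0


one_insertion = [(14, 14), (0, 1), (11, 11)]
one_deletion = [(14, 14), (1, 0), (11, 11)]

# Read-only denominator: the insertion is penalised, the deletion is not.
print(identity_read_only(one_insertion) < 1.0)   # True
print(identity_read_only(one_deletion))          # 1.0

# With deletions counted, both indels cost the same.
print(identity_with_deletions(one_insertion) == identity_with_deletions(one_deletion))  # True
```

This only covers deletions that are written as explicit edits; nothing here needs the graph.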


ekg commented 8 years ago

The problem I'm referring to is that sometimes deletions are represented as "split" alignments without the corresponding edit. We go from one mapping to another one at the other side of the deletion. In order to know how far we've gone we will need to load and examine the graph.

Identity as a fraction of the bases in the read that exactly match is easy to calculate. The deleted bases don't exist from the frame of the read. But this makes insertions and deletions asymmetric, which is problematic.

The solution would be to pick up the closest path through the graph matching the alignment and calculate overlap(aln1, aln2) on that. This could be easy as long as we can bail out when we can't quickly find such a path.
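As a sketch of what "find the closest path, but bail out" could look like, the Python below measures how many bases a jump between two mappings skips. It is an illustration only: the graph is a plain dict rather than a vg or xg structure, mappings are assumed to cover whole nodes, orientation is ignored, and min_skipped_bases and max_bases are made-up names.

```python
import heapq


def min_skipped_bases(node_lengths, edges, src, dst, max_bases):
    """Fewest bases on intermediate nodes of any walk src -> dst.

    Returns None (bail out) if no walk skips at most max_bases bases,
    including the case where dst is not reachable from src at all.
    """
    best = {src: 0}        # cheapest known skipped-base count per node
    heap = [(0, src)]      # Dijkstra frontier ordered by skipped bases
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue       # stale heap entry
        for nxt in edges.get(node, ()):
            if nxt == dst:
                return cost
            new_cost = cost + node_lengths[nxt]
            if new_cost <= max_bases and new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return None


# The example from the top of the issue, split into three nodes:
# node 1 = 14 bp, node 2 = the deleted "T", node 3 = 11 bp.
node_lengths = {1: 14, 2: 1, 3: 11}
edges = {1: [2], 2: [3]}

skipped = min_skipped_bases(node_lengths, edges, 1, 3, max_bases=50)
print(skipped)                                                     # 1
print(min_skipped_bases(node_lengths, edges, 3, 1, max_bases=50))  # None

# Identity over columns for a read that matches nodes 1 and 3 perfectly.
matched = node_lengths[1] + node_lengths[3]
print(round(matched / (matched + skipped), 4))                     # 0.9615
```

The None result is the bail-out case; what identity should be reported then is left open here.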


glennhickey commented 8 years ago

Got it. I may check to see if including edit deletions in our use of identity (just cutting off all mappings at 90) has any effect for the heck of it.


ekg commented 8 years ago

After some thought, I'm with the idea of fixing this! It will be a little complicated to do but should be fine.

adamnovak commented 8 years ago

I think identity should ignore alignment splits. You might have a split alignment where you go backwards, and have no path between where your first half ended and where your second half began.

Even when we can calculate it, we don't want super-tiny identities for perfect alignments that just happen to be split, do we?

Or is the aligner articulating normal, ordinary, non-banded-alignment-structural-rearrangement deletions by skipping nodes? Because if that's the case I think I need to remove some of the sanity checks I've put into vg filter that complain when reads take nonexistent edges.
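For reference, the kind of check meant here can be sketched in a few lines of Python. This is not the vg filter implementation; it assumes the alignment is reduced to a list of visited node IDs and the graph to a set of directed (from, to) pairs, ignoring orientation.

```python
def nonexistent_edge_jumps(node_path, edges):
    """Return consecutive (from_node, to_node) pairs that are not edges in the graph."""
    return [(a, b) for a, b in zip(node_path, node_path[1:]) if (a, b) not in edges]


edges = {(1, 2), (2, 3), (3, 4)}

print(nonexistent_edge_jumps([1, 2, 3, 4], edges))  # []
print(nonexistent_edge_jumps([1, 3, 4], edges))     # [(1, 3)]  forward node skip
print(nonexistent_edge_jumps([3, 4, 1], edges))     # [(4, 1)]  backwards split
```

A forward node skip and a backwards split both show up as a jump over a nonexistent edge, so a check like this cannot tell an ordinary deletion from a rearrangement without looking further into the graph.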


ekg commented 8 years ago

@adamnovak At present the default aligner won't do that, but the banded one will, and the MEM-threaded one is likely going to produce alignments that look like this. Also, some normalizations of alignments might produce "node skips" where we could instead say that our "from_length" is greater than our "to_length".

I feel like we need a metric robust to these effects. Maybe we have to look at the graph.