vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

How do paths affect alignments? #2445

Open hgibling opened 5 years ago

hgibling commented 5 years ago

Hello,

Should embedded paths in a graph affect alignment of reads? I'm running into a situation where my alignments have lower scores when aligned to a graph with embedded paths than when aligned to the same graph with the paths removed, which seems a bit counter intuitive. This includes a non-branching graph of hg19--around 30% of simulated reads have different alignment scores when aligned to the hg19 graph with and without a path embedded, and the alignments are mostly lower when aligned to the graph with the path.

This is using vg version v1.9.0-188-ge43f8d6 "Miglionico"

ekg commented 5 years ago

Well, they will. But I'm surprised at this effect.

That's a pretty old vg version! Would you mind checking if you see the same effect in the current (or recent) versions?

ekg commented 5 years ago

To be clear, they are used for positional clustering. So in principle they should be improving the alignment, not hurting it. That's why the effect is surprising.

hgibling commented 5 years ago

With vg version v1.17.0 "Candida" (precompiled) it's much better: about 0.2% of simulated reads have different alignment scores when aligned to the hg19 graph with and without a path embedded. But aligning to the pathless graph is still resulting in higher scores.

ekg commented 5 years ago

This is expected with paired end mapping. How are you doing the alignment? The constrained mapping can result in lower scores.

hgibling commented 4 years ago

Paired end without changing any defaults: vg map -d $GRAPH -f $READS.r1.fq -f $READS.r2.fq

hgibling commented 4 years ago

Looking into this a bit more. Here's gam info for a particular read:

Aligned to graph:

105,105,105,110,110,105,105,0,0],"fragment_length_distribution":"5000:0:0:0:1",
{"sequence":"CCTCCTCACTCACCAGAGGACACACACAGGGGAGAAGCCCTATGTCTGCAGGGAGTGTGGGCGGGGCTTTAGCCGGCAGTCAGTCCTCCTCACTCACCAG","path":{"mapping":[{"position":{"node_id":"59","offset":"197"},"edit":[{"from_length":100,"to_le
ngth":100}],"rank":"1"}]},"name":"A_10198_10410_0:0:0_0:0:0_219/1","quality":"KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoK
A==","score":93,"fragment_next":{"name":"A_10198_10410_0:0:0_0:0:0_219/2"},"identity":1,"secondary_score":[110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,
110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110
,110,110,110,110,110,110,110,110,110,110,110,110,110,110,93,110,110,110,110,93,93,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,110,93,93,110,93,110,110,110,110,110,110,110,110,
110,110,110,110,110,110,110,110,110,93,110,110,110,110,110,110,110,110,110,93,110,110,110,93,110,110,110,110,110,110,110,110,110,110,110,110,110,110,93,110,110,93,110,110,110,110,110,110,110,110,110,110,110,93,1
10,110,110,93,93,93,93,110,93,93,110,93,93,93,93,88,110,105,110,105,110,105,110,105,110,110,110],"time_used":1502337}

Aligned to hg19:

{"sequence":"CCTCCTCACTCACCAGAGGACACACACAGGGGAGAAGCCCTATGTCTGCAGGGAGTGTGGGCGGGGCTTTAGCCGGCAGTCAGTCCTCCTCACTCACCAG","path":{"mapping":[{"position":{"node_id":"11","offset":"197"},"edit":[{"from_length":100,"to_length":100}],"rank":"1"}]},"name":"A_10198_10410_0:0:0_0:0:0_219/1",
"quality":"KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKA==","mapping_quality":60,"score":110,"fragment_next":{"name":"A_10198_10410_0:0:0_0:0:0_219/2"},"identity":1,"secondary_score":[93,0],"fragment_length_distribution":"5000:0:0:0:1","time_used":13852}

The secondary score vector is the alignment score for other mappings, correct? For the graph, why would the selected alignment be one with a score of 93 when there are several possibilities with a score of 110?

jeizenga commented 4 years ago

It could be that the higher-scoring alignments were incompatible with the fragment length distribution.

hgibling commented 4 years ago

I tried specifying expected fragment length distributions and that didn't make much of a difference.

One thing I noticed is that if I map a single paired-end read, vg gives me an alignment with a perfect score:

{"sequence":"GGTTTCAGTGTTAAATCAGATGTTATTACACACCAAAGGACACATACAGGGGAGAAGCTCTACGTCTGCAGGGAGTGTGGGCGGGGCTTTAGCTGGAAGT","path":{"mapping":[{"position":{"node_id":"103","offset":"9"},"edit":[{"from_length":100,"to_length":100}],"rank":"1"}]},"name":"A_10010_10242_0:0:0_0:0:0_9bf/1","quality":"KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKA==","score":110,"fragment_next":{"name":"A_10010_10242_0:0:0_0:0:0_9bf/2"},"identity":1,"secondary_score":[110,110,110,93,0,0,0],"fragment_length_distribution":"5000:0:0:0:1","time_used":1630060}
{"sequence":"ACATAGGGCTTCTCCCCTGTGTGTGTCCTCTGGTGAGTGAGGAGGACTGACTGCCAGCTAAAGCCCCGCCCACACTCCCTGCAGACATAGGGCTTCTCCC","path":{"mapping":[{"position":{"node_id":"103","offset":"304","is_reverse":true},"edit":[{"from_length":100,"to_length":100}],"rank":"1"}]},"name":"A_10010_10242_0:0:0_0:0:0_9bf/2","quality":"KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKA==","score":110,"fragment_prev":{"name":"A_10010_10242_0:0:0_0:0:0_9bf/1"},"identity":1,"secondary_score":[110,110,110,93,93,93,0],"fragment_length_distribution":"5000:0:0:0:1","time_used":1630060}

Grepping that same read pair from the gam resulting from aligning the full set of simulated reads:

{"sequence":"GGTTTCAGTGTTAAATCAGATGTTATTACACACCAAAGGACACATACAGGGGAGAAGCTCTACGTCTGCAGGGAGTGTGGGCGGGGCTTTAGCTGGAAGT","path":{"mapping":[{"position":{"node_id":"103","offset":"9"},"edit":[{"from_length":100,"to_length":100}],"rank":"1"}]},"name":"A_10010_10242_0:0:0_0:0:0_9bf/1","quality":"KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKA==","score":93,"fragment_next":{"name":"A_10010_10242_0:0:0_0:0:0_9bf/2"},"identity":1,"secondary_score":[110,110,110,110,93,110,93,0],"fragment_length_distribution":"5000:250:50:0:1","time_used":2265261}
{"sequence":"ACATAGGGCTTCTCCCCTGTGTGTGTCCTCTGGTGAGTGAGGAGGACTGACTGCCAGCTAAAGCCCCGCCCACACTCCCTGCAGACATAGGGCTTCTCCC","path":{"mapping":[{"position":{"node_id":"71","offset":"304","is_reverse":true},"edit":[{"from_length":100,"to_length":100}],"rank":"1"}]},"name":"A_10010_10242_0:0:0_0:0:0_9bf/2","quality":"KCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKA==","score":93,"fragment_prev":{"name":"A_10010_10242_0:0:0_0:0:0_9bf/1"},"identity":1,"secondary_score":[110,110,110,110,93,110,0,0],"fragment_length_distribution":"5000:250:50:0:1","time_used":2265261}

Should this be happening? Especially strange since read 1 is mapping to the same node and same position in both cases, but being given different scores.

hgibling commented 4 years ago

Checking if anyone has had a chance to look at this :)

ekg commented 4 years ago

Does the graph have embedded paths? I don't see reference-relative positions on the alignments. Sometimes those can help clarify what's going on.

It looks (based on secondary scores) that there are many equivalent alignments for this read pair. In the first case, the fragment length distribution is set at the default, and so it doesn't change the score based on whether the pair is "proper" or not. In the second, you've aligned a lot of reads and it's figured out the fragment length distribution (or you've set it?) and it seems to be subtracting 17 from the alignment score for not being "properly" paired.

On Tue, Nov 19, 2019 at 10:16 PM Heather Gibling notifications@github.com wrote:

Checking if anyone has had a chance to look at this :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2445?email_source=notifications&email_token=AABDQEPEZVW2FJLDKVI3BO3QURJTLA5CNFSM4IVVWAB2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEPZQOI#issuecomment-555718713, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOBQD3FIG3HHWTJ7Y3QURJTLANCNFSM4IVVWABQ .

hgibling commented 4 years ago

Thanks for the quick response and sorry for the delay in replying! I reran some troubleshooting tests comparing paths and fragment distribution parameters using v1.21.0 and discovered that the -U flag is pretty important for mapping if your graph includes path information:

image

Though there is a slight negative effect if your "graph" is just the linear reference: image

I should be able to continue my work with the -U parameter. Thanks!