vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 193 forks source link

Very different mapping results with newer vg version #657

Open hgibling opened 7 years ago

hgibling commented 7 years ago

Hello,

A while back I aligned simulated reads to a graph using vg v1.4.0-1017-g0042f59. As part of a new test, I've reindexed the graph and remapped the simulated reads using v1.4.0-2142-gfe5e050 (as the old indices were not compatible with the new vg map). But I'm getting drastically different alignment scores--for several reads they are now much worse.

vg view -a old-index-map.gam | jq '.score' | sort -n | uniq -c

      5 90
    160 95
      4 96
      6 97
      2 98
      3 99
   1820 100

vg view -a new-index-map.gam | jq '.score' | sort -n | uniq -c

     32 null
      2 4
     19 5
      3 6
      1 7
      3 14
      1 16
      1 18
      2 42
      9 47
     19 52
      1 65
      1 74
      1 75
      4 80
      1 84
      1 85
      2 87
     29 90
      1 91
     15 92
    227 95
      2 96
      3 97
      1 98
   1619 100

The most extreme example had a score of 95 under the older vg mapping, but now the read doesn't map at all. I've isolated the read in question (A_586_923_1:0:0_0:0:0_1ec/2) and mapped just it and its pair, where it again is unmapped (the pair maps perfectly):

vg map -f read1.fq -f read2.fq -x new-index.xg -g new-index.gcsa graph.vg > new-index-read1-read2-sep.gam
vg view -a new-index-read1-read2-sep.gam | jq '.score'

100
null

But when I map just the read, it maps as expected:

vg map -f read2.fq -x new-index.xg -g new-index.gcsa graph.vg > new-index-read2.gam
vg view -a new-index-read2.gam | jq '.score'

95

And when I map both the read and its pair in the same fq, it maps as expected:

vg map -f read1read2.fq -x new-index.xg -g new-index.gcsa graph.vg > new-index-read1-read2-both.gam
vg view -a new-index-read1-read2-both.gam | jq '.score'

100
95

The reads were simulated from a path in the graph with a small mutation rate, so I don't see why they all shouldn't map with high scores (90-100 for 100bp reads). A_586_923_1:0:0_0:0:0_1ec/2 has only a single mutation. I'm not sure if the issue is with the new indices or the new mapping function... thoughts on what's going on?

I've attached the graph/indices/etc. (The old indices used -k 11, while the new ones used -k 15, but I reindexed using old vg with -k 15 and get the same output as with old vg and -k 11. Indexing using new vg and -k 11 gives slightly different results from new vg and -k 15, but still the same pattern of several reads mapping poorly when they shouldn't.)

ekg commented 7 years ago

I'm currently working on the paired end mapping. It wouldn't be strange to me that the current master has problems with pairs, as simulations suggest we weren't dong very well even though there were big improvements in single-ended performance. I'll ping this thread when I've pushed the most recent developments. For the moment you can fall back on the old pairing model using the -a flag to the mapper.

On Thu, Feb 9, 2017 at 11:56 PM Heather Gibling notifications@github.com wrote:

Hello,

A while back I aligned simulated reads to a graph using vg v1.4.0-1017-g0042f59. As part of a new test, I've reindexed the graph and remapped the simulated reads using v1.4.0-2142-gfe5e050 (as the old indices were not compatible with the new vg map). But I'm getting drastically different alignment scores--for several reads they are now much worse.

vg view -a old-index-map.gam | jq '.score' | sort -n | uniq -c

  5 90
160 95
  4 96
  6 97
  2 98
  3 99

1820 100

vg view -a new-index-map.gam | jq '.score' | sort -n | uniq -c

 32 null
  2 4
 19 5
  3 6
  1 7
  3 14
  1 16
  1 18
  2 42
  9 47
 19 52
  1 65
  1 74
  1 75
  4 80
  1 84
  1 85
  2 87
 29 90
  1 91
 15 92
227 95
  2 96
  3 97
  1 98

1619 100

The most extreme example had a score of 95 under the older vg mapping, but now the read doesn't map at all. I've isolated the read in question (A_586_923_1:0:0_0:0:0_1ec/2) and mapped just it and its pair, where it again is unmapped (the pair maps perfectly):

vg map -f read1.fq -f read2.fq -x new-index.xg -g new-index.gcsa graph.vg > new-index-read1-read2-sep.gam vg view -a new-index-read1-read2-sep.gam | jq '.score'

100 null

But when I map just the read, it maps as expected:

vg map -f read2.fq -x new-index.xg -g new-index.gcsa graph.vg > new-index-read2.gam vg view -a new-index-read2.gam | jq '.score'

95

And when I map both the read and its pair in the same fq, it maps as expected:

vg map -f read1read2.fq -x new-index.xg -g new-index.gcsa graph.vg > new-index-read1-read2-both.gam vg view -a new-index-read1-read2-both.gam | jq '.score'

100 95

The reads were simulated from a path in the graph with a small mutation rate, so I don't see why they all shouldn't map with high scores (90-100 for 100bp reads). A_586_923_1:0:0_0:0:0_1ec/2 has only a single mutation. I'm not sure if the issue is with the new indices or the new mapping function... thoughts on what's going on?

I've attached https://github.com/vgteam/vg/files/765210/share.tar.gz the graph/indices/etc. (The old indices used -k 11, while the new ones used -k 15, but I reindexed using old vg with -k 15 and get the same output as with old vg and -k 11. Indexing using new vg and -k 11 gives slightly different results from new vg and -k 15, but still the same pattern of several reads mapping poorly when they shouldn't.)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/657, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EXvTLFuzYXSddC1vw7XFCGGu3Vjhks5ra5mFgaJpZM4L8xoD .

hgibling commented 7 years ago

Awesome, thanks!

ekg commented 7 years ago

Would you please check things on the current HEAD of master?

I've just pushed changes to the paired end mapping code that appear on my tests to get us to equivalence with bwa mem on a linear reference and improvement with a variation graph.

On Fri, Feb 10, 2017 at 7:58 PM Heather Gibling notifications@github.com wrote:

Awesome, thanks!

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/657#issuecomment-279033985, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EZ_b-fET_Tyj-nv0EAf94Zkp-Q1Dks5rbLNNgaJpZM4L8xoD .

hgibling commented 7 years ago

Slightly better with v1.4.0-2212-gc357e48, but still not what I got originally

      1 null
      2 4
      6 5
      1 15
      3 42
      1 47
      5 52
      2 80
      1 82
      1 85
      1 87
     20 90
     11 92
    207 95
   1738 100
hgibling commented 7 years ago

Even with v1.4.0-2142-gfe5e050 map -a, I'm getting some reads that aren't aligning to a graph when two -f flags are given for paired end reads, but when I cat the fqs and submit them all under one -f flag, they map with very high scores.

This is for a different graph than what I discussed initially in this thread, but it shouldn't be happening. It doesn't occur when I index the graph and map with v1.4.0-1017-g0042f59.

ekg commented 7 years ago

I see, it is probably because of changes in the paired end mapping code that require a fragment model to be matched in the clustering step. It could be a problem with that being too restrictive. Or, it could be that the single ended mapping is too permissive about where the reads map. Do you have a sense of whether the reads are being placed correctly when mapped single end but not when mapped as a pair?

On Wed, Feb 15, 2017, 4:59 PM Heather Gibling notifications@github.com wrote:

Even with v1.4.0-2142-gfe5e050 map -a, I'm getting some reads that aren't aligning to a graph when two -f flags are given for paired end reads, but when I cat the fqs and submit them all under one -f flag, they map with very high scores.

This is for a different graph than what I discussed initially in this thread, but it shouldn't be happening...

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/657#issuecomment-280051354, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EWECSVLOhgXbTzw8VBz10XdQREb4ks5rcyDrgaJpZM4L8xoD .

hgibling commented 7 years ago

Yes, they are being placed corrected when mapped single end

hgibling commented 7 years ago

Is there a resource you could perhaps point me towards to understand the differences between the id clustering and MEM-threading mapping?

ekg commented 7 years ago

@hgibling here's a summary.

In the old version (accessible still via vg map -a) we would generate clusters based on the proximity of the MEM hits in id space. Provided the graph is partially ordered, this would be fine. Pair rescue was very different too, with a large id range of the graph being considered.

Now, the clustering generates partial alignments. These alignments "thread" through MEMs, thus the name of the pattern. Then the partial alignments are patched up as needed by aligning the unaligned portions of the read between or off the ends of the MEMs that we've threaded through.

A pair model is learned from the reads that map with above a certain level of identity to the graph. The model is used to drive the clustering (pairs are aligned as one fragment with a long internal gap), and also when trying to align one read off the other in the "pair rescue" process.

This pair model can be specified. One possibility with some of the reads is that the pair model is not built up yet, but they are not flagged for realignment once it has been. To see if this might be the problem you can specify the pair distribution parameters on the command line with vg map -W ...:

    -W, --fragment m:μ:σ:o:d  fragment length distribution specification to use in paired mapping (default: 1e4:0:0:0:1)
                              max, mean, stdev, orientation (1=same, 0=flip), direction (1=forward, 0=backward)

If you could share this example I might also be able to look at what's going on.

For what it's worth, the new mapper outperforms bwa mem on the DAG like graphs I have made from VCF files. It could be that your graph violates some assumptions that should be relaxed or generalized. At the moment I can't think of what they might be. We can also simulate reads from the pangenome and measure their performance with the new and old mappers by using the -a flag.

hgibling commented 7 years ago

Ok, so with vg v1.4.0-2212-gc357e48 and W 1e4:245:50:0:1

Original graph I was talking about in this thread: vg map -W

    269 null
      1 5
      1 61
      2 75
      1 77
      3 79
      3 80
      1 82
      2 84
      2 87
      4 90
      1 92
     86 95
   1624 100

vg map -W -a

      2 null
      5 90
    159 95
      4 96
      6 97
      2 98
      3 99
   1819 100

Different graph I mentioned here vg map -W

    325 null
      1 10
      3 75
      3 80
      2 85
      1 89
     18 90
      1 92
      1 93
    166 95
      1 96
      2 97
      1 98
      1 99
   1474 100

vg map -W -a

    732 null
      3 90
    103 95
      2 96
      3 97
      1 98
      2 99
   1154 100

Still getting reads that aren't mapping at all. Entirely possible I'm misunderstanding the -W parameters--I simulated reads with wgsim -d245 -s50 -e0.0 -N1000 -1100 -2100 -r0 -R0 -X0, so figured I'd keep max fragment length at 1e4, mean=245, sd=50. Let me know if this is incorrect. All reads map great with v1.4.0-1017-g0042f59.

For what it's worth, the new mapper outperforms bwa mem on the DAG like graphs I have made from VCF files. It could be that your graph violates some assumptions that should be relaxed or generalized. At the moment I can't think of what they might be. We can also simulate reads from the pangenome and measure their performance with the new and old mappers by using the -a flag.

My graphs are directed but cyclic. The original one: old

And the newer one: new

If there are specific files I can share to help, let me know!

ekg commented 7 years ago

I wonder if this is suggesting that certain aspects of the new mapper are too complex for sensible use. Or there might be a bug in the clustering for cyclic graphs. I can look into both of these possibilities.

It is interesting that on one graph the new mapper is doing better. While on the other graph the pattern is reversed. That suggests some kind of optimization toward a particular type of graph.

On Fri, Feb 24, 2017, 10:53 PM Heather Gibling notifications@github.com wrote:

Ok, so with vg v1.4.0-2212-gc357e48 and W 1e4:245:50:0:1

Original graph I was talking about in this thread: vg map -W

269 null
  1 5
  1 61
  2 75
  1 77
  3 79
  3 80
  1 82
  2 84
  2 87
  4 90
  1 92
 86 95

1624 100

vg map -W -a

  2 null
  5 90
159 95
  4 96
  6 97
  2 98
  3 99

1819 100

Different graph I mentioned here https://github.com/vgteam/vg/issues/657#issuecomment-280051354 vg map -W

325 null
  1 10
  3 75
  3 80
  2 85
  1 89
 18 90
  1 92
  1 93
166 95
  1 96
  2 97
  1 98
  1 99

1474 100

vg map -W -a

732 null
  3 90
103 95
  2 96
  3 97
  1 98
  2 99

1154 100

Still getting reads that aren't mapping at all. Entirely possible I'm misunderstanding the -W parameters--I simulated reads with wgsim -d245 -s50 -e0.0 -N1000 -1100 -2100 -r0 -R0 -X0, so figured I'd keep max fragment length at 1e4, mean=245, sd=50. Let me know if this is incorrect. All reads map great with v1.4.0-1017-g0042f59.

For what it's worth, the new mapper outperforms bwa mem on the DAG like graphs I have made from VCF files. It could be that your graph violates some assumptions that should be relaxed or generalized. At the moment I can't think of what they might be. We can also simulate reads from the pangenome and measure their performance with the new and old mappers by using the -a flag.

My graphs are directed but cyclic. The original one: [image: old] https://cloud.githubusercontent.com/assets/7570351/23322104/7a498ed8-fab0-11e6-8641-5459fb8bd5a7.png

And the newer one: [image: new] https://cloud.githubusercontent.com/assets/7570351/23322240/04032a58-fab1-11e6-865c-d0cc800a5eca.png

If there are specific files I can share to help, let me know!

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/657#issuecomment-282415753, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Ec5sp8AlBT9XUBBer4hrt-tw6l_Dks5rf1FggaJpZM4L8xoD .