jmonlong / manu-vgsv

https://jmonlong.github.io/manu-vgsv/

Yeast: Sample graph evaluation on SV callset #95

Closed eldariont closed 5 years ago

eldariont commented 5 years ago

Hi,

In response to Jonas' PR #83, I noticed that I was actually comparing general genotyping performance instead of SV genotyping performance, because my workflow was like this:

mapped reads (to original graph) -> callset (SNPs, Indels, SVs) -> sample graph -> mapped reads (to sample graph) -> mapq and perc. id distribution

instead of:

mapped reads (to original graph) -> callset (SNPs, Indels, SVs) -> callset (SVs only) -> sample graph -> mapped reads (to sample graph) -> mapq and perc. id distribution
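
The extra step in the second workflow is essentially a size filter on the genotyped VCF before the sample graph gets built. Roughly, it could look like this sketch (file names are placeholders, 50 bp is used as the SV size cutoff, and symbolic ALT alleles would need the SVLEN INFO field instead):

```python
# Minimal sketch: keep only SV-sized records (REF/ALT length difference >= 50 bp)
# from a genotyped VCF before building the sample graph.
# File names are placeholders; symbolic ALT alleles (e.g. <DEL>) would need the
# SVLEN INFO field instead of the allele length difference used here.
import gzip

MIN_SV_LEN = 50

def is_sv(ref, alts):
    # A record counts as an SV if any ALT allele differs from REF by >= MIN_SV_LEN.
    return any(abs(len(alt) - len(ref)) >= MIN_SV_LEN for alt in alts)

with gzip.open("callset.vcf.gz", "rt") as vcf, open("callset.sv_only.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#"):
            out.write(line)  # keep the header unchanged
        else:
            fields = line.split("\t")
            if is_sv(fields[3], fields[4].split(",")):
                out.write(line)
```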

Now I finally have results for the second workflow, too, and the results make a lot of sense.

When comparing general genotyping performance (first workflow, Fig 1):

Fig 1: [image: constructunion four reads identity]

When comparing SV genotyping performance only (second workflow, Fig 2):

Fig 2: [image: constructunion four reads identity]

eldariont commented 5 years ago

The question now is: which of these results should we show in the main text and which in the supplementary material only? In my opinion, the SV-only results (Fig 2) are not very impressive because the improvement over the VCF graph is so slight. Fig 1, in contrast, shows that the cactus graph has the power to genotype all types of variants so accurately that we reach between 80% and 90% identity of the reads to the callset (i.e. the sample graph) even in the distant clade, while the VCF approach reaches only 60% to 70%. On the other hand, the entire manuscript is about SV genotyping, so it might be more coherent to show improvement in SV genotyping only (i.e. Fig 2).

Our problem is that there is no perfect way to measure SV genotyping accuracy in yeast. The sample graph approach basically evaluates all types of variation because the reads that we map to the sample graph contain all types of variation. Therefore, the contribution of SVs is rather small as can be seen in Fig. 2. Using a gold standard on the other hand evaluates SVs specifically but there is no good standard available for yeast.

What is your opinion on this?

benedictpaten commented 5 years ago

David, what about the equivalent avg. mapping quality plot?

jmonlong commented 5 years ago

Yes, it's a bit tricky to compare those graphs: by filtering out small variants in the cactus graph we underestimate its value a bit and risk filtering out SVs that were fragmented during genome alignment. But it's the most relevant and fair comparison for an SV genotyping paper.

We are currently showing average mapping quality or identity, but many of the reads fall outside of the SV regions and map similarly in both graphs, which makes the differences harder to see. In the short term we could try to zoom in on the SV regions.

One way would be to count how many reads map equally well in both graphs, better in the cactus graph, or better in the VCF graph. It might take a bit of work to match the reads between the two graphs, although some scripts might already exist for that from the read mapping evaluation work.
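
As a rough sketch of that counting, assuming per-read identities have already been exported from each GAM into a simple read name / identity table (e.g. via `vg view -a` and a bit of JSON munging, or the existing evaluation scripts):

```python
# Count reads that align better to the cactus sample graph, better to the VCF sample
# graph, or equally well in both, matched by read name.
# The input format (read name <TAB> identity per line) and file names are assumptions
# about how the per-read stats were exported; the comparison logic is the point here.

def load_identity(path):
    ident = {}
    with open(path) as f:
        for line in f:
            name, value = line.rstrip("\n").split("\t")
            ident[name] = float(value)
    return ident

cactus = load_identity("cactus_sample_graph.identity.tsv")
vcf = load_identity("vcf_sample_graph.identity.tsv")

better_cactus = better_vcf = equal = 0
for name in cactus.keys() & vcf.keys():
    if cactus[name] > vcf[name]:
        better_cactus += 1
    elif cactus[name] < vcf[name]:
        better_vcf += 1
    else:
        equal += 1

print(f"better in cactus: {better_cactus}, better in VCF: {better_vcf}, equal: {equal}")
```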

Another way would be to compute the current metrics but only in regions that contain SVs: for example, joining the SV calls from both the cactus and VCF graphs into a set of candidate SV regions, then filtering the GAMs on these regions and computing the same metrics.
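
For the candidate regions, merging the SV calls from both graphs could be as simple as the sketch below (BED-like inputs and the amount of padding are just placeholders):

```python
# Merge SV calls from the cactus and VCF graphs into a union of candidate SV regions.
# Inputs are assumed to be BED-like files (chrom, start, end); PAD and the file names
# are illustrative, not the values used in the analysis.
PAD = 500

def read_bed(path):
    with open(path) as f:
        for line in f:
            chrom, start, end = line.split()[:3]
            yield chrom, max(0, int(start) - PAD), int(end) + PAD

calls = sorted(list(read_bed("cactus_svs.bed")) + list(read_bed("vcf_svs.bed")))

merged = []
for chrom, start, end in calls:
    if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
        merged[-1][2] = max(merged[-1][2], end)  # overlapping calls: extend the region
    else:
        merged.append([chrom, start, end])

with open("sv_regions.bed", "w") as out:
    for chrom, start, end in merged:
        out.write(f"{chrom}\t{start}\t{end}\n")
```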

@eldariont, does one of these analyses seem doable?

glennhickey commented 5 years ago

I too am worried about fragmentation when using a size cutoff to filter out non-SV variants. vg call can sometimes represent SV haplotypes as a series of smaller variant lines in the VCF. For the human data, we get around this by using Jean's sveval tool, which does a good job of working around it. This issue will be even more apparent when running on the Cactus graphs, I think.

If none of the three SV detection tools can do SNPs, can we not just use, say, Freebayes and the Illumina reads to call SNPs for each sample and then add them to the VCF graph for comparison? This seems like the only way to make it fair to me.

eldariont commented 5 years ago

@benedictpaten: The average mapping quality plots show an identical trend (please excuse the different scales):

General genotyping performance: [image: constructunion four reads mapq nofiltering]

SV genotyping performance: [image: constructunion four reads mapq]

eldariont commented 5 years ago

Thanks for your ideas.

One way would be to count how many reads map equally well in both graphs, better in the cactus graph, or better in the VCF graph. It might take a bit of work to match the reads between the two graphs, although some scripts might already exist for that from the read mapping evaluation work.

That's an interesting idea. Indeed I would expect that the two sample graphs are largely identical except at the SV regions. So the reads that map better to the VCF sample graph indicate that the VCF graph did a better job at genotyping and likewise for the cactus graph. So instead of plotting the average mapping quality of all the reads we could count how many reads map better to each sample graph? Hm, I'm not sure whether this would tell us more than Fig. 2 above already does: that more reads map better to the cactus sample graph but that most reads map equally to both sample graphs.

Another way would be to compute the current metrics but only in regions that contain SVs: for example, joining the SV calls from both the cactus and VCF graphs into a set of candidate SV regions, then filtering the GAMs on these regions and computing the same metrics.

That sounds very promising. It would give a clear picture of how well the SV regions represent the reads which is what we want to measure. The only problem I see is that this approach does not consider reads that stay unmapped because the genotype call is wrong or missing. I'm not sure how severe this might be but it's definitely worth a try.

Which command would you propose for the filtering? vg surject and then filtering the SAM by position?
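
For instance, if the surjected alignments were dumped to a read name / chrom / position table, the position filtering itself could be something like this sketch (file names and formats are placeholders, using the merged SV regions from above):

```python
# Restrict the evaluation to reads whose surjected position falls inside an SV region.
# Assumes the alignments have been surjected and summarised as
# read name <TAB> chrom <TAB> position; file names and formats are illustrative.
from collections import defaultdict
import bisect

regions = defaultdict(list)
with open("sv_regions.bed") as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
for chrom in regions:
    regions[chrom].sort()

def in_sv_region(chrom, pos):
    # Find the last region starting at or before pos and check it still covers pos.
    intervals = regions.get(chrom, [])
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

with open("reads_positions.tsv") as reads, open("reads_in_sv_regions.txt", "w") as out:
    for line in reads:
        name, chrom, pos = line.rstrip("\n").split("\t")
        if in_sv_region(chrom, int(pos)):
            out.write(name + "\n")
```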

eldariont commented 5 years ago

I too am worried about fragmentation when using a size cutoff to filter out non-SV variants. vg call can sometimes represent SV haplotypes as a series of smaller variant lines in the VCF. For the human data, we get around this by using Jean's sveval tool, which does a good job of working around it. This issue will be even more apparent when running on the Cactus graphs, I think.

Yeah, that is a problem but I don't see a good way around it because we don't have a gold standard. And despite the disadvantage that the cactus graph probably has because it's more fragmented, it still seems to beat the VCF graph.

If none of the three SV detection tools can do SNPs, can we not just use, say, Freebayes and the Illumina reads to call SNPs for each sample and then add them to the VCF graph for comparison? This seems like the only way to make it fair to me.

Yes, I agree that this would make the comparison fairer. It wouldn't solve the problem, though. We would still have two genotype callsets with SNPs, Indels and SVs that we need to evaluate somehow. What do you think about zooming in on the SV regions by filtering the GAM as Jean suggested?

glennhickey commented 5 years ago

I'm ok with both of @jmonlong's suggestions. Zooming into SV regions sounds worth a try, but I think you will need to move away from just reporting averages (as you can have very different read totals between the two methods).

But I'm still trying to get a feeling for what's going wrong. In your original figures, where cactus was much better for paradoxus, you were comparing:

Then you changed the comparison to (I think?)

and the difference with paradoxus goes away. So what can we conclude? That the SNPs and indels inferred from Cactus help mapping much more than the SNPs and indels called by toil-vg? This would maybe suggest a bug in toil-vg call's SNP calling parameters (it certainly hasn't been used much on yeast). The alternative would be that calling VCFs on the cactus graphs leads to uglier calls, and that fragmentation means a size-based filter removes some SVs. If this second effect is important (I have no idea if it is), that would bias most other types of comparison (such as zooming into regions) too.

I think it would be informative to look at a couple of reads that behave differently in the two comparisons and try to see what's going on. I can help with this, if you point me to the data.

eldariont commented 5 years ago
  • Sample graph from toil-vg recall on the Cactus graph, but filtering out variants < 50bp
  • Sample graph from toil-vg recall on the VCF graph, but filtering out variants < 50bp

The VCF sample graph comes from toil-vg call, not recall.

So what can we conclude? That the SNPs and indels inferred from Cactus help mapping much more than the SNPs and indels called by toil-vg? This would maybe suggest a bug in toil-vg call's SNP calling parameters (it certainly hasn't been used much on yeast).

Yes, I think we can conclude that the SNPs and indels inferred from Cactus help mapping much more than the SNPs and indels called by toil-vg. But I'm not sure this has to be due to a bug. Couldn't it just be that SNP and indel calling is more effective on the cactus graph because the reads map better to it? vg can call variants from reads that map to the cactus graph but not to the VCF graph.

The alternative would be that calling VCFs on the cactus graphs leads to uglier calls, and fragmentation means a size-based filter removes some SVs. If this second effect is important (I have no idea if it is), that would bias most other types of comparison (such as zooming into regions) too.

Yeah, fragmentation is probably an important component. The results that Jean reported in Slack (that a considerable fraction of reads map better to the VCF sample graph) seem to suggest that, too. But at the same time I think that the contribution of the SVs to the metrics we use is limited, which might be a reason why the improvement of cactus over VCF gets so small. When we compare percent identity and average mapq, mismatches and small indels in the alignments probably play a larger role than the big SVs.

glennhickey commented 5 years ago

The VCF sample graph comes from toil-vg call, not recall.

I think this has come up before, but comparing two sample graphs, one made with --recall and one made without, potentially introduces a hefty bias. The previous justification, as I understood it, was to be able to get SNPs and indels into the VCF graph. If we're limiting ourselves to just SVs, though, there is no need for that, and it would be much better to use --recall on both. Otherwise, cactus gets an advantage from having a much lower default threshold for an SV to end up in the sample graph.

Yes, I think we can conclude that the SNPs and indels inferred from Cactus help mapping much more than the SNPs and indels called by toil-vg. But I'm not sure this has to be due to a bug. Couldn't it just be that SNP and indel calling is more effective on the cactus graph because the reads map better to it? vg can call variants from reads that map to the cactus graph but not to the VCF graph.

I am skeptical of this. In my experience, the benefits of moving from a flat reference to cactus can often be pretty marginal for most indels and SNPs. It's not impossible, but it's hard to believe that's the main driver here.

eldariont commented 5 years ago

This issue can be closed now that PR #101 contains updated results. Most importantly, toil-vg call was run with the --recall option on both graphs.