jmonlong / manu-vgsv

https://jmonlong.github.io/manu-vgsv/

Compare mapping to sample graph vs. mapping to full graph #76

Closed · eldariont closed this 5 years ago

eldariont commented 5 years ago

This is a good comment that Jonas made in the Google doc: it might be interesting to compare mapping/identity statistics between the sample graph and the full graph. This would give an indication of the quality of the genotyping, since wrong genotypes would likely result in fewer mapped reads and lower identity. I have the data now and will generate a plot tomorrow.

eldariont commented 5 years ago

All the plots from the yeast data basically visualize the same thing: how well the same sets of Illumina short reads map to graphs generated with the VCF or cactus approach.

  • For the mapping evaluation, the plot shows how well the reads map to the original (full) graphs.
  • For the genotyping evaluation, the plot shows how well the reads map to the sample graph (derived from the genotype calls).

Until now, I had never overlaid both kinds of data, but the results look quite interesting:

Mapping quality: ![mapping mapq full vs sample](https://user-images.githubusercontent.com/6477692/57377777-8f579380-71a3-11e9-9804-02c3ae0a960a.png)

Alignment identity: ![mapping id full vs sample](https://user-images.githubusercontent.com/6477692/57377782-91b9ed80-71a3-11e9-8da3-44605fa52532.png)

For the cactus graph, the reads map better to the full graph, as we would expect. Surprisingly, this is not the case for the VCF graph (x-axis): more reads are mapped with mapq > 0 to the sample graph than to the full graph, and far more reads are mapped with 100% identity to the sample graph than to the full graph.

I have two ideas on this and would be curious to hear yours as well:

  1. The genotyping (run with --recall for the cactus graph but without --recall for the VCF graph) not only picks up SVs but also SNVs and indels, which all get included in the sample graph. As a consequence, the sample graph is much better than the original graph and more reads get mapped.
  2. The original VCF graph might be more repetitive because it contains the variation from several strains. And unlike the cactus graph, each variant exists as a separate branch off the reference path. Due to the repeat content, many reads might be mapped with mapq=0 to the full graph but not to the sample graph, where much of that variation is removed.

I'm not sure which conclusions to draw from this. One problem is that the cactus graph gets a head start because it contains all types of variation and not only SVs. I do not see a way around this in our experiments but it's one more reason to move the mapping evaluation plot to the supplement. It's not saying much beyond that the set of variants included in the VCF is incomplete. But it's a good point in favor of the cactus approach because it does not require running different variant callers to obtain all these different types of variants.
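For reference, the full-vs-sample comparison above can be sketched with vg roughly as below. This is only a sketch: the file names are placeholders, the indexes are assumed to already exist for each graph, and the exact fields reported by `vg stats` may differ between vg versions.

```shell
# Map the same Illumina read set to the full graph and to the sample graph,
# then collect alignment statistics (mapq, identity) for plotting.
# Assumes xg/gcsa indexes were built beforehand with `vg index`.
for graph in full sample; do
    vg map -x ${graph}.xg -g ${graph}.gcsa \
        -f reads_1.fq -f reads_2.fq > ${graph}.gam
    vg stats -a ${graph}.gam > ${graph}.stats.txt
done
```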

glennhickey commented 5 years ago

Good points. The mapping experiments don't allow us to just cut off variation below 50bp like we do in the rest of the paper, so we have to be careful.

For "whole-graph" comparisons, I guess we would need to include SNP calls in the VCF graph in order to properly compare it to the Cactus graph. This isn't hard from a vg point of view, I don't think, but I don't know if you have a way of getting SNPs from your data. Moving to the supplement and making clear that this effect could be the source of the signal we're seeing seems reasonable too. Like you say, it's a benefit of Cactus that you get all variation from it as opposed to just SVs from asmvar.

Allowing the VCF sample graph to get augmented by vg call mitigates this somewhat, but I think there could still be a bias here. Toggling --recall on and off changes the output substantially. It may be fairer to run them both without --recall.

Graph normalization can have a substantial effect on mapqs for the reasons you state in point 2). Are you normalizing your graphs? Putting it through `vg mod -U 10 graph.vg | vg mod -X 32 -` could help considerably. It did for the HGSVC graphs, which suffered from the same problem (similar insertions from different samples lowering MAPQ).
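Concretely, the normalization step could look like the sketch below. The graph and index file names are placeholders, and the indexing flags are only indicative; only the `vg mod -U 10 … | vg mod -X 32 -` pipeline itself comes from this thread.

```shell
# Normalize the graph: -U 10 applies up to 10 rounds of normalization
# (which also unchops nodes), so re-chop to <=32 bp nodes with -X 32.
vg mod -U 10 graph.vg | vg mod -X 32 - > graph.norm.vg

# Indexes must be rebuilt on the normalized graph before re-mapping.
vg index -x graph.norm.xg -g graph.norm.gcsa graph.norm.vg
```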


eldariont commented 5 years ago

Thanks Glenn, I included --recall only for the cactus graph because it was the best parameter setting for both graphs. In that way, it is a fair comparison because it reflects the structural difference between the graphs. Running both without --recall would make the results on the cactus graph slightly worse and I'm also a bit hesitant to re-run everything now at this late stage.

I'm normalizing both graphs with `vg mod -X 32` but not with `vg mod -U 10`. Is the missing part the important bit? Changing it would require re-running everything, though, and would probably take some time.

glennhickey commented 5 years ago

It's the `mod -U 10` that does the normalizing (it also unchops the nodes, so in practice it always needs to be followed by `mod -X 32`). Not using it could lower the MAPQ, though there is no way to know without trying. I think pushing the MAPQ comparison on the whole graphs into the supplement and mentioning that normalization could be a factor is fine for now. Though it would be interesting to rerun at some point.

I guess using --recall with cactus and not using it for construct is okay, if these are individually the best parameter sets for each graph. I would expect these results to be fairly robust to minor parameter differences, but --recall does lower the support threshold at which a variant can be called.
