How to use VG when having multiple reference genomes for wild and cultivated species? #1334

vgteam / vg

tools for working with genome variation graphs

https://biostars.org/tag/vg/

Closed WimSpee closed 6 years ago

WimSpee commented 6 years ago

Dear Variant Graph (VG) developers,

Thank you for the software and the recent preprint which I read with great interest. https://www.biorxiv.org/content/early/2017/12/15/234856

One important use case for graph references is plant and animal breeding. In this context, whole genomic segments are often introgressed from wild to cultivated species, so reference bias is a big issue.

In tomato breeding, for example, many genomic segments are introgressed from wild species such as Solanum pennellii into the cultivated tomato, Solanum lycopersicum.

For some background information, see for example:

The genome of the stress-tolerant wild tomato species Solanum pennellii https://www.nature.com/articles/ng.3046

The tomato genome is diploid and, at ca. 1 gigabase, is about one third the size of the human genome.

How would you recommend using VG when one has at least one wild and one cultivated reference genome, plus hundreds to many thousands of whole-genome resequenced samples?

Creating the base reference genome graph from 1 linear reference genome and a multi-sample VCF file seems strange to me. This would just introduce the reference bias to the graph that we are trying to avoid.

Do I understand correctly that:
1) I could use Cactus to create a pangenome reference graph of all my cultivated and wild reference genomes? (https://github.com/glennhickey/progressiveCactus ?)
2) I could then align all resequenced individuals against the pangenome graph using VG?
3) I could then use VG to export a per-sample BAM and one multi-sample VCF file for each input reference genome? (i.e. a projection of the reads and variants onto each input linear reference)

Thank you.

Wim Spee

ekg commented 6 years ago

This seems to me like a good plan, and it is similar to what I am exploring in various contexts, although with a different assembly input. Typically I am using assembly graphs from PacBio or ONT data.

I would add several possible alternative approaches that avoid the potential problems with BAM export. BAM export is provided for cases where the graph is generally partially ordered, and it is likely to require further development before it works in a reasonable way on a generic graph such as one made by Cactus.

First, I would do single-sample calling using vg call, then add the results back to the reference graph and iterate the process. The goal would be to generate a pangenome object that encodes all of the variation and sequence of interest in the entire set. At the end, another round with vg call would generate a per-sample graph for each sample, and we would have minimized bias with respect to the alleles in the pangenome. It is possible (although not well explored) to then convert this into a vector representation, and of course a matrix can be built up from all the samples (vg pack). At this point we would have something very similar to a multisample VCF, but over a graph. This matrix could then be used directly for various analyses, such as PCA or GWAS-like experiments. Retaining coverages in some normalized way would allow the exploration of copy number variation alongside other kinds of variation. This is admittedly extremely exploratory, but I think it's interesting and I'll be working through it in other contexts.
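To make the matrix idea above concrete, here is a minimal sketch in Python of the normalization step only. It assumes per-node read depths have already been extracted for each sample (for example from `vg pack` output; the parsing is not shown and the data structure is hypothetical). The function name `relative_coverage` and the choice of the median non-zero depth as the baseline are illustrative assumptions, not something vg provides.

```python
from statistics import median


def relative_coverage(depth_by_sample):
    """Scale each sample's per-node depths by that sample's median non-zero depth.

    depth_by_sample maps a sample name to a list of read depths, one per
    graph node, with every sample using the same node order (assumption).
    """
    normalized = {}
    for sample, depths in depth_by_sample.items():
        covered = [d for d in depths if d > 0]
        if not covered:
            raise ValueError(f"sample {sample!r} has no covered nodes")
        baseline = median(covered)
        normalized[sample] = [d / baseline for d in depths]
    return normalized


# Hypothetical depths for four graph nodes in two samples.
depths = {
    "sample_a": [10, 10, 20, 0],
    "sample_b": [30, 30, 30, 60],
}
matrix = relative_coverage(depths)
for sample, row in matrix.items():
    print(sample, row)
```

Each row of `matrix` is one sample and each column one graph node, so the result has the samples-by-sites shape of a multi-sample VCF. In this toy data, a value near 2 hints at a node covered at twice the sample's typical depth and a value of 0 at a node the sample does not traverse; real data would need a more careful baseline.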

You can also avoid the BAM step entirely, and project your results out to VCF. That's pretty well supported and avoids the issues with the BAM step.

WimSpee commented 6 years ago

Thank you Erik for your quick answer.

What I meant in my step 2 was to align and variant call all resequenced individuals against the graph. Good to know that this is best done as a two-loop iterative process. I guess the first loop also includes vg augment alongside (or instead of) vg call and vg map?

The BAM output was just so that the alignments can be viewed in IGV. But I understand there will also be native GAM viewers, which could be a suitable alternative.

The reasons I am starting with multiple linear pseudo-chromosome references and not their assembly graphs are:

1) I don't have access (anymore) to all the assembly graphs. Some references are even still short-read based.
2) Even ONT- and PacBio-based assemblies still need further scaffolding, for most organisms we work with, to get the sequence into pseudo-chromosomes, e.g. using genetic maps, Hi-C or BioNano. Some manual curation of the assembly / scaffolding is often also done here.

Good to know that there is the option to output VCF. Is the VCF in linear reference genome space (that of one of the input pseudo-chromosome references, of my choice)?

Being able to look at the variants and genotypes in linear reference pseudo-chromosome space is really important to us.

I guess most people / organizations will (at least for some time) still work in mixed graph and linear pseudo-chromosome reference genome space. Thus good and easy conversions between graph and linear genome space would be really useful.

After the holidays I will try to have a go at VG with our data. Thank you (and the other developers/authors) for the software and all the documentation and information.

mdkeehan commented 6 years ago

I have a similar use-case. I am working in cattle (very similar to humans) and expect to start with contigs generated by SuperNova (10X Chromium) using the "pseudo-hap2" export option. This generates a diploid linear representation encoded in two fasta files - one per chromosome.
My plan had been to:

1) Initialize a graph using the reference and/or any known variants in a VCF
2) For every animal with a set of contigs from a Supernova assembly, use msga to augment the graph
3) Index the combined graph
4) Do the alignments
5) Do multisample calling
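The first four steps of this plan can be sketched as a dry run in Python that only builds command lines and executes nothing. The subcommands (`vg construct`, `vg msga`, `vg index`, `vg map`) are named in or implied by the thread; the specific flags and the k-mer size of 16 are assumptions based on vg's command-line help around the time of this issue and should be checked against `vg <subcommand> --help` for the installed version. All file names and the function `plan_graph_workflow` are hypothetical. The calling step is left out because the thread does not settle how it should be done.

```python
def plan_graph_workflow(reference_fa, known_vcf, assembly_fastas,
                        reads_by_sample, prefix="pangenome"):
    """Return shell command strings for a construct/msga/index/map workflow.

    Nothing is executed; flags are assumptions to verify against vg's help.
    """
    steps = []

    # 1) Base graph from a linear reference plus known variants.
    graph = f"{prefix}.base.vg"
    steps.append(f"vg construct -r {reference_fa} -v {known_vcf} > {graph}")

    # 2) Fold each animal's assembled contigs into the graph, one at a time.
    for i, fasta in enumerate(assembly_fastas, start=1):
        augmented = f"{prefix}.msga{i}.vg"
        steps.append(f"vg msga -g {graph} -f {fasta} > {augmented}")
        graph = augmented

    # 3) Index the combined graph (xg plus GCSA; k=16 is an assumed value).
    steps.append(
        f"vg index -x {prefix}.xg -g {prefix}.gcsa -k 16 {graph}"
    )

    # 4) Align each sample's short reads to the indexed graph.
    for sample, fastq in reads_by_sample.items():
        steps.append(
            f"vg map -x {prefix}.xg -g {prefix}.gcsa -f {fastq} > {sample}.gam"
        )
    return steps


steps = plan_graph_workflow(
    "ref.fa", "known.vcf.gz",
    ["animal1.fa", "animal2.fa"],
    {"s1": "s1.fq"},
)
for command in steps:
    print(command)
```

For a genome of this size, the graph would probably need pruning of complex regions before GCSA indexing, which this sketch omits.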

I expect a total of around 30 animals with SuperNova and say 800 or so with Illumina short read data.

My question is why would you not use the msga option?

I also thank you for the work going into improving population resequencing.

jelber2 commented 6 years ago

@WimSpee I am very curious how you would implement

I could use Cactus to create a pangenome reference graph of all my cultivated and wild reference genomes? (https://github.com/glennhickey/progressiveCactus ? )

Could you elaborate on intermediary steps to achieve this? How would you convert the output of progressiveCactus (presumably a .hal file) into input for vg?

WimSpee commented 6 years ago

@jelber2 I don't know. I also would like to know if there is a more detailed description on how to use the progressiveCactus output in vg.

jelber2 commented 6 years ago

@WimSpee So, one can convert the hal output of progressiveCactus into MAF,

Export a MAF consisting of the alignment of all apes referenced on gorilla (hal/README.md)

     hal2maf mammals.hal mammals.maf --rootGenome ape_ancestor --refGenome gorilla

then convert the MAF into a VCF file for input into vg using snp-sites, and in vg use as the reference the genome you chose as --refGenome.

Edit: snp-sites doesn't handle indels, but msa2vcf does!
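To illustrate why indel support matters when turning an alignment into VCF-style records, here is a toy Python sketch that walks one pair of aligned rows (reference and one other genome, with `-` for gaps) and reports SNPs and indels. It is not how snp-sites or msa2vcf work internally; the function name, the tuple format and the example sequences are all hypothetical, and leading gaps are deliberately not handled.

```python
def aligned_pair_to_variants(ref_row, alt_row):
    """Return (pos, ref, alt) tuples with 1-based reference positions.

    Indels are anchored on the preceding reference base, as in VCF.
    """
    if len(ref_row) != len(alt_row):
        raise ValueError("aligned rows must have equal length")
    variants = []
    ref_pos = 0            # position of the last reference base consumed
    prev_ref_base = None   # anchor base for a following indel
    i, n = 0, len(ref_row)
    while i < n:
        r, a = ref_row[i], alt_row[i]
        if r != "-" and a != "-":
            ref_pos += 1
            if r.upper() != a.upper():
                variants.append((ref_pos, r, a))  # substitution
            prev_ref_base = r
            i += 1
            continue
        # Collect a run of gapped columns into one indel record.
        ref_seg, alt_seg = [], []
        while i < n and (ref_row[i] == "-" or alt_row[i] == "-"):
            if ref_row[i] != "-":
                ref_seg.append(ref_row[i])
            if alt_row[i] != "-":
                alt_seg.append(alt_row[i])
            i += 1
        if prev_ref_base is None:
            raise ValueError("leading gaps are not handled in this sketch")
        if ref_seg or alt_seg:
            variants.append((ref_pos,
                             prev_ref_base + "".join(ref_seg),
                             prev_ref_base + "".join(alt_seg)))
        ref_pos += len(ref_seg)
        if ref_seg:
            prev_ref_base = ref_seg[-1]
    return variants


variants = aligned_pair_to_variants("ACGT-ACGT", "ACCTTAC-T")
print(variants)
```

A SNP-only converter would report just the substitution at position 3 here and silently drop the insertion and the deletion, which is the kind of variation an introgression-focused graph most needs to keep.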

ekg commented 6 years ago

It should be possible to directly convert hal to vg, correct? @glennhickey?

glennhickey commented 6 years ago

Yes, there is a direct converter here:

https://github.com/ComparativeGenomicsToolkit/hal2vg

ac2278 commented 3 years ago

Hi @WimSpee, I have a similar use-case and was wondering if you were able to implement your 2017 plan successfully?