Closed WimSpee closed 6 years ago
This seems to me like a good plan, and it is similar to what I am exploring in various contexts, although with a different assembly input. Typically I am using assembly graphs from PacBio or ONT data.
I would add several possible alternative approaches that avoid the potential problems with BAM export. BAM export is provided for cases where the graph is generally partially ordered, and it is likely to require further development to work reasonably in a generic graph such as one made by cactus.
First, I would do single-sample calling using vg call, then add the results back to the reference graph and iterate the process. The goal would be to generate a pangenome object that encodes all of the variation and sequence of interest in the entire set. At the end, another round with vg call would generate a per-sample graph for each sample, and we would have minimized bias with respect to the alleles in the pangenome. It is possible (although not well explored) to then convert this into a vector representation, and of course a matrix can be built up from all the samples (vg pack). At this point we would have something very similar to a multisample VCF, but over a graph. This matrix could then be used directly for various analyses, such as PCA or GWAS-like experiments. Retaining coverages in some normalized way would allow the exploration of copy number variation alongside other kinds of variation. This is admittedly extremely exploratory, but I think it's interesting and I'll be working through it in other contexts.
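The sample-by-node matrix idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not something vg does for you: it assumes per-node mean read depth has already been extracted for each sample by some means and loaded into a dict, and the function names, the median-based normalization, and the diploid copy-number estimate are all hypothetical choices made for the example.

```python
from statistics import median

def build_coverage_matrix(sample_cov):
    """Return (samples, nodes, matrix) with rows = samples, columns = graph nodes.

    sample_cov is a hypothetical input: {sample_name: {node_id: mean_depth}}.
    Nodes absent from a sample are treated as zero coverage.
    """
    samples = sorted(sample_cov)
    nodes = sorted({n for cov in sample_cov.values() for n in cov})
    matrix = [[float(sample_cov[s].get(n, 0.0)) for n in nodes] for s in samples]
    return samples, nodes, matrix

def normalize_rows(matrix):
    """Divide each sample's row by its median non-zero depth (an assumed normalization)."""
    out = []
    for row in matrix:
        nonzero = [x for x in row if x > 0]
        scale = median(nonzero) if nonzero else 1.0
        out.append([x / scale for x in row])
    return out

def copy_number(normalized, ploidy=2):
    """Crude per-node copy-number estimate, assuming median depth equals `ploidy` copies."""
    return [[round(x * ploidy) for x in row] for row in normalized]

# Toy data: two samples over four graph nodes (made-up depths).
coverage = {
    "sampleA": {1: 30, 2: 30, 3: 0, 4: 60},
    "sampleB": {1: 20, 2: 10, 3: 20, 4: 20},
}
samples, nodes, matrix = build_coverage_matrix(coverage)
normalized = normalize_rows(matrix)
cn = copy_number(normalized)
```

Each row of the normalized matrix is one sample's vector over the graph, so the whole matrix could be handed to a PCA or association routine of your choice. Keeping the depths normalized rather than thresholded to presence/absence is what leaves copy number visible.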
You can also avoid the BAM step entirely, and project your results out to VCF. That's pretty well supported and avoids the issues with the BAM step.
Thank you Erik for your quick answer.
What I meant in my step 2 is to align + variant call all re-sequenced individuals against the graph. Good to know that this is best done as a two-loop iterative process. I guess this also includes vg augment next to (or instead of) vg call and vg map in the first loop?
The BAM output was just so that the alignments can be viewed in IGV. But I understand there will also be native GAM viewers, which could be a suitable alternative.
The reason I am starting with multiple linear pseudo chromosome references and not their assembly graphs is:
Good to know that there is the option to output VCF. The VCF is in linear reference genome space? (of one of the input pseudo-chromosome references of choice?)
Being able to look at the variants and genotypes in linear reference pseudo-chromosome space is really important to us.
I guess most people / organizations will (at least for some time) still work in mixed graph and linear pseudo-chromosome reference genome space. Thus good and easy conversions between graph and linear genome space would be really useful.
After the holidays I will try to have a go at VG with our data. Thank you (and the other developers/authors) for the software and all the documentation and information.
I have a similar use-case. I am working in cattle (very similar to humans) and expect to start with contigs generated by SuperNova (10X Chromium) using the "pseudo-hap2" export option. This generates a diploid linear representation encoded in two fasta files - one per chromosome.
My plan had been to
I expect a total of around 30 animals with SuperNova and say 800 or so with Illumina short read data.
My question is why would you not use the msga option?
I also thank you for the work going into improving population resequencing.
@WimSpee I am very curious how you would implement
- I could use Cactus to create a pangenome reference graph of all my cultivated and wild reference genomes? (https://github.com/glennhickey/progressiveCactus ? )
Could you elaborate on intermediary steps to achieve this? How would you convert the output of progressiveCactus (presumably a .hal file) into input for vg?
@jelber2 I don't know. I also would like to know if there is a more detailed description on how to use the progressiveCactus output in vg.
@WimSpee So, one can convert the hal output of progressiveCactus into MAF:
Export a MAF consisting of the alignment of all apes referenced on gorilla (hal/README.md)
hal2maf mammals.hal mammals.maf --rootGenome ape_ancestor --refGenome gorilla
then convert the MAF into a VCF file for input into vg using snps-sites, and use the reference you chose as the --refGenome for vg
Edit: snps-sites doesn't allow for indels, but msa2vcf does!
It should be possible to directly convert hal to vg, correct? @glennhickey?
Yes, there is a direct converter here:
https://github.com/ComparativeGenomicsToolkit/hal2vg
Hi @WimSpee, I have a similar use-case and was wondering if you were able to implement your 2017 plan successfully?
Dear Variant Graph (VG) developers,
Thank you for the software and the recent preprint which I read with great interest. https://www.biorxiv.org/content/early/2017/12/15/234856
One important use case for graph references is plant and animal breeding. In this context whole genomic segments are often introgressed from wild to cultivated species. Reference bias is thus a big issue.
In tomato breeding, for example, many genomic segments are introgressed from wild species like Solanum pennellii into the cultivated tomato species Solanum lycopersicum.
See for example for some background information:
The genome of the stress-tolerant wild tomato species Solanum pennellii https://www.nature.com/articles/ng.3046
The tomato genome is diploid and is ca. 1/3 the size of the human genome, ca. 1 gigabase.
How would you recommend that VG can be used in the case where one has at least 1 wild and 1 cultivated reference genome, and hundreds to many thousands of whole genome re-sequenced samples?
Creating the base reference genome graph from 1 linear reference genome and a multi-sample VCF file seems strange to me. This would just introduce the reference bias to the graph that we are trying to avoid.
Do I understand correctly that:
1) I could use Cactus to create a pangenome reference graph of all my cultivated and wild reference genomes? (https://github.com/glennhickey/progressiveCactus ?)
2) I could then align all re-sequenced individuals against the pangenome graph using VG?
3) I could then export per-sample BAM and 1 multi-sample VCF file for each input reference genome using VG? (i.e. a projection of the reads and variants against each input linear reference)
Thank you.
Wim Spee