neoclassical bacterial pangenomics

ekg commented 6 years ago

In https://github.com/ekg/freebayes/issues/419 @tseemann asked if vg was ready for bacterial pangenomics. My response better fits here so I've copied it over:

@tseemann you already can use vg on large pools of bacteria, although admittedly we haven't tested as far as you might like to try. Several people, myself included, have built assembly graphs from collections of bacterial genomes and then used vg to align the read sets back to this assembly graph. By considering the coverages of alignment sets you get a kind of colored string graph, where each base and each edge (rather than each kmer) has a weight for each sample that you're considering.

So for instance you can make a complex pangenome graph using a compacted deBruijn graph assembler like minia:

pan_minia_k51_m3 contigs norm vg gfa 1

Then align the samples back to it independently, yielding graph alignments. Then using tools in vg you can extract the coverage of each sample across the linearized graph, yielding something like this:

bacteria cov 3000000-3020000 area_stack

@iqbal-lab have two projects which tackle different aspects of this. In gramtools you can infer a likely haplotype from a pangenome and align your data against that--- it's kind've a different beast than vg, in that vg treats the graph as the reference object at all points of analysis. They also have some really cool work on generating bacterial pangenome databases. My impression is that this is oriented more towards efficient detection / presence / classification questions than vg, which is oriented towards base-level evaluation of variation and sequence in order-of-magnitude smaller sets of genomes. I would be happy to be corrected if I'm wrong about this assessment.

tseemann commented 6 years ago

Thanks very much for this cross-post!

A standard task in bacterial comparitive genomics is to build a multi-sequence alignment (MSA) of all the "core" SNP locations. "Core" means present in ALL the samples and NOT monomorphic (same nucletoide in all samples).

This seems like a good use case for vg ?

Call all SNP variants (indels are ignored) on paths supported by all samples?

apredeus commented 5 years ago

Very nice to find this discussion. I was thinking about using vg to identify pseudogenes in cases when there are many finished sequenced genomes.

hdore commented 2 years ago

Hi all,

With a similar idea as @tseemann in mind, I'm wondering if vg call can accurately call variants for (haploid) bacterial genomes from metagenomics datasets. In issue #3369 @adamnovak explains that "vg call is I believe still just a single-sample caller (it tries to decide on a genotype for each variant in "the sample", and won't like variants that look like they have unbalanced coverage or more than 2 alleles present)". In the case of metagenomics reads mapped to a bacterial genome graph, the coverage will likely be considered as "unbalanced" (the alleles can have any frequency, and there can be more than 2 alleles present). Should we thus consider that vg call cannot be used to call variants from metagenomics data? I'd be curious to have your thoughts!

Thanks, Hugo

adamnovak commented 2 years ago

I don't think vg call is going to work well for metagenomic calling; it would need a new mode where we turn off some of the diploid heuristics, or we would need a new tool.

I think @glennhickey knows the calling pipeline better than me, though.

glennhickey commented 2 years ago

Yeah, vg call expects a single sample with ploidy (specified with -d) 1 or 2.

ekg commented 2 years ago

The cleanest approach is coverage across the graph. The "pangenome matrix", but this needs to be subdivided by experiment, context, or some kind of species or OTU assignments to be useful. I'm not sure you need to do full on variant calling to get mileage out of this approach.

On Thu, Oct 21, 2021, 21:41 Glenn Hickey @.***> wrote:

Yeah, vg call expects a single sample with ploidy (specified with -d) 1 or 2.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1589#issuecomment-948946368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEKHPLLY5VA4PI6POU3UIBUH3ANCNFSM4EYJVPOA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

hdore commented 2 years ago

Hi all, Thank you for your advice, I think getting the coverage of "variants" might indeed be a good way to accomplish what I want. I'm still learning how to navigate vg tools and getting use to think in a graph space, but I can't figure how to get this coverage. I have made a graph from multiple genomes using pggb and then converted it to a vg object. I managed to index this graph and map long reads to it with vg map. But I'm still struggling to get the coverage of "variants". I think it might be done internally by vg call but I don't see a way to output it?

I would like to extract structural variants from the graph, and then get the coverage of these variants in a number of samples that are in different gam files.

In my mind, these structural variants are snarl traversals. I did not find a way to get the coverage of snarl traversals. vg stats -R seems to give the depth of each snarl, but I'm not sure how to read its output and it does not seem to give the coverage of each traversal of a given snarl. Another solution might be to consider these structural variants as paths in the graph, and then to get there coverage with vg paths -c . But here again I don't know how to get from snarlTraversals to paths.

Also, since I'm interested in rather large variants (let's say > 100 bp), is there a way to filter out small snarls?

Please let me know if my reasoning does not make sense, or if I'm missing some easy way to reach my goal.

Thank you for your help,

Hugo

hdore commented 2 years ago

Hi @ekg @adamnovak @glennhickey ,

Would you have any advice on the case I developed above?

Thanks a lot for your help!

Hugo

glennhickey commented 2 years ago

@hdore The only way I can think of to get the coverage of a snarl traversal is the AD field in the VCF output of vg call. If you're genotyping multiple samples, you may want to use -a in vg call to make sure it outputs reference-only sites.

You can also filter by snarl size in vg call (based on ref allele length) using -C and -c. I think these options are in the master branch only and will be in the next release.

In the next release, vg call will also output the snarl traversal (as a sequence of nodes) for each allele in the AT field (like deconstruct does). I just made this change in #3497

hdore commented 2 years ago

Hi @glennhickey, Thank you for your quick reply. Unfortunately in my case I can't use vg call because I'm using bacterial metagenomics sample, where there is no ploidy (the genomes are haploid, but the samples are a mix of strains). That's good to know that vg call will soon output snarl traversals. I wish it could work to only get allele coverage without trying to genotype, so without any assumption on ploidy.

Thanks again! Hugo

glennhickey commented 2 years ago

I see. vg call can output traversals without trying to genotype with the -T option. I updated PR #3497 to add read support to this output.

hdore commented 2 years ago

Awesome, thank you!

vgteam / vg

neoclassical bacterial pangenomics #1589