vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.11k stars 194 forks source link

How to run vg using SV calls? #784

Closed hanzou666 closed 7 years ago

hanzou666 commented 7 years ago

I’m trying to use vg for metagenome analysis. As many assembled fragments have attached to the reference sequence with only in one end (and other terminal of the sequence has not determined), so I created the VCF file which indicates the situation with ']' notation. But vg didn’t accept this notation. I heard vg could accept the format that the 1000 Genomes used in the SV calls instead of VCF file.

Could you tell me how to run vg using SV calls?

edawson commented 7 years ago

Hi @ShuhoOhwada ,

Right now vg can't handle the breakend/breakpoint notation POS\]. Could you post a sample call (or file) here so I can see what your VCF looks like?

It sounds like you're describing is a core reference 'backbone' sequence with 'ribs' or 'tips' of other genomes hanging off. Does this sound correct?

This structure is better described in GFA format than VCF - can you coerce your data to GFA? A lot of assemblers support GFA output directly - you could just assemble and put your assembly graph straight into vg. Quite a few people are trying this, and I'm continuously trying to make sure it remains a functional pipeline.

Most VCF breakend notations are kind of useless on the graph - there's no information about what sequence is attached at that breakend or where the other end might be. This wouldn't translate into a useful graph structure. Hence, vg supports the \, \, and \ SV tags and plans to support \ in the future. I don't think we'll be able (or want) to support the \ or POS] style tags, unless we can figure out a mapping to the graph that makes sense.

hanzou666 commented 7 years ago

Hi @edawson,

It sounds like you're describing is a core reference 'backbone' sequence with 'ribs' or 'tips' of other genomes hanging off. Does this sound correct?

You are quite right.

In case that I have the reference sequences and contigs that are aligned as follows:

ref:    ATTTGCTCACGGATCCCGTCATGGGCTTT
ctg:        GCTCGATTGCA

I want to construct the graph that branches off after the eighth base on reference C.

When I write this in the format that vg construct can use, fasta file is:

>ref
ATTTGCTCACGGATCCCGT
>ctg1
GATTGCA

and VCF file is:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
ref 8   .   C   C[ctg1:1[   6   PASS    SVTYPE=BND  GT
ctg1    1   .   G   ]ref:8]G    6   PASS    SVTYPE=BND  GT

I tried to convert this to VCF file, but I wasn't able to do because there are few examples in the GFA document.

My question is:

  1. Could you tell me how to create GFA file that expresses <BND> and <INV> in VCF?
  2. How do I create vg file using GFA file as input?
adamnovak commented 7 years ago

Unfortunately, you can't make a graph with multiple dangling tips with "vg construct".

I'm not sure what you mean with 1.; I don't know of a breakend-VCF to GFA converter. For 2. you would do vg view -Fv whatever.gfa >whatever.vg to import a GFA file.

edawson commented 7 years ago

I would suggest trying to express your graph in GFA and then reading that in to vg. This should get around construct's restrictions. As to how to go from breakends to GFA, I'm just not sure. None of the SV work I've done has handled breakend format.

You could generate a graph (from non-breakend VCF) using construct and then remove the trailing edges out of the variant bubbles. You'd need the -I <insertion_fasta> (capital i) flag to construct, with your insertion_fasta containing all of the tip sequences with the flanking matches removed. This is not the cleanest way to do things, and you'd spend a lot of time in python or awk, but it might work.

You could try assembling all of your anchored tips using vg msga, but I think this would create a big mess.

hanzou666 commented 7 years ago

Thank you for your suggestions. I try to write scripts for GFA output.

And I have two more questions. After graph construction, I want to map metagenomic reads to reference graph and calculate coverages for each node. The problems are mapping speed and how to handle gam file.

So, How can I speed up the mapping? Does the relation of k-mer of vg index -g index.gcsa -k 11 graph.vg be related? Or is mapping to graph essentially slow?

Is there a vg command to calculate the coverage? Otherwise, are there detailed documentation of the gam file?