vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

Guidance for SV calling following giraffe and/or pggb #3369

Open clairemerot opened 3 years ago

clairemerot commented 3 years ago

Hello, I am trying to use a variation genome graph to genotype SVs on more samples sequenced with short-reads, and I'm trying different avenues (see 1)graph from vcf + giraffe and 2) graph with pggb+ ?), all facing little issues. I was wondering whether someone here would have helpful insights? Thanks a lot for your attention, thanks for this great tool and for the support found here! Claire

1) I'm building a graph with a reference genome and an explicit vcf of SVs detected by long-reads and assembly comparisons -> then I align with giraffe (apparently good and fast) -> then I'm trying to call SVs (error). It works well until the calling in which it does not have the proper format for the graph. Or should I build the graph into .xg aside from the autoindex? will they be the same? Here is the code I used. (ref is the reference genome in .fasta, vcf_file is the vcf of SV, fq1 and fq2 are paired fastq

autoindex

vg autoindex --workflow giraffe \ --prefix 09_vgiraffe/vcf_graph/vcf_graph_giraffe \ --tmp-dir 09_vgiraffe/vcf_graph/TMPDIR \ --target-mem 190G --threads $NB_CPU \ --ref-fasta $ref \ --vcf $vcf_file

alignment of 1 sample

vg giraffe -t $NB_CPU \ -H 09_vgiraffe/vcf_graph/vcf_graph_giraffe.giraffe.gbwt \ -g 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg \ -m 09_vgiraffe/vcf_graph/vcf_graph_giraffe.min \ -d 09_vgiraffe/vcf_graph/vcf_graph_giraffe.dist \ -f $fq1 -f $fq2 -N $id > 09_vgiraffe/vcf_graph/alignment/"$id"_paired.gam

vg stats -a 09_vgiraffe/vcf_graph/alignment/"$id"_paired.gam

calling

vg pack -t $NB_CPU -Q 5 \ -x 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg \ -g 09_vgiraffe/vcf_graph/alignment/"$id"_paired.gam -o 09_vgiraffe/vcf_graph/alignment/"$id".pack

vg call -t $NB_CPU \ -v $vcf_file -k 09_vgiraffe/vcf_graph/alignment/"$id".pack \ 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg > 09_vgiraffe/vcf_graph/variants/"$id"_calls.vcf

error

error[VPKG::load_one]: Correct input type not found in 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg while loading handlegraph::HandleGraph error[VPKG::load_one]: Correct input type not found in 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg while loading handlegraph::PathHandleGraph

2) I'd like to build a graph in pggb to follow the advice provided here https://github.com/vgteam/vg/issues/3194 Instead of building the graph with a reference & a vcf, the graph is built in PGGB with two full assemblies and pieces of sequences for the SV (for example the insertion + 50kb of the reference around it). This seems very promising. We are building graph chromosomes by chromosomes with PGGB in gfa. It works well. However, I don't know how to combine my 40 GFA (one per chromosome)into a single graph before auto-indexing? Next, even if we manage to make the autoindexing on the gfa, and to align with giraffe, what is the procedure for vg call? (I can no longer provide a vcf of variants??? or should I try the "deconstruct"?)

Here is the code I am using for pggb & giraffe

PGGB

INPUT_FASTA=04_fasta/all_seq_Chr10.fasta SEG_LENGTH=25000 OUT_DIR=06_pggb/genome_graph samtools faidx $INPUT_FASTA

Building the graph

pggb -i $INPUT_FASTA \ -s $SEG_LENGTH -n 1 \ -p 95 \ -k 200 \ -N \ -t $NB_CPU \ -o $OUT_DIR

giraffe autoindexing for one chromosome

how to merge chromosome in one graph?? which .gfa use from pggb?

gfa_graph=06_pggb/genome_graph/all_seq_Chr10.fasta.7405a83.d0e96a4.891e76b.smooth.gfa

vg autoindex --workflow giraffe \ --prefix 09_vgiraffe/gfa_graph/gfa_graph_giraffe \ --tmp-dir 09_vgiraffe/gfa_graph/TMPDIR \ --target-mem 190G --threads $NB_CPU \ --gfa $gfa_graph

giraffe alignement->ok

calling: how to do??

Many many thanks for any suggestions or advice!

adamnovak commented 3 years ago

Sorry, autoindex doesn't expect you to actually do anything with your mapped reads. The GBWT and GG together do define a graph, but none of the calling tools know how to read it, nor do we have (as far as I know) a converter that can convert it. Moreover, the GBWT and GG don't (yet) encode the reference paths from the FASTA, so in the final graph there's no way to find the path that the output VCF would need to be referenced against.

If you add -R XG to your autoindex command, it will also produce the graph in XG format, which you can use as the reference for downstream calling. If that produces weird errors downstream about missing IDs, you might need to make sure to scrap the other output files and rerun from the top; I don't know if @jeizenga made it sufficiently deterministic that it would be able to recreate exactly the same graph with exactly the same IDs from the input FASTA and VCF.

@jeizenga, shouldn't we define some new autoindex workflows, or adjust the existing workflows, so that we generate the files you will need to do calling on the resulting mapped reads? Or should we just leave it like this until we finish the GBZ bundling of GBWT and GG and the reference path embedding machinery there? Because the GBZ bundling is going to take a while and the lack of calling indexes is tripping up the users.

jeizenga commented 3 years ago

I've been thinking about expanding the set of workflows a bit already. I think it makes sense to add a surject workflow, and probably also to split out rpvg from mpmap. The way some of the indexes are co-created will make it kind of challenging without reactivating the simplifications code (which I de-activated during the major refactor after realizing how little info was actually contained in the raw NodeMapping).

I also think we might need to define a new "Giraffe XG" index to account for the mismatch in the chopped node IDs between the gbwtgraph and vg repos. As it stands, I'm not sure what the best way is to get an XG with labeled reference paths whose node IDs match a GBZ. Maybe we need the path interface in GBZ first?

glennhickey commented 3 years ago

You can make an xg with vg convert:

vg convert 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg -b  09_vgiraffe/vcf_graph/vcf_graph_giraffe.giraffe.gbwt -x > 09_vgiraffe/vcf_graph/vcf_graph_giraffe.giraffe.xg

Then send that into pack/call.

This is definitely a workflow we need to work on simplifying / documenting.

adamnovak commented 3 years ago

@clairemerot, as for passing the VCF file through to call, you don't want to do that unless you want to genotype on exactly that VCF file's set of variants. If you leave it out, call will operate on the variants as represented in the graph. If you do pass it, you can't use the XG as produced by autoindex; you'd have to do -R "VG w/ Variant Paths" and use the .vg file that leaves behind. You probably should just use the variants as embedded in the graph, which will have been normalized versus how the input VCF spelled them.

Since you don't actually need a VCF for the calling stage, you can call with the output of Giraffe on the auto-indexed GFA graph.

To get all the GFAs into one GFA, you can use some external GFA manipulation tool (possibly cat, or cat with a grep to remove the extra header lines, if your individual graphs don't re-use node numbers between them). You could also use vg combine, which will read in any number of graphs, re-assign their IDs so they don't conflict, and spit them out in a single file in a VG format.

@glennhickey Isn't the output from vg convert like you show going to be missing the reference path?

glennhickey commented 3 years ago

vg combine can indeed combine gfa files, and it will output gfa by default for such input.

the vg convert command I posted should preserve the reference path (which is in the gbwt with a special name that vg convert looks for).

clairemerot commented 3 years ago

Thanks a lot to all of you for the answers and advice. If I understood correctly, this is much simpler than I felt! Thanks for the guidance For analysis 1, I have tried the two options. Is there one that I should favour? 1A) add the argument -R XG (which seems to work without warning)

vg autoindex --workflow giraffe \
--prefix 09_vgiraffe/vcf_graph/vcf_graph_giraffe \
--tmp-dir 09_vgiraffe/vcf_graph/TMPDIR \
--target-mem 190G --threads 8 \
-R XG \
--ref-fasta $ref \
--vcf $vcf_file 

1B) vg convert (which gives the following warning: warning [vg convert]: no threads for reference sample _gbwt_ref in the GBWT index)

vg convert 09_vgiraffe/vcf_graph/vcf_graph_giraffe.gg \
-b  09_vgiraffe/vcf_graph/vcf_graph_giraffe.giraffe.gbwt \
-x > 09_vgiraffe/vcf_graph/vcf_graph_giraffe_converted.xg

I am now trying the vg pack and vg call with those two .xg (and without vcf)

For analysis 2, I am trying the "vg combine" with the different gfa. Everything else should be similar in the downstream commands!

Thanks a lot, I'll keep in touch!

glennhickey commented 3 years ago

Uh oh, looks like @adamnovak was right about the reference path not being there (you can verify with vg paths -Lv <xg file>). Without it, you won't be able to make a VCF with call.

clairemerot commented 3 years ago

True, vg paths -Lv 09_vgiraffe/vcf_graph/vcf_graph_giraffe_converted.xg (produced by vg convert) gives me nothing while vg paths -Lv 09_vgiraffe/vcf_graph/vcf_graph_giraffe.xg (produced with -R XG argument) gives me the list of Chromosomes in the reference. So if this is fine to use the .xg produced with that option, I'll keep working with it. Thanks, I had totally missed this option on my first look.

clairemerot commented 3 years ago

As a feedback the pipeline (vg autoindex --workflow giraffe -R XG ; vg giraffe; vg pack; vg call ), is giving very promising results. I am now expanding to more samples. Would you recommend to do the vg call with all samples together, or one by one?

I am parellelising vg giraffe and vg pack by samples but I feel that the "call" may benefit for more read support if there is shared polymorphism? Is vg call able to handle multiple samples?

Thanks a lot for all your advice, that helped me a lot! And Giraffe is so fast, I'm impressed!

Claire

adamnovak commented 3 years ago

vg call is I believe still just a single-sample caller (it tries to decide on a genotype for each variant in "the sample", and won't like variants that look like they have unbalanced coverage or more than 2 alleles present), so each sample would need its own call run.

clairemerot commented 2 years ago

Hi, I don't know why I can't see the thread on github but the short answer is that you define snarls and you ask it to genotype all variants if it can. That means you will have 0/0 when it is ref/ref and ./. when it is missing information. Is that correct vgTeam?

I use

vg snarls -t $NB_CPU $graph > $snarls vg call -t $NB_CPU -a -k 06_alignment/"$id".pack -r $snarls -f $ref $graph

07_variants/"$id"_calls.vcf

Good luck Claire

Le mar. 29 mars 2022 à 11:46, MehmetGoktay @.***> a écrit :

Hi Adam and Claire,

Thank you very much for this very useful thread!

But I am still not sure how to expand it to more samples?

If I use the same pipeline individually for each sample and then what I am supposed to do for joined genotyping?

Because if I simply merge all individual VCF files at the end there will be no difference between 0/0 and ./. for the samples which simply does not contain the polymorphism that other samples do.

Best, Mehmet

— Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/3369#issuecomment-1081658961, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGVYD7SG3SCHA7DHIBMSXBLVCLGOBANCNFSM5BWFTKRQ . You are receiving this because you were mentioned.Message ID: @.***>

adamnovak commented 2 years ago

I think if you want to genotype a set of samples on a consistent set of variants, you use vg pack and vg call on one graph with that set of variants. If you want new candidate SNPs and small indels from the reads to go into the graph, you use vg augment.

You could try augmenting one graph with a bunch of samples (or like a big concatenated GAM file?) and then calling each sample from that. But I don't think we've ever tried it.

Chaima-Bouchenak commented 2 years ago

@adamnovak is a ``vg call still a just single-sample caller ?

i'm working on building a pan-genome for crop breeding as follow: 1 - build a primary graph with my reference 2 - map sample 1 shors reads to it , and get variations 3 - Augmentation of the primary graph with the variations (gam) 4 - use the last graph to map sample 2 reads and get variations 5 - go on with the other samples

i was wondering if i can get an "updated" VCF with +samples each time (which is not the case, but i thought it's due to some missing step in my worflow)

adamnovak commented 2 years ago

@Chaima-Bouchenak We haven't yet added multi-sample calling to vg call, so I don't think you can do exactly what you are describing.

You could iteratively call samples and add them to your graph, and then single-sample map and call all of your samples against the final graph, and you would get some benefit from having the final single-sample calls be done against one fixed collection of variant sites, in terms of making it simpler to merge your final single-sample VCFs.

We might be missing some pieces that would be useful along the loop, though. I think right now you'd have to go graph -> map reads -> vg augment -> vg pack -> vg call -> single sample VCF -> merged VCF of all samples so far -> graph. When you do vg augment you add in all the candidate variants as seen in individual read alignments; a lot of them are actually going to be errors. You probably don't want all the plausible-looking errors from all the samples to end up in your final graph, just the things that actually can be called in at least one sample. So you then wouldn't be able to use vg augment to keep augmenting the graph, but instead you'd have to go back to VCF every time and then re-construct the graph. (Or else use the deprecated vg add.)

Chaima-Bouchenak commented 2 years ago

I'll rather play it safe and avoid vg add for the moment. @adamnovak please let me reformulate your suggestions to make sure that I fully understood :

1 - Make a collection of variant sites in One final graph after mapping all my samples and augmenting the graph each time. Then use each .gamfile ( or should I remap on the final graph?!) to call variants then use bcftools merge to get them all in One VCF. ==>[the + we get a "smooth" multi-sample VCF, the - it may contain errors]

2 - Rebuild the graph each time with the merged VCFs of previous samples (Map, augmente, call with each sample, rebuild graph etc). Finally, merge all the VCFs. ==>[the + avoid errors by building a graph each time and tracking the new variants, the - not having a consensus final VCF as it was merged according to != graphs? ]

I tried to look for a way to store the information from variant sites of each sample-reads to the sample before mapping the next one, is it possible? As I'm also interested in keeping the information of which variants sites result from which sample. not sure if that's the purpose of vg map -N _for --reads input, add this sample_ or vg surejct $sample.gam -N or i can simply

vg construct -r ref.fa -v vg.$sample.vcf -a > graph.vg 
vg paths -L graph.vg | sed ... > $samplepaths.gaf
vg paths -Q _alt_ -d -v graph.vg | vg augment -B -F $samplepathats.gaf > newgraph.vg
smriti-135 commented 1 month ago

Hello. I am facing a similar warning while trying to construct pangenome and call SVs. I am using short read sequences. I am following the pipeline

vg autoindex --> giraffe map --> pack --> snarls --> vg call for SV

Like so

fq="myfastq_trimmed"
prefix="pangenome_graph"
VG=$(which vg)

$VG autoindex --workflow giraffe \
    --prefix "$prefix" \
    --ref-fasta "$fa" \
    --vcf "$vcf" \
    --threads 4 \
    -R XG 

while read sample; do
    echo "Processing sample $sample..."
    $VG giraffe -Z "${prefix}.giraffe.gbz" -m "${prefix}.min" -d "${prefix}.dist" \
        -f "${fq}/SO_11111_${sample}_R1_paired_trimmed.fq.gz" -f "${fq}/SO_11111_${sample}_R2_paired_trimmed.fq.gz" \
        -t 4 -N "$sample" > "mygam/${sample}.gam"
done < sample.lst

The warning I am getting is

Guessing that pangenome_graph.xg is XG 
warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 65 bp against 39878 bp dagified subgraph for read A00609:581:H7KGJDSX5:1:1101:20392:5149 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.

The output GAM is being generated. But I don't know if it is going to affect the quality of my output.

Should I change anything? Or use xg instead like given in this thread (but my pipeline works fine for lesser samples, and I was told autoindex pipeline is latest, and "XG graph should be unnecessary these days" )? If not, how to address this?

As a side note. I hope it is not related to this warning I got when running autoindex but could not solve (but all indeces got created without errors)

Warning: insertion END and POS do not agree (complex insertions not canonicalizeable) [canonicalize] Scaffold_Un696 11586 INS00360992 T <INS> 1000
PASS
PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv0.9.1;END=11587;SVLEN=27;PE=0;MAPQ=0;CT=NtoN;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=60;INSLEN=27;HOMLEN=7;SR=20;SRQ=1;CONSENSUS=AATTAATTGCCCATTAATACGATTCTTGAATTCCAATAACAAAAACGAAAATTTTAAATTAAAAATTTAGGTTTGTGCTCTCTCGCTCTCATTAAAAATTTAGGTTTGTGCTCTCTCTCTCTCTCTCTCTCTCTCTCCCCACCCTAAAAGAAAATTAACATAAATAAATTCTGAAGCGTATTTAA;CE=1.84967;SPAN=27 END: 11587 POS: 11586
adamnovak commented 1 month ago

@smriti-135 the warning you got about a too-large rescue alignment for read A00609:581:H7KGJDSX5:1:1101:20392:5149 means that read A00609:581:H7KGJDSX5:1:1101:20392:5149 is more likely to be mismapped than usual, because we weren't able to pursue one of the possible alignment positions, because the program thought it wouldn't be worth investing the time. Too much of the graph was plausibly close to the mapping of one read in the pair and eligible to be considered to map the other read, so we didn't bother mapping the other read based on the first read's position.

The only way to really make it go away is to change the graph to make the region it is considering aligning to less complex (or smaller), and there are good reasons to not try to do that. I think we have the message mostly to catch the attention of someone who is hand-tuning the alignment parameters or working on the algorithm, or who might be about to really depend on the alignment of one of these reads for some reason. It's normal to get a few of these warnings on a given run. (So maybe we should demote them from warnings to a lower level of severity.)

If you see this a lot, you should make sure your graph has the basic stick-with-some-variants-and-loops structure that Giraffe is designed for, and it might also be worth checking to make sure that your paired end insert sizes aren't weirdly long.

The autoindex warning is unrelated; it is there to tell you that we didn't really understand that variant, because the values in the POS column and the END and SVLEN fields don't make sense together. (Also we can't interpret <INS> because it doesn't give the inserted sequence.) So that variant would have been skipped, but that by itself wouldn't have made the graph too complex to do paired-end rescue against in some regions.