vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

Some questions about the <vg call> commond output #4259

Open Kled111 opened 3 months ago

Kled111 commented 3 months ago

When I successfully use commond "vg giraffe -Z a.gbz -m a.min -d a.dist -f sim.fq >mapped.gam" I follow the https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling and use vg augment -> vg snarls ->vg pack -> vg call. And I get a "whole-genome-calls.vcf " as the VG wiki said. What is the vg call commond “-S” ?

The output vcf file is like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR15368033

chr1 292 >73711>73698 T TAACCCTAACCCT 1702.3 PASS AT=>73711>73698,>73711<73710<73709<73708<73706<73703<73702<73701<73699>73698;DP=98 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:98:12,86:-185.838884,-16.991398,-16.143265:2:-2.735818:118.137253:12

The ID field here, what does it mean, and how can I locate the specific chromosomal position?

adamnovak commented 2 months ago

We generate the IDs by using the node numbers for the bounding nodes around the variable site in vg's graph, and giving their orientations using > and < characters. So >73711>73698 is the variable site between the end of node 73711 (read left to right) and the start of node 73698 (read left to right). We call a site like this a "snarl".

I'm not sure what you mean about locating the specific chromosomal position. This is VCF on a linear reference, so each variant has its chromosomal position in the CHROM and POS fields of the VCF record.

The vg call option -S is documented as:

-S, --ref-sample NAME   Call on all paths with given sample name (cannot be used with -p)

This is used to specify which linear reference in the graph you want the output VCF in terms of. If you have a graph that has the GRCh37, GRCh38, and CHM13 references all in it, you can use -S CHM13 to get calls against the CHM13 reference. Internally we treat named reference assemblies and named non-reference sample assemblies in a graph both as "samples" containing multiple contigs.

Kled111 commented 2 months ago

We generate the IDs by using the node numbers for the bounding nodes around the variable site in vg's graph, and giving their orientations using > and < characters. So >73711>73698 is the variable site between the end of node 73711 (read left to right) and the start of node 73698 (read left to right). We call a site like this a "snarl".

I'm not sure what you mean about locating the specific chromosomal position. This is VCF on a linear reference, so each variant has its chromosomal position in the CHROM and POS fields of the VCF record.

The vg call option -S is documented as:

-S, --ref-sample NAME   Call on all paths with given sample name (cannot be used with -p)

This is used to specify which linear reference in the graph you want the output VCF in terms of. If you have a graph that has the GRCh37, GRCh38, and CHM13 references all in it, you can use -S CHM13 to get calls against the CHM13 reference. Internally we treat named reference assemblies and named non-reference sample assemblies in a graph both as "samples" containing multiple contigs.

Thank you for your response. The content previously shown represents the results obtained using -S CHM13. However,when I find the specific sequence of CHM13 on https://genome.ucsc.edu/, it was found that the content in the ALT field corresponds to the sequence of CHM13. Is this perhaps not entirely accurate?

adamnovak commented 2 months ago

the content in the ALT field corresponds to the sequence of CHM13

That shouldn't happen; if you call against CHM13, the REF field should have the sequence that is in CHM13, and the ALT field should have different sequence(s).

If you have a variant where the ALT field has the sequence that's in CHM13 on the browser, what sequence is in the REF field?

Also, are you sure you are using the same CHM13 in your graph as is in the browser? The browser is on the 2.0 version, but there were 1.0 and 1.1. releases.

adamnovak commented 2 months ago

Maybe you're looking at a tandem insertion?

If you look at your variant:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR15368033
chr1 292 >73711>73698 T TAACCCTAACCCT 1702.3 PASS AT=>73711>73698,>73711<73710<73709<73708<73706<73703<73702<73701<73699>73698;DP=98 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:98:12,86:-185.838884,-16.991398,-16.143265:2:-2.735818:118.137253:12

You can go to chr1:292-312 on the browser and get the sequence:

>hub_3671779_hs1_dna range=chr1:292-312 5'pad=0 3'pad=0 strand=+ repeatMasking=none
TAACCCTAACCCTAACCCTAA

And so you can see that the ALT sequence TAACCCTAACCCT is the same as the sequence in CHM13 at that position. But the REF sequence for this variant is just T, so the variant represents adding another copy of AACCCTAACCCT between the T and the existing copy; it's not a mistake in this case that the variant's ALT sequence is the same as the CHM13 reference's sequence at the variant's start position.

Kled111 commented 2 months ago

Maybe you're looking at a tandem insertion?

If you look at your variant:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR15368033
chr1 292 >73711>73698 T TAACCCTAACCCT 1702.3 PASS AT=>73711>73698,>73711<73710<73709<73708<73706<73703<73702<73701<73699>73698;DP=98 GT:DP:AD:GL:GQ:GP:XD:MAD 0/1:98:12,86:-185.838884,-16.991398,-16.143265:2:-2.735818:118.137253:12

You can go to chr1:292-312 on the browser and get the sequence:

>hub_3671779_hs1_dna range=chr1:292-312 5'pad=0 3'pad=0 strand=+ repeatMasking=none
TAACCCTAACCCTAACCCTAA

And so you can see that the ALT sequence TAACCCTAACCCT is the same as the sequence in CHM13 at that position. But the REF sequence for this variant is just T, so the variant represents adding another copy of AACCCTAACCCT between the T and the existing copy; it's not a mistake in this case that the variant's ALT sequence is the same as the CHM13 reference's sequence at the variant's start position.

Thanks, I tried to examine many entries on chr1 and found that the situation is basically as you described. The 'ref' still corresponds to the sequence of CHM13, but many of the preceding contents of the vcf file happen to match the 'alt' sequence as well. Apart from this, I also want to ask, after using the '-S' parameter, is most of the content in the output VCF file actually the detection results of variations between other samples and CHM13 on the original pan-genome graph? (I used the file https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz along with a short read sequencing fastq file (obtained through vg giraffe)). I noticed that the probability values calculated through the GL field are all very small. If I want to know the variations between the sequencing sample and the other samples that constitute the pan-genome, what should I do?"

adamnovak commented 2 months ago

Apart from this, I also want to ask, after using the '-S' parameter, is most of the content in the output VCF file actually the detection results of variations between other samples and CHM13 on the original pan-genome graph?

A lot of the snarls in the graph will be places where other samples in the graph differ from CHM13 but your sample matches it. If you use --genotype-snarls I think vg snarls emits all of these. Otherwise I think it only talks about 0/0 calls if there was some plausible chance they wouldn't have been called as 0/0. So I think your VCF should by default only contain descriptions of where your particular sample might vary from the linear reference you chose, but those descriptions will be informed by what other samples do at those locations.

I noticed that the probability values calculated through the GL field are all very small.

I think this is normal. The GL probabilities are the probability of getting the sequencing data that you actually have, given that the genotype is true. There's a very low probability of observing any particular sequencing data. I think it is computing something like "if my genotype is A/T, what is the chance that I sequence exactly 12 reverse-strand As, 11 forward-strand As, 7 reverse-strand Ts, 5 forward-strand Ts, and one forward-strand G?".

If I want to know the variations between the sequencing sample and the other samples that constitute the pan-genome, what should I do?

I don't think you can plug each sample in the pangenome in as -S and get a VCF against it, because most of the samples in https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz are diploid samples. They have two haplotypes in them, and so you can't call against the sample as a whole; you'd have to call against one haplotype or the other. And maybe @glennhickey knows something different, but I don't think that vg call lets you ask for that.

One option is to take your VCF from vg call, and use vg deconstruct to make a VCF describing the genotypes for all the samples in the graph, against the same reference. Then you could compare the two VCF files to see how your called sample relates to each of the samples already in the graph.

Another option would be to take a tool like the Sequence Tube Map and use it to visualize your sequencing reads as aligned to the graph, alongside the locally distinct haplotypes of samples stored in the graph. So if you have a VCF call for your sample you want to know more about, you could look at the reads that gave rise to it and determine whether they agree with any known variation or any haplotypes already in the graph. But this approach aggregates across the haplotypes stored in the graph and combines them when they match, so it won't let you get a list of, for example, the names of samples that have the same genotype as your sample does.

adamnovak commented 2 months ago

Another tip: we've mostly been using the "AF-Filtered VG Indexes" version of the graph for mapping and calling: https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.d9.gbz. This throws out anything that doesn't appear at least a certain number of times (I think 9 times), and we think it gets better mapping and calling accuracy. When the probability that your sample actually has any given piece of sequence that's in the graph starts to get small enough that it is close to the probability that one of your sequencing reads has an error that would match that piece of sequence, including more rare variants in the graph can start to be counterproductive for read mapping.