vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

Discrepancy in SV Counts between vg deconstruct and Input SV Data #4185

Open YibinQiu opened 9 months ago

YibinQiu commented 9 months ago

1. What were you trying to do? I am currently working on constructing a pan-genome using the vg autoindex workflow with the Giraffe process. The input.sv.vcf.gz file includes data from 182 individuals, containing a total of 3.8 million SVs. In the initial run of vg autoindex --workflow giraffe, and subsequent SV genotyping using vg call (-a -z -A), only a fraction of SVs were successfully genotyped. To increase SV inclusion in the graph, we modified the genotypes of each SV in each individual to 0|1, 1|1, where the majority were originally denoted as .|.. Subsequently, I rerun autoindex and genotype structural variants (SVs) using vg giraffe, vg pack -Q 5 and vg call (-a -z -A).

2. What did you want to happen? I expected the SV genotyping results (output.sv.vcf) to be consistent with the input SV data (input.sv.vcf.gz) after deconstructing multi-allelic SVs using bcftools norm -m -any.

3. What actually happened? After the second attempt, while the SV count increased (around 1.8 million), it is still noticeably lower compared to the original input.sv.vcf.gz file. Additionally, we attempted to follow the example provided in https://github.com/vgteam/vg/wiki/Construction-Examples, specifically using vg construct (-a -S -R -C) to construct each chromosome individually. However, we encountered an issue with longer chromosomes, receiving the error "vg.Graph exceeded the maximum protobuf size of 2GB: 4415154172." (#3979,#4152)

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Place stacktrace here.

5. What data and command can the vg dev team use to make the problem happen? We modified the genotypes of each SV in each individual to 0|1, 1|1 in the input.sv.vcf.gz file following:

for chr in ${chrs[*]}
do
    echo  $(date +"%Y-%m-%d %T") start convert ${chr}
    awk -v OFS="\t" '
    BEGIN {FS=OFS; srand()}
    {
        if (/^#/) {
            print;
        } else {
            # arbitrarily assign ./. to 0|1 1|1
            for (i=10; i<=NF; i++) {
                if ($i == ".") {
                    $i = (int(rand() * 2) == 0) ? "0/1" : "1/1";
                }
                gsub(/\//, "|", $i);
            }
            print; 
        }
    }' inputs.${chr}_sv.vcf > convert.inputs.${chr}_sv.vcf
done

We run autoindex --workflow giraffe following:

echo  $(date +"%Y-%m-%d %T") begin vg autoindex
/bin/time -v -p vg autoindex --workflow giraffe -r $ref $(for chr in ${chrs[*]}; do echo -v convert.inputs.${chr}_sv.vcf.gz; done) --tmp-dir ${LARGE_DISK_TMP} -t 24 -p out
echo  $(date +"%Y-%m-%d %T") done vg autoindex
/bin/time -v -p vg snarls ${GRAPHS}/out.giraffe.gbz >${GRAPHS}/out.snarls
echo  $(date +"%Y-%m-%d %T") done vg snarls

We first deconstruct graph and compared with the original input.sv.vcf.gz file. We found some SVs were lost in input.sv.vcf.gz file.

vg deconstruct -t 24 ${GRAPHS}/out.giraffe.gbz -r ${GRAPHS}/out.snarls -p chr1 -a >${GRAPHS}/chr1out.vcf

新建 Microsoft PowerPoint 演示文稿

6. What does running vg version say?

vg version v1.53.0 "Valmontone"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu