vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

Missing genotypes in vcf file produced by vg deconstruct #3756

Closed Yutang-ETH closed 1 year ago

Yutang-ETH commented 1 year ago

1. What were you trying to do? Try to identify variants by vg deconstruct with gfa file produced by PGGB

2. What did you want to happen? I would like to get a vcf file from vg deconstruct

3. What actually happened? I did get a vcf file from vg deconstruct, but some genotypes are missing in the vcf. For example, there are 14 samples in the vcf, but some lines only have 13 genotypes. There's one genotype missing, either no "." or no number. I found this when I was using bcftools view to check the file, bcftools view myvcf.vcf | tail, returns me,

(vg) bcftools view rabiosa_deconstruct_unphased.vcf | tail
[W::vcf_parse_info] INFO 'CONFLICT' is not defined in the header, assuming Type=String
[E::vcf_parse_format] Couldn't read GT data: value not a number or '.' at rabiosa_0_chr1:3878714
[E::vcf_parse_format] Couldn't read GT data: value not a number or '.' at rabiosa_0_chr1:3878719
Error: VCF parse error

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

None. Vg deconstruct finished without error.

5. What data and command can the vg dev team use to make the problem happen?

vg deconstruct -p rabiosa_0_chr1 -p rabiosa_0_chr2 -p rabiosa_0_chr3 -p rabiosa_0_chr4 -p rabiosa_0_chr5 -p rabiosa_0_chr6 -p rabiosa_0_chr7 -a -e -t 48 --ploidy 1 -v ${mygfa}/Rabiosa.gfa > rabiosa_deconstruct_unphased.vcf

6. What does running vg version say?

vg version v1.43.0 "Barisano"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux
Linked against libstd++ 20210601
Built by anovak@octagon
glennhickey commented 1 year ago

Hmm, that is weird. Does it work properly when using vg v1.4.0 ? If so, then it's related to #3753 where the new "path sense" logic is causing trouble with GFA input.

In either case, if you can share your input or enough of a subset of it to reproduce the problem, that would help with debugging.

Yutang-ETH commented 1 year ago

Thank you very much for your reply. Not sure about vg v1.4.0 but I tried 1.41.0 and I got the same error. I am new to the pangenome field, so I don't think I really understand the meaning of "path sense", if it is possible, could you please explain more about that to me? May I have your email address? I could send a link to you to download our gfa file. What do you think?

Thank you very much again for your help.

Best wishes, Yutang

glennhickey commented 1 year ago

Sure, it's my username here at gmail. But please try v1.40 if you can (sorry for my typo above when I wrote 1.4.0)

Yutang-ETH commented 1 year ago

Thank you very much. Probably, let me try v 1.40 first and then I will get back to you. I think I will just keep this thread open for a while.

Best wishes, Yutang

Yutang-ETH commented 1 year ago

Hi Glenn,

Good news, I changed my command and tried v 1.40 and v 1.43, the issue is gone! So probably there's no bug if I use the command correctly.

This time I didn't set polyploidy to 1 but just use the default 2, and I specify -H "?"

vg deconstruct -p rabiosa_0_chr1 -p rabiosa_0_chr2 -p rabiosa_0_chr3 -p rabiosa_0_chr4 -p rabiosa_0_chr5 -p rabiosa_0_chr6 -p rabiosa_0_chr7 -H '?' -a -e -t 48 -v ${mygfa}/Rabiosa.gfa > rabiosa_deconstruct_all_vg140.vcf
vg deconstruct -p rabiosa_0_chr1 -p rabiosa_0_chr2 -p rabiosa_0_chr3 -p rabiosa_0_chr4 -p rabiosa_0_chr5 -p rabiosa_0_chr6 -p rabiosa_0_chr7 -H '?' -a -e -t 48 -v ${mygfa}/Rabiosa.gfa > rabiosa_deconstruct_all_vg143.vcf

Then I checked the vcf file by

bcftools view  rabiosa_deconstruct_all_vg140.vcf | tail
bcftools view  rabiosa_deconstruct_all_vg143.vcf | tail

The command just ran through smoothly and printed out the last a few lines. So, I guess it is fine.

A little background to make things more clear here, I am studying a human-size heterozygous diploid plant genome, I made a unphased reference and based on the unphased reference I made two phased assemblies representing two haplotypes. All assemblies are chromosome-level and I used pggb to construct the pangenome with these three assemblies (They are named as rabiosa_0_chr1, rabiosa_1_chr1 and rabiosa_2_chr1 and so on). Now, what I want to do is find variations between these assemblies by vg deconstruct and then since we have some short reads from a F1 population, I want to map these reads to the graph to genotype the variants we find from vg deconstruct. With the genotype in the F1 population, probably we could do some association test.

Which genotyping tool would you recommend based on our case, vg giraffe + vg call or pangenie? Thank you very much.

Best wishes, Yutang

glennhickey commented 1 year ago

Glad you got it working, but I guess it's still quite possible there is a bug related to ploidy=1.

I think pangenie is will usually have better genotype accuracy than vg call, especially at low read coverage. But it may be worth trying them both.

Yutang-ETH commented 1 year ago

Thank you very much for your reply and suggestion. Yes, I agree, there might be a little bug with ploidy=1 and I will try both strategies for genotyping variants as you suggested.

Best wishes, Yutang

ASLeonard commented 1 year ago

Did you need any additional datasets to debug this on? I've been running a similar command to Yutang, focusing on ploidy=1 (-a -e -d 1). I also see the issue in v1.40 as well as the other more recent releases.

This was the distribution for the number of genotypes per entry, so overwhelmingly they are printed correctly but there are some where 10 out of the 11 GTs are non-existent and just appear as double tabs (so can be fixed by sed etc).

      8 1
     56 2
    281 3
    271 4
    512 5
   1196 6
   1750 7
   1642 8
   3697 9
   2753 10
1037682 11

I tried as well with the most recent image (vg_ci-516-e4595514ba210f1c5ca614071e4b80cedc32ef83), and the numbers were a bit different, but not sure if the algorithm is expected to be deterministic or not.

      8 1
     55 2
    278 3
    275 4
    514 5
   1195 6
   1747 7
   1642 8
   3697 9
   2753 10
1037684 11
glennhickey commented 1 year ago

@ASLeonard this definitely looks like a bug. Any datasets to reproduce would be appreciated.

ASLeonard commented 1 year ago

@glennhickey

I tried slicing part of the graph out, but it messed up all the P-lines and so the resulting vcf wasn't very useful. The full graph is 113 Mb after compression, and hopefully should be accessible here. I ran vg deconstruct -p HER -a -e -d 1 -t 2 with vg 1.44.0 and got this vcf. The first sites missing a genotype are

HER:63690 HER:63749

The graph itself was created by cactus 2.3.0 along these lines.

Let me know if I can provide anything else. Best, Alex