pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License
346 stars 37 forks source link

A null vcf file #377

Closed dudududu12138 closed 4 months ago

dudududu12138 commented 4 months ago

Hello, I just test pggb on a simulated data. I simulated a reference file(.fasta), the length of this reference is 3941bp. Then I simulated a query file(.fasta),the length of this query is 600bp. This query sequence was taken from bases 1-300 and 800-1200 of the ref sequence. I use this 2 sequences to simulate deletion. But the final result report a null vcf file. This is my code:

pggb -i tmp.fa -o output -n 2 -t 2  -p 50 -s 100  -V 'ref'

Below is the final vcf file: image And below is my simulated data which contains the reference and query sequences: tmp.txt

subwaystation commented 4 months ago

@dudududu12138 I can reproduce. Your *.seqwish.gfa should contain still both your input sequences as paths. However, the query path is lost after the smoothxg step. @AndreaGuarracino you have any idea why?

AndreaGuarracino commented 4 months ago

Working on that! https://github.com/pangenome/smoothxg/pull/206

AndreaGuarracino commented 4 months ago

@dudududu12138, I've fixed the issue. Please update your pggb installation (I see you are not using the latest), being sure to update also smoothxg.

For your specific case, I suggest 3 modifications:

The 2nd and 3rd modifications are needed because pggb and its tools are tuned for longer sequences. If you apply all those 3 changes, you will get such a VCF file:


##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=ref#1#r1,length=3941>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  query#1#q1
ref#1#r1    300 >1>3    CCCTTGTGATCTGCTTAGTTCCCACCCCCCTTTAAGAATTCAATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCAGTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGTGATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTAATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTTTTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTTTAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAAAAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGC   C   60  .   AC=1;AF=1;AN=1;AT=>1>2>3,>1>3;NS=1;LV=0 GT  1