Closed dudududu12138 closed 4 months ago
@dudududu12138 I can reproduce. Your *.seqwish.gfa
should contain still both your input sequences as paths. However, the query
path is lost after the smoothxg
step. @AndreaGuarracino you have any idea why?
Working on that! https://github.com/pangenome/smoothxg/pull/206
@dudududu12138, I've fixed the issue. Please update your pggb
installation (I see you are not using the latest), being sure to update also smoothxg
.
For your specific case, I suggest 3 modifications:
pggb
script to add -c 100
where wfmash
is called (unfortunately, this parameter is not exposed in pggb's interface)-l 0
to your command line, so pggb -i tmp.fa -o output -n 2 -t 2 -p 50 -s 100 -l 0 -V 'ref'
The 2nd and 3rd modifications are needed because pggb
and its tools are tuned for longer sequences. If you apply all those 3 changes, you will get such a VCF file:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=ref#1#r1,length=3941>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT query#1#q1
ref#1#r1 300 >1>3 CCCTTGTGATCTGCTTAGTTCCCACCCCCCTTTAAGAATTCAATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCAGTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGTGATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTAATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTTTTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTTTAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAAAAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGC C 60 . AC=1;AF=1;AN=1;AT=>1>2>3,>1>3;NS=1;LV=0 GT 1
Hello, I just test pggb on a simulated data. I simulated a reference file(.fasta), the length of this reference is 3941bp. Then I simulated a query file(.fasta),the length of this query is 600bp. This query sequence was taken from bases 1-300 and 800-1200 of the ref sequence. I use this 2 sequences to simulate deletion. But the final result report a null vcf file. This is my code:
Below is the final vcf file: And below is my simulated data which contains the reference and query sequences: tmp.txt