jdamas13 opened this issue 1 year ago
Hi @jdamas13, can you confirm you are using vg 1.40.0?
To be clear, this is not the most recent vg version. There has been a regression in vg deconstruct in recent releases, so only a specific range of versions, ending at 1.40.0, works.
Hi, I was using the nf-core/pangenome dev docker container, which has vg: variation graph tool, version v1.40.0 "Suardi".
I got the same error using the latest Singularity version.
vg deconstruct -P Cantata -H # -e -a -t 4 community.9/pg2-pg5_prefixed-50kb.community.9.fa.gz.bf3285f.11fba48.867196c.smooth.final.gfa
457.70s user 15.60s system 222% cpu 213.00s total 2875776Kb max memory
[vg::deconstruct] decompose VCF
vcfwave 1.0.7 processing...
error: more sample names in header than sample fields
samples: PG5
line: Cantata#1#Scf9YQZ_25_HRSCAF_39 19 >1823810>1823812 CC C 60.0 . AC=0;AF=0;AN=0;AT=<1823812<1823811<1823810,<1823812<1823810;NS=0;LV=0 GT
Command exited with non-zero status 1
vg: version: v1.40.0 deconstruct: Cantata:1000 reporting: version: v1.21 multiqc: true
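The vcfwave error above is, at bottom, a column-count check: the header declares one sample (PG5), but the failing record carries no sample field after FORMAT. A minimal sketch of that validation in awk (a hypothetical stand-in with toy data, not vcfwave's actual code):

```shell
# Toy two-record VCF: the header declares one sample, but the first record
# has no sample field after FORMAT -- the situation vcfwave flags.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPG5\nchr1\t5\t.\tCC\tC\t60\t.\t.\tGT\nchr1\t9\t.\tA\tC\t60\t.\t.\tGT\t1\n' |
awk -F'\t' '
  /^#CHROM/ { nsamp = NF - 9 }   # columns after FORMAT are sample columns
  /^#/      { next }             # skip the rest of the header
  NF - 9 != nsamp {              # mismatch: report it, much as vcfwave does
    printf "error: %d sample fields, %d samples in header at %s:%s\n",
           NF - 9, nsamp, $1, $2
  }'
```

Run on the toy input, this reports only the 9-column record; the well-formed second record passes.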
This looks to be the same error as https://github.com/ComparativeGenomicsToolkit/cactus/issues/1416 and possibly https://github.com/ComparativeGenomicsToolkit/cactus/issues/1402
I'm looking into it now, and it seems to be caused by: vg deconstruct writes the genotype as ., and vcfbub then replaces the . genotype with a completely empty column, producing an invalid VCF.

It seems strange that this error is only now coming up, as vg deconstruct and vcfbub haven't changed much at all lately (though deconstruct will be heavily refactored in the next vg release). Update: I just noticed the original issue here is a year old -- that makes more sense!

I am going to double-check the deconstruct end today (I think the . genotype is coming from its conflict resolution and is probably by design). But it seems like there is a bug in vcfbub (by way of the API it's using to write VCF) that, by stripping . genotype columns, produces invalid VCF. @ekg @AndreaGuarracino let me know if you want some data to reproduce.
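The failure mode described above can be sketched with a toy single-sample record (hypothetical data, not taken from the run): keeping the genotype as . leaves a valid 10-column record, while dropping the column leaves a FORMAT field with no samples.

```shell
# Hypothetical single-sample record; tabs separate the VCF columns.
kept="$(printf 'chr1\t10\t.\tCC\tC\t60\t.\tAC=0\tGT\t.')"   # "." genotype kept
dropped="$(printf 'chr1\t10\t.\tCC\tC\t60\t.\tAC=0\tGT')"   # column stripped away
printf '%s\n' "$kept"    | awk -F'\t' '{print NF}'   # 10 columns: valid
printf '%s\n' "$dropped" | awk -F'\t' '{print NF}'   # 9 columns: FORMAT with no sample, invalid
```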
Would love data, please send!
Here is a VCF file that vcfbub invalidates by virtue of erasing the sample column for records where the GT is .:
wget -q http://public.gi.ucsc.edu/~hickey/debug/region.vcf.gz
zcat region.vcf.gz | awk '{print $1 "\t" $2 "\t" $9 "\t" $10}' | tail -5
NC_054371.1 30059010 GT 1
NC_054371.1 30059019 GT .
NC_054371.1 30059027 GT 1
NC_054371.1 30059035 GT 1
NC_054371.1 30059046 GT .
vcfbub --input region.vcf.gz --max-ref-length 100000 --max-level 0 > region.bub.vcf
tail -5 region.bub.vcf
cat region.bub.vcf | awk '{print $1 "\t" $2 "\t" $9 "\t" $10}' | tail -5
NC_054371.1 30059010 GT 1
NC_054371.1 30059019 GT
NC_054371.1 30059027 GT 1
NC_054371.1 30059035 GT 1
NC_054371.1 30059046 GT
bcftools view region.vcf.gz > /dev/null
# fine
bcftools view region.bub.vcf > /dev/null
[E::vcf_parse_format_empty1] FORMAT column with no sample columns starting at NC_054371.1:30057051
[E::vcf_parse_format_empty1] FORMAT column with no sample columns starting at NC_054371.1:30057221
[E::bcf_write] Broken VCF record, the number of columns at NC_054371.1:30057051 does not match the number of samples (0 vs 1)
[main_vcfview] Error: cannot write to (null)
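Until vcfbub is fixed, one possible workaround (my own sketch, not something suggested in this thread, and only valid for single-sample VCFs) is to re-insert a . genotype wherever the sample column was dropped, which restores a parseable file:

```shell
# Stand-in for the vcfbub output from the thread (toy data, same shape):
# a header, one record missing its sample column, and one intact record.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPG5\nchr1\t5\t.\tCC\tC\t60\t.\t.\tGT\nchr1\t9\t.\tA\tC\t60\t.\t.\tGT\t1\n' > region.bub.vcf
# Repair: any data line left with only 9 fields (FORMAT but no sample)
# gets a "." genotype appended; headers and intact records pass through.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/    { print; next }
     NF == 9 { print $0, "."; next }
             { print }' region.bub.vcf > region.fixed.vcf
```

This is a band-aid for a single-sample file only; with multiple samples there is no way to tell which column was stripped, so the real fix has to land in vcfbub itself.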
Hi, I am trying to generate a pangenome using two genome assemblies of the same species. I started by running partition_before_pggb, and now I am running the pggb command for each community. My jobs are being killed at the vg deconstruct step; the message I am getting is shown below. I notice that, for every chromosome, the run errors on the first line in the VCF file that has the CONFLICT flag.
Do you know why this is happening and how to fix it? I appreciate any help you can provide.