vc deconstruct error - more sample names in header than sample fields

jdamas13 commented 1 year ago

Hi, I am trying to generate a pangenome using two genome assemblies of the same species. I started by running partition_before_pggb, and now I am running the pggb command for each community. My jobs are being killed at the vg deconstruct step. The message I am getting is shown below. I notice that for every chromosome, the run errors for the first line in the vcf file that has the CONFLICT flag.

Do you know why this is happening and how to fix it? I appreciate any help you can provide.

[vg::deconstruct] decompose VCF
error: more sample names in header than sample fields
samples: query
line: ref#1#chr5  24977031        >61682>61685    A       C       60.0    .       AC=0;AF=0;AN=0;AT=>61682>61683>61685,>61682>61684>61685;CONFLICT=query;NS=0;LV=1;PS=>61254>62671   GT
Command exited with non-zero status 1

AndreaGuarracino commented 1 year ago

Hi @jdamas13, can you confirm you are using vg 1.40.0?

ekg commented 1 year ago

For clarity, this is not the most recent vg version. There has been a regression in vg deconstruct in recent versions, and only a specific range of versions, ending at 1.40.0, will work.

jdamas13 commented 1 year ago

Hi, I was using the nf-core/pangenome dev docker container, which has vg: variation graph tool, version v1.40.0 "Suardi".

vatanparast commented 3 months ago

I got the same error using latest singularity version.

vg deconstruct -P Cantata -H # -e -a -t 4 community.9/pg2-pg5_prefixed-50kb.community.9.fa.gz.bf3285f.11fba48.867196c.smooth.final.gfa
457.70s user 15.60s system 222% cpu 213.00s total 2875776Kb max memory
[vg::deconstruct] decompose VCF
vcfwave 1.0.7 processing...
error: more sample names in header than sample fields
samples: PG5
line: Cantata#1#Scf9YQZ_25_HRSCAF_39    19      >1823810>1823812        CC      C       60.0    .       AC=0;AF=0;AN=0;AT=<1823812<1823811<1823810,<1823812<1823810;NS=0;LV=0   GT
Command exited with non-zero status 1

vg: version: v1.40.0 deconstruct: Cantata:1000 reporting: version: v1.21 multiqc: true

glennhickey commented 3 months ago

This looks to be the same error as https://github.com/ComparativeGenomicsToolkit/cactus/issues/1416 and possibly https://github.com/ComparativeGenomicsToolkit/cactus/issues/1402

I'm looking into it now, and it seems to be caused by:

vg deconstruct writes genotype as .
then vcfbub replaces the . genotype with a completely empty column, producing an invalid VCF.

It seems strange that this error is only now coming up, as vg deconstruct and vcfbub haven't changed much at all lately (tho deconstruct will be very refactored in the next vg release). Update I just noticed the original issue here is a year old -- makes more sense!

I am going to double-check the deconstruct end today (I think the . genotype is coming from its conflict resolution and is probably by design). But it seems like there is a bug in vcfbub (by way of the api its using to write VCF) that by stripping . genotype columns produces invalid VCF. @ekg @AndreaGuarracino let me know if you want some data to reproduce.

ekg commented 3 months ago

Would love data please send!

glennhickey commented 3 months ago

Here is a VCF file that vcfbub invalidates by virtue of erasing the sample column for records where the GT is .

wget -q http://public.gi.ucsc.edu/~hickey/debug/region.vcf.gz
zcat region.vcf.gz |awk '{print $1 "\t" $2 "\t" $9 "\t" $10}' | tail -5
NC_054371.1     30059010        GT      1
NC_054371.1     30059019        GT      .
NC_054371.1     30059027        GT      1
NC_054371.1     30059035        GT      1
NC_054371.1     30059046        GT      .

vcfbub --input region.vcf.gz --max-ref-length 100000 --max-level 0 > region.bub.vcf
tail -5 region.bub.vcf
cat region.bub.vcf  | awk '{print $1 "\t" $2 "\t" $9 "\t" $10}' | tail -5
NC_054371.1     30059010        GT      1
NC_054371.1     30059019        GT
NC_054371.1     30059027        GT      1
NC_054371.1     30059035        GT      1
NC_054371.1     30059046        GT

bcftools view region.vcf.gz > /dev/null
# fine

bcftools view region.bub.vcf > /dev/null
[E::vcf_parse_format_empty1] FORMAT column with no sample columns starting at NC_054371.1:30057051
[E::vcf_parse_format_empty1] FORMAT column with no sample columns starting at NC_054371.1:30057221
[E::bcf_write] Broken VCF record, the number of columns at NC_054371.1:30057051 does not match the number of samples (0 vs 1)
[main_vcfview] Error: cannot write to (null)

pangenome / pggb

vc deconstruct error - more sample names in header than sample fields #287