vg autoindex: htslib complains about broken VCF record when using input from Ensembl

johanneskoester commented 1 year ago

1. What were you trying to do? Running vg autoindex on yeast assembly together with known variants from ensembl:

vg autoindex -t 2 --workflow giraffe -r resources/genome.dna.saccharomyces_cerevisiae.R64-1-1.100.fasta -v resources/variation.vcf.gz -p x --prefix resources/genome.dna.saccharomyces_cerevisiae.R64-1-1.100

2. What did you want to happen?

The index to be created.

3. What actually happened?

vg fails to chunk. The input VCF does not contain any sample definition, but somehow htslib complains about a missing sample in the first record. Could be a bug that might be fixed by updating the htslib dependency (the same error occurs in an at first sight unrelated and already fixed scenario here)? I am however suprised that this has not been seen before. I wonder if I miss something.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

[vg autoindex] Executing command: vg autoindex -t 2 --workflow giraffe -r resources/genome.dna.saccharomyces_cerevisiae.R64-1-1.100.fasta -v resources/variation.vcf.gz -p x --prefix resources/genome.dna.saccharomyces_cerevisiae.R64-1-1.100
[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[E::bcf_write] Broken VCF record, the number of columns at I:84 does not match the number of samples (0 vs 1)
error:[IndexRegistry] error writing VCF line to /tmp/vg-ANJBbS/dir-Qa9PgR/d44a1cf9b763bb44cc91ba9ee6abcc0bc8a368ec.0.chunked.vcf.gz

5. What data and command can the vg dev team use to make the problem happen?

6. What does running vg version say?

1.43.0

jeizenga commented 1 year ago

This PR should resolve your issue https://github.com/vgteam/vg/pull/3778. One caveat I have though is that vg giraffe works best when there are full haplotypes available from the VCF, especially when the variant density is high. With this VCF, I would not be very surprised if vg giraffe's performance is a bit sub-par.

johanneskoester commented 1 year ago

Good point. This was meant as a first test. Actually I would love to use the T2T pangenome that @aphillippy was presenting yesterday at #biodata22. So far I have not found a corresponding resource though. Do you happen to know from there that can be obtained?

jeizenga commented 1 year ago

I don't know about those resources in particular, but the HPRC has produced pangenome graphs from diploid assemblies that are "near T2T quality", which can be found here: https://github.com/human-pangenomics/hpp_pangenome_resources

johanneskoester commented 1 year ago

Awesome, thank you so much. I think this is what Adam was referring to.

johanneskoester commented 1 year ago

Nevertheless, could you also release a new version with above fix?

johanneskoester commented 1 year ago

My intuition would be that even if for a species there are just unphased variants available, one might be able to benefit from aligning against the resulting graph because of more accurate mapping qualities. Is that correct?

johanneskoester commented 1 year ago

Along these lines another short question if you don't mind: for the HPRC stuff, I'd like prefer to use the VCF as input to the index, as that seems more future proof and does not need to be kept in sync with particular vg releases. However, will running autoindex on that yield a graph of comparable quality as the corresponding minimap/cactus graph?

jeizenga commented 1 year ago

We do our releases on a fixed schedule. The next one is planned to be on 11/21, so not too far out.

I think your intuition about the unphased variants is correct about pangenome methods overall, but less true for vg giraffe. It's really focused on aligning to haplotype paths. Without phased variants, it's forced to "manufacture" them, which means it might spend its effort aligning to paths that are not biologically relevant (especially if the variant density is high). vg map and vg mpmap, which align directly to the graph topology, have only a touch of performance degradation without haplotypes.

The graph made from the VCF will be nearly identical in most genomic regions. Where the quality might decline is in regions where there is complex variation that is not easily represented by the VCF format's substitution model.

johanneskoester commented 1 year ago

Ok, sounds good with the release ;-).

Btw. a suggestion: with Snakemake and many other projects, I use release-please, which greatly simplifies the release process by automatically creating releases (and builds) based on conventional commits (or also PRs with conventional titles).

Ok, got it. So we should probably only use giraffe in the pipeline in case of human for now, where the phased graph is available.

I've thought about it and I guess it makes sense to just use the filtered graph gfa file designed for giraffe, but create the auxilliary files with autoindex then. This way, we remain independent of format changes and use a stable well specified input format, but still have the full variation represented in the graph as input instead of an undercomplex representation at those complex regions that we would get with the VCF.

vgteam / vg

vg autoindex: htslib complains about broken VCF record when using input from Ensembl #3776