Closed JakeHagen closed 2 months ago
Thanks for this @JakeHagen! I think -G/--drop-genotypes
would be a nice addition, so we'd be happy to accept a PR for that.
I'm not sure that the VEP pipeline fits with the processing model we are trying to encourage here. This tool (vcztools) is for streaming small parts of the VCF Zarr store, it's not really for processing the full set of variants in the store. For that we have sgkit which can do out-of-core parallel processing.
Having said that, I think we'd be keen to understand better what full pipeline you are proposing here is used for. What would you do with the 100k_sample.vcz
file in your example? Would it be used to select a subset of variants from the original store and then apply some analysis to them?
We have had various discussions about how to integrate VEP in the past (see https://github.com/sgkit-dev/sgkit/pull/1083, and https://github.com/search?q=repo%3Asgkit-dev%2Fsgkit+vep&type=issues more generally), so I think we'd like to accommodate it somehow in the sgkit-dev ecosystem.
Tagging @jeromekelleher and @benjeffery who may have more to say about this (although Jerome is away at the moment).
Hi Tom
Thanks for the reply.
How I envision this working in our lab is the data would be joint called -> ingested into Zarr -> annotated -> subset by lab members for different analysis.
Since Im guessing very few annotation tools will operate directly on the Zarr format, annotation would need to be done before ingestion.
Having an info ingestion feature seems like it would:
These features would solve a lot of the pain points our lab has around annotation.
I am still learning about the Zarr format and sgkit ecosystem so maybe this is not practical or made pointless by a different tool but it seems like it could be a valuable feature of Zarr vcf to me. I have attached two simple scripts as an example of how I see this working. They would be called like so:
time vcztools view -G $LARGE_ZARR | python test_annotate.py | python test_ingest.py -z $LARGE_ZARR
1.71s
instead of
time vcztools view $LARGE_ZARR | python test_annotate.py > /dev/null
567.55s
Hi Jake!
Thanks for the explanation of your use case. I think I'd missed that the existing Zarr store is being updated - by adding a new field - and not being completely rewritten. For INFO field annotations (which are usually modest in size compared to FORMAT fields with many samples) this could be a good workflow. The -G
option certainly speeds up the time taken to do annotation (with VEP or whatever), so it would be good to add that as I mentioned before.
The ingest
command would be non-standard (compared to bcftools), so it could just be a Python script, or perhaps we could explore adding a plugin at some point if it's generally useful.
Hi Tom thanks for the comments, I will add the tests next.
I added a new commit that removes the format header lines when using -G
and removes the .
after the INFO fields.
This involved a small edit in lib/vcf_encoder.c
and lib
was in the .gitignore
is this not the correct place to make the edit?
This involved a small edit in
lib/vcf_encoder.c
andlib
was in the.gitignore
is this not the correct place to make the edit?
I think it is. Hopefully @jeromekelleher can take a look too.
This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?
This looks great - I don't really use .gitignore files, so they tend to be updated haphazardly, and can certainly be wrong in places.
This is an excellent update, thanks @JakeHagen!
The only thing I'm unsure about is whether having num_samples=0
and drop-genotypes
are identical semantically. If bcftools is given a VCF file with zero samples, but has a FORMAT header (which is "." on every variant) what does it output? Does the VCF spec have anything to say about whether FORMAT should be included when there are zero samples?
Is it possible to provide zero samples (i.e. an empty file) to bcftools? What does it output? Making sure these awkward corner cases are handled in the same way as bcftools is an important goal for compatibility.
Hi @jeromekelleher thanks for your comments.
I could not find anything in the VCF spec about including the FORMAT when there are no samples, but bcftools removes all format related things (header lines, FORMAT label columns, sample columns) from the output when -G
is used or an empty file is provided to -S
Bcftools will also error out when given a vcf with a FORMAT column and a .
for every variant.
I gave it a file with this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 69081 chr1_69081_G_C G C 36 . AF=0.000142;AQ=36;AC=0;AN=2 .
and it errored with this:
[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
Failed to read from t.vcf: could not parse header
when passed through bcftools view
So far I think this PR handles everything the same as bcftools, but I will see if I can cover these cases in the tests
Excellent, thanks @JakeHagen. We should get some validation tests against bcftools for this, but that could be a follow up. Code LGTM.
Ok I added two simple tests, let me know if you would like to see any others. Thanks!
Hello
I am opening this draft pull request to see if there is interest in adding not only a drop_genotypes option to
view
but also a INFO field ingestion tool. I think this would be a great way to speed up annotation when using vcf-zarr with existing annotation tools. The workflow might look like this.Ive tested this workflow on a toy annotation example, it works well. Is there any reason this would not work more generally? The only I can think of is the number and ordering of the variants must be exactly the same coming back.
I am happy to work on both of these if there is interest.