sgkit-dev / vcztools

Partial reimplementation of bcftools for VCF Zarr
Apache License 2.0
4 stars 3 forks source link

Drop Genotypes #61

Closed JakeHagen closed 2 months ago

JakeHagen commented 3 months ago

Hello

I am opening this draft pull request to see if there is interest in adding not only a drop_genotypes option to view but also a INFO field ingestion tool. I think this would be a great way to speed up annotation when using vcf-zarr with existing annotation tools. The workflow might look like this.

vcztools view -G 100k_sample.vcz | VEP | vcztools ingest --infos CSQ 100k_sample.vcz

Ive tested this workflow on a toy annotation example, it works well. Is there any reason this would not work more generally? The only I can think of is the number and ordering of the variants must be exactly the same coming back.

I am happy to work on both of these if there is interest.

tomwhite commented 3 months ago

Thanks for this @JakeHagen! I think -G/--drop-genotypes would be a nice addition, so we'd be happy to accept a PR for that.

I'm not sure that the VEP pipeline fits with the processing model we are trying to encourage here. This tool (vcztools) is for streaming small parts of the VCF Zarr store, it's not really for processing the full set of variants in the store. For that we have sgkit which can do out-of-core parallel processing.

Having said that, I think we'd be keen to understand better what full pipeline you are proposing here is used for. What would you do with the 100k_sample.vcz file in your example? Would it be used to select a subset of variants from the original store and then apply some analysis to them?

We have had various discussions about how to integrate VEP in the past (see https://github.com/sgkit-dev/sgkit/pull/1083, and https://github.com/search?q=repo%3Asgkit-dev%2Fsgkit+vep&type=issues more generally), so I think we'd like to accommodate it somehow in the sgkit-dev ecosystem.

Tagging @jeromekelleher and @benjeffery who may have more to say about this (although Jerome is away at the moment).

JakeHagen commented 3 months ago

Hi Tom

Thanks for the reply.

How I envision this working in our lab is the data would be joint called -> ingested into Zarr -> annotated -> subset by lab members for different analysis.

Since Im guessing very few annotation tools will operate directly on the Zarr format, annotation would need to be done before ingestion.

Having an info ingestion feature seems like it would:

These features would solve a lot of the pain points our lab has around annotation.

I am still learning about the Zarr format and sgkit ecosystem so maybe this is not practical or made pointless by a different tool but it seems like it could be a valuable feature of Zarr vcf to me. I have attached two simple scripts as an example of how I see this working. They would be called like so:

time vcztools view -G $LARGE_ZARR | python test_annotate.py | python test_ingest.py -z $LARGE_ZARR

1.71s

instead of

time vcztools view $LARGE_ZARR | python test_annotate.py > /dev/null

567.55s

zarr_annotate_ingest_sample_scripts.zip

tomwhite commented 3 months ago

Hi Jake!

Thanks for the explanation of your use case. I think I'd missed that the existing Zarr store is being updated - by adding a new field - and not being completely rewritten. For INFO field annotations (which are usually modest in size compared to FORMAT fields with many samples) this could be a good workflow. The -G option certainly speeds up the time taken to do annotation (with VEP or whatever), so it would be good to add that as I mentioned before.

The ingest command would be non-standard (compared to bcftools), so it could just be a Python script, or perhaps we could explore adding a plugin at some point if it's generally useful.

JakeHagen commented 3 months ago

Hi Tom thanks for the comments, I will add the tests next.

I added a new commit that removes the format header lines when using -G and removes the . after the INFO fields. This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

tomwhite commented 3 months ago

This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

I think it is. Hopefully @jeromekelleher can take a look too.

jeromekelleher commented 3 months ago

This involved a small edit in lib/vcf_encoder.c and lib was in the .gitignore is this not the correct place to make the edit?

This looks great - I don't really use .gitignore files, so they tend to be updated haphazardly, and can certainly be wrong in places.

jeromekelleher commented 3 months ago

This is an excellent update, thanks @JakeHagen!

The only thing I'm unsure about is whether having num_samples=0 and drop-genotypes are identical semantically. If bcftools is given a VCF file with zero samples, but has a FORMAT header (which is "." on every variant) what does it output? Does the VCF spec have anything to say about whether FORMAT should be included when there are zero samples?

Is it possible to provide zero samples (i.e. an empty file) to bcftools? What does it output? Making sure these awkward corner cases are handled in the same way as bcftools is an important goal for compatibility.

JakeHagen commented 3 months ago

Hi @jeromekelleher thanks for your comments. I could not find anything in the VCF spec about including the FORMAT when there are no samples, but bcftools removes all format related things (header lines, FORMAT label columns, sample columns) from the output when -G is used or an empty file is provided to -S Bcftools will also error out when given a vcf with a FORMAT column and a . for every variant. I gave it a file with this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr1    69081   chr1_69081_G_C  G       C       36      .       AF=0.000142;AQ=36;AC=0;AN=2     .

and it errored with this:

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:
        #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT

Failed to read from t.vcf: could not parse header

when passed through bcftools view

So far I think this PR handles everything the same as bcftools, but I will see if I can cover these cases in the tests

jeromekelleher commented 3 months ago

Excellent, thanks @JakeHagen. We should get some validation tests against bcftools for this, but that could be a follow up. Code LGTM.

JakeHagen commented 2 months ago

Ok I added two simple tests, let me know if you would like to see any others. Thanks!