malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License
14 stars 13 forks source link

Script to select variants and recode alleles for phasing #50

Closed alimanfoo closed 3 years ago

alimanfoo commented 4 years ago

This PR adds a script which performs site selection and allele recoding of genotypes from a single sample, whose genotypes are in zipped zarr format (output by the mosquito SNP genotyping pipeline).

The output is in VCF format and intended as input to read-backed phasing.

Includes a unit test based on the example provided here with correction here.

Part of addressing #44.

alimanfoo commented 4 years ago

Example usage:

$ python sample_select_variants.py \
    --sample-genotypes fixture/sample_genotypes.zarr.zip \
    --sites-called fixture/sites_called.zarr.zip \
    --sites-selected fixture/sites_selected.zarr.zip \
    --output output.vcf \
    --contig 2R

Can be run on multiple contigs by providing multiple values for the --contig argument.

alimanfoo commented 4 years ago

Hi @jessicaway, seeing as how there wasn't an existing tool to do this, I thought I'd go ahead and implement something using the tools we have for working with variation data in the python/numpy ecosystem.

This script does the site selection and allele recoding in one step, reading from zarr as input (produced by SNP genotyping pipeline) and writing to VCF as output. So hopefully this will simplify the pipeline a little, and should also be reasonably efficient.

The script also requires the sites called and the sites selected to be provided as zipped zarr files on the local file system. If you don't have these I can provide.

Hope this makes sense, comments very welcome.

cc @puethe, @kbergin

alimanfoo commented 4 years ago

Also now tested against real An. gambiae data, pushed a couple of small fixes.

I've uploaded the sites files in zipped zarr format to GCS in my malariagen-pipelines-dev bucket in case useful.

Here's the sites called:

Here's the sites selected (one for each of the three species combinations):

E.g., having downloaded these locally, running against one of the samples used for validation of the SNP genotyping pipeline:

$ python sample_select_variants.py --sample-genotypes ~/workspace/AA0052-C.zarr.zip --sites-called ~/workspace/AgamP4.allsites.nonN.zarr.zip --sites-selected ~/workspace/ag3_phased_sites_gamb_colu_arab.zarr.zip --output ~/workspace/output.vcf --contig 3L --progress
2020-10-15T11:38:14.776441 :: 3L begin
2020-10-15T11:38:24.662671 :: 3L 6,718,113 (1,000,000 rows)
2020-10-15T11:38:31.841631 :: 3L 11,989,287 (2,000,000 rows)
2020-10-15T11:38:39.121931 :: 3L 15,807,521 (3,000,000 rows)
2020-10-15T11:38:46.463254 :: 3L 19,346,099 (4,000,000 rows)
2020-10-15T11:38:53.855794 :: 3L 22,990,887 (5,000,000 rows)
2020-10-15T11:39:01.226410 :: 3L 27,105,258 (6,000,000 rows)
2020-10-15T11:39:08.608687 :: 3L 30,553,537 (7,000,000 rows)
2020-10-15T11:39:15.975852 :: 3L 34,031,604 (8,000,000 rows)
2020-10-15T11:39:23.315657 :: 3L 37,712,324 (9,000,000 rows)
2020-10-15T11:39:30.721934 :: 3L 40,938,120 (10,000,000 rows)
2020-10-15T11:39:33.088608 :: 3L done

One contig takes just over a minute to run, so whole genome would probably take 5 minutes or so.