malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License

Ag3.0 phasing pipeline production run #80

Closed alimanfoo closed 2 years ago

alimanfoo commented 3 years ago

Issue to track production run of the mosquito phasing pipeline to build haplotypes for the Ag 3.0 (a.k.a. Ag1000G phase 3) data release.

alimanfoo commented 3 years ago

cc @gbggrant, @jessicaway, @kbergin

Re getting the input files (BAMs and zipped zarrs) into GCS, I'm thinking of copying them to our vo_agam_release bucket, which is the bucket we use for publicly releasing all data from Ag 3.0. That is a public bucket, so it should be accessible to you for the phasing run, and would also be accessible to anyone else who wanted to work from the BAMs or zipped zarrs within Google Cloud. Does that sound OK?

jessicaway commented 3 years ago

That sounds good to me. Thanks @alimanfoo!

alimanfoo commented 3 years ago

OK, I think all the inputs are in place ready for the phasing runs. Here's some info on where everything is.

Note that we want to perform three separate runs of the phasing pipeline, each with a different setup. I will call these three runs "gamb_colu_arab", "gamb_colu" and "arab". Here's an explanation of each of these runs:

Note that the arab run has the fewest samples (368), so you may want to start with this to get a sense of compute resources and cost, then move on to gamb_colu (2445 samples), then finally gamb_colu_arab (2814 samples).

Sample manifests

The sample manifests for the three phasing runs are on GCS. Each file is just a text file, one line per sample ID. Files are at the following URLs:

Sites

The sets of sites and alleles to be phased are at the following URLs:

Alignments (BAM files)

BAM files for all samples are in the vo_agam_release bucket. A CSV file containing a mapping from sample identifiers to GCS URLs is available here:

gs://vo_agam_release/v3/alignments/catalog.csv

Genotypes (zipped zarr files)

Zipped zarr files with SNP genotypes for each sample are in the vo_agam_release bucket. A CSV file containing a mapping from sample identifiers to GCS URLs is available here:

gs://vo_agam_release/v3/snp_genotypes/per_sample/catalog.csv
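
For anyone wiring these inputs together, here is a minimal sketch of resolving a sample manifest (one sample ID per line) against one of the catalog CSVs. The column names `sample_id` and `url`, the sample IDs and the `gs://example-bucket/...` URLs are all made up for illustration; check the real catalog header before relying on this.

```python
import csv
import io


def load_manifest(text):
    """Parse a manifest: plain text, one sample ID per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_catalog(text, id_col="sample_id", url_col="url"):
    """Parse a catalog CSV into a dict mapping sample ID -> GCS URL.

    The default column names are assumptions, not the real header.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {row[id_col]: row[url_col] for row in reader}


def resolve_inputs(manifest, catalog):
    """Return (sample_id, url) pairs in manifest order.

    Raises KeyError if any manifest sample is absent from the catalog.
    """
    missing = [s for s in manifest if s not in catalog]
    if missing:
        raise KeyError(
            f"{len(missing)} sample(s) missing from catalog, e.g. {missing[0]}"
        )
    return [(s, catalog[s]) for s in manifest]


# Made-up example data standing in for a manifest and a catalog.
manifest_text = "S0002\nS0001\n"
catalog_text = (
    "sample_id,url\n"
    "S0001,gs://example-bucket/S0001.bam\n"
    "S0002,gs://example-bucket/S0002.bam\n"
    "S0003,gs://example-bucket/S0003.bam\n"
)

manifest = load_manifest(manifest_text)
catalog = load_catalog(catalog_text)
pairs = resolve_inputs(manifest, catalog)
```

Failing loudly on missing samples seems safer here than silently dropping them, since each run is expected to include every sample in its manifest.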

Recombination maps

I've created new recombination map files, one for each chromosome arm. These should be in the shapeit gmap format, but let me know if there are any problems. Files are here:

The same recombination maps can be used for all three phasing runs.


I think that's all the inputs but let me know if I've missed anything :-) Cc @gbggrant @jessicaway @kbergin @seretol.

alimanfoo commented 3 years ago

Hi @jessicaway, re sample metadata for Terra, here is a notebook illustrating how to load sample metadata for all samples in Ag1000G phase 3. I'll also attach the CSV export here: ag3_samples.csv.

jessicaway commented 3 years ago

Thanks @alimanfoo! I will get this added to the workspace

alimanfoo commented 3 years ago

Surfacing discussion about next steps from this week's call.

We've decided to make changes for the gamb_colu_arab run to see if we can reduce costs and get predictable runtimes and resource requirements, by (1) using intervals with a fixed number of SNPs, and (2) upgrading SHAPEIT4 to version 4.2.1.

Plan:

@jessicaway, @kbergin, @gbggrant does that look right to you?

alimanfoo commented 3 years ago

Here's the gamb_colu runtime plotted against the number of sites:

[image: gamb_colu runtime plotted against the number of sites]

Based on this I'm tempted to go for 200k-site intervals. There should still be a decent amount of information in each interval, and runtimes should be well within 24hrs, so we could use preemptibles.

alimanfoo commented 3 years ago

Hi @jessicaway, I've created interval files for 200k-SNP intervals with a 40k-SNP overlap. Here are the intervals for gamb_colu_arab:

When choosing a few intervals to use for testing and investigating resource requirements, I'd suggest picking an interval from somewhere near the middle of each chromosome, rather than the ends. I am hoping that the memory requirements will be nearly constant across all intervals, but the runtime may vary a bit if there is a lot more recombination (and hence less haplotype sharing) in some intervals versus others. Picking intervals near the middle of chromosomes should avoid the regions of low recombination near the centromere or telomere, and so give us a better sense of what the maximum resource requirements should be. I'd suggest doing a test run on maybe three intervals, from different chromosomes?
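
To make the interval scheme concrete, here is a rough sketch of splitting a sorted list of SNP positions into windows holding a fixed number of SNPs, with a fixed number of SNPs shared between neighbouring windows, and then picking the window nearest the middle for testing. This is not the code used to build the actual interval files; the function names are hypothetical, and how a short final window is handled (here it is simply kept) is an assumption.

```python
def snp_intervals(positions, size=200_000, overlap=40_000):
    """Split sorted SNP positions into windows of `size` SNPs.

    Consecutive windows share `overlap` SNPs. Returns a list of
    (start_position, end_position) tuples, both inclusive. The last
    window may contain fewer than `size` SNPs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    n = len(positions)
    intervals = []
    start = 0
    while start < n:
        stop = min(start + size, n)
        intervals.append((positions[start], positions[stop - 1]))
        if stop == n:
            break  # reached the end of the chromosome arm
        start += step
    return intervals


def middle_interval(intervals):
    """Pick the interval nearest the middle of the list."""
    return intervals[len(intervals) // 2]


# Toy example: 10 SNPs, windows of 4 SNPs sharing 1 SNP with the next window.
toy_positions = list(range(1, 11))
toy_intervals = snp_intervals(toy_positions, size=4, overlap=1)
toy_test_interval = middle_interval(toy_intervals)
```

On the toy data this gives the windows (1, 4), (4, 7) and (7, 10), with (4, 7) as the test interval. Because the windows are defined by SNP count rather than by genomic distance, the amount of data per job stays roughly constant even where SNP density varies.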

jessicaway commented 3 years ago

@alimanfoo Thanks so much for getting these ready! I'm hoping to get these tests going early next week.

alimanfoo commented 3 years ago

Hi @jessicaway, thanks so much for sharing the new shapeit4 runtimes. The results there look great. It looks like we should be able to use preemptibles, and we can reduce memory a lot too.

jessicaway commented 3 years ago

@alimanfoo, of course! We are also very excited by the new results! We will add the preemptibles and back off the memory (and probably reduce the disk as well). I expect this run will be much cheaper.

alimanfoo commented 3 years ago

Hi @jessicaway,

Here are the intervals for the gamb_colu phasing run:

Here are the intervals for the arab phasing run:

alimanfoo commented 3 years ago

Hi @jessicaway, sorry it took a bit longer than expected to check the gamb_colu_arab outputs. There was an unexpected thing where the output haplotypes for the crosses samples were not fully concordant with the input genotypes, which had me stumped for a while. Eventually I realised that these samples were not included when I ascertained the biallelic sites. I think what is happening is that in these samples, there are some additional alleles at the phasing sites, and so genotypes get converted to missing at the start of the pipeline, but then shapeit4 imputes missing genotypes, and so fills them in with something different. In any case I think this is expected behaviour, and everything else about the outputs looks good, so please go ahead with the gamb_colu and arab phasing runs.
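
As a toy illustration of the explanation above (not the pipeline's actual code), the sketch below recodes nucleotide calls at one phasing site against its two ascertained alleles. Any call carrying an allele outside that pair becomes missing, which is the point at which a phasing tool that imputes missing genotypes could fill in something different from the original call. The function name, the tuple representation and the example alleles are all assumptions made for illustration.

```python
MISSING = (-1, -1)


def recode_biallelic(calls, ref, alt):
    """Recode diploid nucleotide calls at one site to 0/1 allele indices.

    `calls` is a list of (allele, allele) pairs, one per sample. A call
    containing any allele other than `ref` or `alt` is set to MISSING.
    """
    index = {ref: 0, alt: 1}
    recoded = []
    for a, b in calls:
        if a in index and b in index:
            recoded.append((index[a], index[b]))
        else:
            recoded.append(MISSING)  # extra allele at this phasing site
    return recoded


# Phasing site ascertained as A/C; the third sample carries a T allele.
example_calls = [("A", "A"), ("A", "C"), ("A", "T")]
example_recoded = recode_biallelic(example_calls, ref="A", alt="C")
```

Here the third sample is recoded to missing, so whatever 0/1 genotype is later imputed for it cannot match its original A/T call, which would show up as discordance between input genotypes and output haplotypes.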

jessicaway commented 3 years ago

Thanks @alimanfoo! I will run the gamb_colu and arab sample sets as discussed

alimanfoo commented 2 years ago

Closing as completed :cherry_blossom: