Closed alimanfoo closed 2 years ago
cc @gbggrant, @jessicaway, @kbergin
Re getting the input files (BAMs and zipped zarrs) into GCS, I'm thinking to copy them to our vo_agam_release bucket, which is the bucket we use for publicly releasing all data from Ag 3.0. That is a public bucket, so should be accessible to you for the phasing run, and would also be accessible to anyone else who wanted to work from the BAMs or zipped zarrs within Google Cloud. Does that sound OK?
That sounds good to me. Thanks @alimanfoo!
OK, I think all the inputs are in place ready for the phasing runs. Here's some info on where everything is.
Note that we want to perform three separate runs of the phasing pipeline, each with a different setup. I will call these three runs "gamb_colu_arab", "gamb_colu" and "arab". Here's an explanation of each of these runs:
Note that the arab run has the fewest samples (368), so you may want to start with this to get a sense of compute resources and cost, then move on to gamb_colu (2445 samples), then finally gamb_colu_arab (2814 samples).
The sample manifests for the three phasing runs are on GCS. Each file is just a text file, one line per sample ID. Files are at the following URLs:
The sets of sites and alleles to be phased are at the following URLs:
BAM files for all samples are in the vo_agam_release bucket. A CSV file containing a mapping from sample identifiers to GCS URLs is available here:
gs://vo_agam_release/v3/alignments/catalog.csv
Zipped zarr files with SNP genotypes for each sample are in the vo_agam_release bucket. A CSV file containing a mapping from sample identifiers to GCS URLs is available here:
gs://vo_agam_release/v3/snp_genotypes/per_sample/catalog.csv
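For anyone wiring these inputs together, here is a minimal sketch of how a sample manifest (one sample ID per line) could be joined to one of the catalog CSVs to get the GCS URL for each sample in a run. The column names `sample_id` and `path`, the sample IDs and the `gs://example-bucket/...` URLs below are hypothetical placeholders, so please check the real catalog header before relying on this.

```python
import csv
import io


def select_catalog_rows(manifest_text, catalog_text, id_column="sample_id"):
    """Return catalog rows for the samples listed in a manifest, in manifest order.

    `id_column` is an assumed column name; adjust it to the real catalog header.
    """
    # Manifest: one sample ID per line, blank lines ignored.
    sample_ids = [line.strip() for line in manifest_text.splitlines() if line.strip()]
    # Catalog: CSV with a header row, indexed here by sample ID.
    rows_by_id = {
        row[id_column]: row for row in csv.DictReader(io.StringIO(catalog_text))
    }
    missing = [s for s in sample_ids if s not in rows_by_id]
    if missing:
        raise KeyError(f"samples not found in catalog: {missing}")
    return [rows_by_id[s] for s in sample_ids]


# Made-up example data, for illustration only.
manifest = "S0002\nS0001\n"
catalog = (
    "sample_id,path\n"
    "S0001,gs://example-bucket/S0001.bam\n"
    "S0002,gs://example-bucket/S0002.bam\n"
    "S0003,gs://example-bucket/S0003.bam\n"
)
for row in select_catalog_rows(manifest, catalog):
    print(row["sample_id"], row["path"])
```

Failing loudly on a sample that is in the manifest but not in the catalog seemed safer than silently dropping it, but that is a design choice rather than anything the pipeline requires.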
I've created new recombination map files, one for each chromosome arm. These should be in the shapeit gmap format, but let me know if there are any problems. Files are here:
The same recombination maps can be used for all three phasing runs.
I think that's all the inputs but let me know if I've missed anything :-) Cc @gbggrant @jessicaway @kbergin @seretol.
Hi @jessicaway, re sample metadata for terra, here is a notebook illustrating how to load sample metadata for all samples in Ag1000G phase 3. Also I'll attach the CSV export here: ag3_samples.csv.
Thanks @alimanfoo! I will get this added to the workspace
Surfacing discussion about next steps from this week's call.
We've decided to make changes for the gamb_colu_arab run to see if we can reduce costs and get predictable runtimes and resource requirements, by (1) using intervals with a fixed number of SNPs, and (2) upgrading SHAPEIT4 to version 4.2.1.
Plan:
@jessicaway, @kbergin, @gbggrant does that look right to you?
Here's the gamb_colu runtime plotted against the number of sites:
Based on this I'm tempted to go for 200k-site intervals. There should still be a decent amount of information in each interval, and runtimes should be well within 24hrs, so we could use preemptibles.
Hi @jessicaway, I've created intervals files for 200k SNPs intervals with 40k SNPs overlap. Here are the intervals for gamb_colu_arab:
When choosing a few intervals to use for testing and investigating resource requirements, I'd suggest picking an interval from somewhere near the middle of each chromosome, rather than the ends. I am hoping that the memory requirements will be nearly constant across all intervals, but the runtime may vary a bit if there is a lot more recombination (and hence less haplotype sharing) in some intervals versus others. Picking intervals near the middle of chromosomes should avoid the regions of low recombination near the centromere or telomere, and so give us a better sense of what the maximum resource requirements should be. I'd suggest doing a test run on maybe three intervals, from different chromosomes?
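To make the interval scheme concrete, here is a rough sketch of how windows with a fixed number of SNPs and a fixed overlap could be laid out over the sites on one chromosome arm. It works on site indices (half-open `[start, stop)`), lets the final window be shorter than the rest, and uses a hypothetical function name; it is not necessarily how the actual intervals files were generated.

```python
def snp_intervals(n_sites, size=200_000, overlap=40_000):
    """Split n_sites into half-open index windows of `size` SNPs.

    Consecutive windows share `overlap` SNPs. The last window is truncated
    at n_sites, so it may contain fewer than `size` SNPs.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if n_sites <= 0:
        return []
    step = size - overlap  # distance between consecutive window starts
    intervals = []
    start = 0
    while True:
        stop = min(start + size, n_sites)
        intervals.append((start, stop))
        if stop >= n_sites:
            break
        start += step
    return intervals


# Example with a made-up site count for one chromosome arm.
for start, stop in snp_intervals(500_000):
    print(start, stop)
```

To turn these into genomic coordinates, each index pair would be looked up in the sorted array of site positions for that arm, e.g. `positions[start]` and `positions[stop - 1]`.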
@alimanfoo Thanks so much for getting these ready! I'm hoping to get these tests going early next week
Hi @jessicaway, thanks so much for sharing the new shapeit4 runtimes. The results look great: it looks like we should be able to use preemptibles, and we can reduce memory a lot too.
@alimanfoo, of course! We are also very excited by the new results! We will add the preemptibles and back off the memory (and probably reduce the disk as well). I expect this run will be much cheaper
Hi @jessicaway,
Here are the intervals for the gamb_colu phasing run:
Here are the intervals for the arab phasing run:
Hi @jessicaway, sorry it took a bit longer than expected to check the gamb_colu_arab outputs. There was an unexpected thing where the output haplotypes for the crosses samples were not fully concordant with the input genotypes, which had me stumped for a while. Eventually I realised that these samples were not included when I ascertained the biallelic sites. I think what is happening is that in these samples, there are some additional alleles at the phasing sites, and so genotypes get converted to missing at the start of the pipeline, but then shapeit4 imputes missing genotypes, and so fills them in with something different. In any case I think this is expected behaviour, and everything else about the outputs looks good, so please go ahead with the gamb_colu and arab phasing runs.
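For reference, the kind of concordance check described above can be sketched as follows. This is only an illustration under assumed conventions (genotypes and haplotypes as per-site allele pairs for one sample, `-1` for a missing allele call), not the actual code used to check the outputs. Sites with a missing input call are skipped, because the phased output there is imputed and has nothing to be compared against.

```python
def genotype_concordance(genotypes, haplotypes, missing=-1):
    """Count sites where phased haplotypes carry the same alleles as the input.

    genotypes  -- list of (allele, allele) input calls for one sample
    haplotypes -- list of (allele, allele) phased calls at the same sites
    Returns (n_concordant, n_compared); phase order is ignored.
    """
    if len(genotypes) != len(haplotypes):
        raise ValueError("genotypes and haplotypes must cover the same sites")
    compared = 0
    concordant = 0
    for gt, hap in zip(genotypes, haplotypes):
        if missing in gt:
            continue  # input call is missing, so the output here is imputed
        compared += 1
        if sorted(gt) == sorted(hap):
            concordant += 1
    return concordant, compared


# Made-up calls at four sites for one sample.
gt = [(0, 1), (1, 1), (-1, -1), (0, 0)]
hap = [(1, 0), (1, 1), (0, 1), (0, 1)]
print(genotype_concordance(gt, hap))
```

If the comparison is instead made against the original calls, before additional alleles are converted to missing, then the imputed sites would show up as discordant, which matches the behaviour seen for the crosses samples.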
Thanks @alimanfoo! I will run the gamb_colu and arab sample sets as discussed
Closing as completed :cherry_blossom:
Issue to track production run of the mosquito phasing pipeline to build haplotypes for the Ag 3.0 (a.k.a. Ag1000G phase 3) data release.