populationgenomics / saige-tenk10k

Hail batch pipeline to run SAIGE on CPG's GCP
MIT License

get_genotype_vcf.py review #71

Open · katiedelange opened this issue 7 months ago

katiedelange commented 7 months ago

Some points to consider in get_genotype_vcf.py

MattWellie commented 7 months ago

I believe the current checkpointing approach is not going to save you anything

Just chiming in to say that this script might be a bit checkpoint-heavy at the moment, out of an abundance of caution. The MT densification step involves quite a lot of processing, but the filtering between these two checkpoints is pretty minor, so the first checkpoint can probably be safely removed.

The key here is checkpointing prior to the LD pruning method call, as that includes a further chunk of sample processing with no checkpoints, which was causing Hail/Java OOM errors. We added an overly conservative number of checkpoints; if time allows, this can/should be pared down to the minimum number needed to get the job done.

The checkpoint_mt method does perform the 'if it exists, read, otherwise write' check

katiedelange commented 7 months ago

The checkpoint_mt method does perform the 'if it exists, read, otherwise write' check

I was actually meaning the line above, but that's where my Hail is rusty re: when things get evaluated. My interpretation of the code is that we always take the VDS and split out multiallelics + densify, but then we only save that result to a checkpoint if it doesn't already exist (and in fact, if the checkpoint does exist, we go ahead and replace the MatrixTable we just spent a bunch of time calculating with the old checkpoint data). But does the Hail evaluation magic save us here?

MattWellie commented 7 months ago

Ah sorry... yes, or at least that's my understanding. Everything, including the densification, is evaluated lazily AFAIK, but admittedly I've not spent much time with VDSs compared with MTs.
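To illustrate, a minimal sketch of the read-or-write pattern being discussed (the repo's actual checkpoint_mt helper may differ in detail). Because Hail builds a lazy computation graph, the expensive densification upstream of the MatrixTable should only execute if the checkpoint actually has to be written:

```python
import hail as hl

def checkpoint_mt(mt: hl.MatrixTable, path: str) -> hl.MatrixTable:
    """Read the checkpoint if it already exists, otherwise write it."""
    if hl.hadoop_exists(path):
        # Nothing upstream of `mt` is executed here: since Hail pipelines
        # are lazy, the densification never runs if we can read the old result.
        return hl.read_matrix_table(path)
    # This write is the first action that forces evaluation of `mt`,
    # including the multiallelic split and densification upstream of it.
    return mt.checkpoint(path)
```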

annacuomo commented 7 months ago

Hi @katiedelange, just going through this now, I'll try to respond one point at a time! Also, as I mentioned to you on Slack, I am adding TODOs to improve the pipeline to the GDoc linked in the repo README.

Could you briefly explain the logic behind this sampling approach / add it to the README?

The logic for this is simply that the sample_rows method (as far as I can tell) expects a proportion of values to retain, rather than an actual number: so, for example, you can say "randomly select 20% (0.2) of variants", not "randomly select 1,000 variants".

So I get to my desired vre_n_markers by sampling at a proportion of approximately vre_n_markers divided by the total number of variants (multiplied by 1.1 to make sure I sample in excess of it), and then taking the first vre_n_markers via head to get to the precise number.

There might be a better way to do this but this was what I came up with at the time :)

Probably enough to describe this briefly in the comment above this line? Or I can add it to the README if you think that's better.
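For illustration, a rough sketch of that logic, with hypothetical names (`mt` for the filtered MatrixTable, `vre_n_markers` for the target marker count):

```python
import hail as hl

def subsample_variants(mt: hl.MatrixTable, vre_n_markers: int = 2000) -> hl.MatrixTable:
    n_total = mt.count_rows()
    # sample_rows expects a proportion to retain, not a count, so oversample
    # by 10% to make it likely we end up with at least vre_n_markers rows...
    p = min(1.0, (vre_n_markers * 1.1) / n_total)
    mt = mt.sample_rows(p, seed=0)
    # ...then trim to exactly the desired number of variants
    return mt.head(vre_n_markers)
```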

annacuomo commented 7 months ago

@katiedelange

Extracting the chr values from the bim (used to clean up the variants selected for variance ratio estimation) will leave empty values for the X, Y and MT chromosomes. However, you don't explicitly exclude these chromosomes when creating the initial variant subset, and so you may end up with missing chr data being passed to SAIGE. I'd recommend either excluding these chromosomes from variant selection (if they are potentially going to skew variance ratio estimation), or adjusting your regex to (\d+|X|Y|MT).

This is an excellent point and I should've thought about this harder! I don't have a clear sense of whether variants from these chromosomes may skew the VRE calculation or not. I will discuss with Wei and then either remove these prior to subsetting or adjust the regex (and if I don't get around to this, I have added it to the TODO list so someone can potentially take it on later instead).
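For illustration, a quick sketch of the suggested regex change; the `chr` prefix on the .bim values is an assumption here, and the key change is widening `(\d+)` to `(\d+|X|Y|MT)` so the non-numeric chromosomes are no longer left empty:

```python
import re

# Hypothetical example values from a .bim chromosome column
for value in ["chr22", "chrX", "chrY", "chrMT"]:
    match = re.search(r"chr(\d+|X|Y|MT)", value)
    print(value, "->", match.group(1) if match else "<missing>")
# chr22 -> 22, chrX -> X, chrY -> Y, chrMT -> MT
```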

annacuomo commented 7 months ago

@katiedelange

You may wish to predicate the downsampling to 1% based on the number of variants left, to avoid accidentally downsampling to less than 2000 variants?

Yup, another good point, and something I have actually already encountered when testing (not many variants left when filtering for MAC > 20 with 5 individuals, haha). I have added this to the TODO.
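Something like this guard is what I have in mind (a sketch only; `min_markers` and the 1% rate are placeholders):

```python
import hail as hl

def maybe_downsample(mt: hl.MatrixTable, p: float = 0.01, min_markers: int = 2000) -> hl.MatrixTable:
    """Only downsample if doing so is expected to leave enough variants."""
    n_variants = mt.count_rows()
    if n_variants * p >= min_markers:
        return mt.sample_rows(p, seed=0)
    # With few variants (e.g. small test cohorts after the MAC > 20 filter),
    # keep everything rather than dropping below the minimum.
    return mt
```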

annacuomo commented 7 months ago

@katiedelange

Just noting that you use MAF > Threshold. It's done consistently in the code and README, so is fine, but I feel like most studies I've seen include the lower threshold in their tested range (i.e. keep MAF >= threshold).

Mmmh yes, I noticed I changed this in this recent PR, where I extract rare variants as well! Is this what you'd do: define common variants as MAF >= T, vs rare as MAF < T? Happy to do whatever you think makes most sense, and update the README accordingly.
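For concreteness, a sketch of that convention, assuming a precomputed `maf` row field (field name assumed):

```python
import hail as hl

def split_by_maf(mt: hl.MatrixTable, threshold: float = 0.01):
    # The convention floated above: the threshold itself counts as common.
    common = mt.filter_rows(mt.maf >= threshold)  # common: MAF >= T
    rare = mt.filter_rows(mt.maf < threshold)     # rare:   MAF <  T
    return common, rare
```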

annacuomo commented 7 months ago

@katiedelange

Ideally r2 and bp window would be parameters you can adjust. Why did you choose these values in the first instance (NB they seem fine to me), and can you add that information to the README?

I seem to have just used the parameters from the method's description, but I have added turning these into args to the TODO list.

Also, a bit random, but I just ran into this issue highlighting how long this step takes in Hail compared to PLINK, which we have also observed. cc @MattWellie
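For reference, a sketch of what exposing those values as args could look like; r2=0.2 and bp_window_size=1_000_000 are hl.ld_prune's documented defaults, which seem to be the values in question:

```python
import hail as hl

def ld_prune_mt(mt: hl.MatrixTable, r2: float = 0.2, bp_window_size: int = 1_000_000) -> hl.MatrixTable:
    # ld_prune returns a Table of the variants to keep; filter the
    # MatrixTable down to those rows.
    pruned = hl.ld_prune(mt.GT, r2=r2, bp_window_size=bp_window_size)
    return mt.filter_rows(hl.is_defined(pruned[mt.row_key]))
```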

annacuomo commented 7 months ago

@katiedelange And finally, I think @MattWellie is better placed to respond to the checkpoint question; I definitely do not understand this well enough to have an opinion, but I'm happy to submit a PR to change the code / remove checkpoints if appropriate.

annacuomo commented 1 week ago

@katiedelange this PR: https://github.com/populationgenomics/saige-tenk10k/pull/144 should hopefully address your comments from here?