quansun98 / GAUDI

GAUDI: a penalized regression based PRS method designed specifically for admixed individuals

Run time and scalability #2

Open HelenYSLin opened 2 months ago

HelenYSLin commented 2 months ago

Hi, it's me again with some different questions. I've been running GAUDI on chromosome 22 with a 2-way African American sample set, but the training step is taking too long. It took me over a day to train with ~15K samples and ~15K variants using cloud computing, and I ended up canceling the job because I wasn't sure how long it would take. I then reduced the sample size from 15K to 1.8K, and it took about 11.5 hours to finish.

I'm wondering if there are any tips to reduce the run time. I tried setting --sparsity TRUE, but it stopped working at some point for unknown reasons. Would narrowing the range of --start-p-exp and --end-p-exp help? I'm mainly interested in knowing if GAUDI can scale up to run all chromosomes and how long it might take.

Thanks.

quansun98 commented 2 months ago

Thanks for the question. Yes, narrowing the range of p-values will reduce the number of grid points and thus the computational time. We are aware of GAUDI's computational bottleneck: both working with individual-level data and fitting fused lasso models are computationally heavy. An ad-hoc solution is to set a stringent p-value threshold to filter out less significant variants. We apologize for any inconvenience this may cause and are currently working on some model and code tweaks. We will post an update when they're ready. Thanks again for your interest in our method!
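For example, a minimal pre-filtering sketch in R (file names are placeholders; the ID and P columns match the --gwas-col-id / --gwas-col-p inputs GAUDI expects):

    library(data.table)

    # Placeholder file names; keep only variants below a chosen
    # significance threshold before handing the file to GAUDI.
    gwas <- fread("gwas_sumstats.txt")
    fwrite(gwas[P < 1e-3], "gwas_sumstats.filtered.txt", sep = "\t")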


HelenYSLin commented 2 months ago

Thanks for your reply. Restricting the p-value range to a small set of variants did help reduce the run time, and it looks feasible to apply genome-wide. My other question: would you suggest running GAUDI on all chromosomes together, looping over each chromosome, or running the chromosomes in parallel? If I run each chromosome separately, should I apply the p-values and parameters learned from each training chromosome to its corresponding testing chromosome, or should the parameters stay the same across the genome?

quansun98 commented 2 months ago

Thanks for the question. GAUDI was designed to fit a multi-variant regression model, so it works best when fit on all chromosomes together. In other words, all genome-wide variants within the chosen p-value range should be considered when fitting the model, to ensure better accuracy. Hope this helps!
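If your local-ancestry dosages are stored per chromosome, one way to assemble a genome-wide input is sketched below. The per-chromosome file names are assumptions, and the matching .rownames file should be handled the same way as the column names:

    library(Matrix)

    # Assumed per-chromosome naming; cbind the sample-by-variant
    # dosage matrices into one genome-wide matrix.
    mats <- lapply(1:22, function(chr)
        readMM(gzfile(sprintf("out_chr%d.la_dosage.mtx.gz", chr))))
    writeMM(do.call(cbind, mats), "out_allChr.la_dosage.mtx")

    # Concatenate the matching column-name files in the same order.
    cols <- unlist(lapply(1:22, function(chr)
        readLines(sprintf("out_chr%d.la_dosage.colnames", chr))))
    writeLines(cols, "out_allChr.la_dosage.colnames")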

HelenYSLin commented 1 month ago

Thanks so much for your consistently prompt replies! I've tried running GAUDI on the whole genome a couple of times. However, the R² in the last two runs dropped dramatically compared to the previous ones. In the last run, I noticed that only one or two of the trained SNPs matched the SNPs in the chunk files, as the messages printed during model testing included 200+ lines of:

"No fitted SNPs in partial PRS for file out_chr{a number}_chunk{a number}.la_dosage.tsv.gz"

I checked the RDS file and realized that only 2 SNPs (4 ancestry-specific predictors for the 2-way admixture) passed the best-model p-value threshold, which was 5e-06.
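For reference, I inspected it with something like this (the file name is hypothetical, based on my --out prefix):

    # str() shows the structure of whatever object GAUDI saved.
    model <- readRDS("./data/train_p3_8_model.RDS")
    str(model)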

Here is the code I used for training:

Rscript ./GAUDI/fit_cv_fused_lasso.R \
    --gaudi-path ./GAUDI \
    --gwas ./data/PGC_final_05_allChr.txt \
    --gwas-col-id ID --gwas-col-p P --gwas-log-p FALSE \
    --la ./data/out_allChr.la_dosage.mtx.gz \
    --col ./data/out_allChr.la_dosage.colnames \
    --row ./data/out_allChr.la_dosage.rownames \
    --pheno ./data/pheno.tsv --pheno-name is_case --pheno-iid person_id \
    --start-p-exp -3 --end-p-exp -8 \
    --seed 2024 --sparsity FALSE \
    --out ./data/train_p3_8_model

And here is the main output from the training step:

    Joining with by = join_by(IID)
    [1] "Reading in GWAS file."
    [1] "Done."
    [1] "Sparse Matrix found; using cosine similarity measure to remove highly similar columns."
    [1] 13770 7476
    [1] "Getting lambdas from initial genlasso call for gamma value 0"
    [1] "Done!"
    [1] "Fitting fold 1"
    [1] "Fitting fold 2"
    [1] "Fitting fold 3"
    [1] "Fitting fold 4"
    [1] "Fitting fold 5"
    [1] 13770 614
    [1] "Getting lambdas from initial genlasso call for gamma value 0"
    [1] "Done!"
    [1] "Fitting fold 1"
    [1] "Fitting fold 2"
    [1] "Fitting fold 3"
    [1] "Fitting fold 4"
    [1] "Fitting fold 5"
    [1] 13770 62
    [1] "Getting lambdas from initial genlasso call for gamma value 0"
    [1] "Done!"
    [1] "Fitting fold 1"
    [1] "Fitting fold 2"
    [1] "Fitting fold 3"
    [1] "Fitting fold 4"
    [1] "Fitting fold 5"
    [1] 13770 4
    [1] "Getting lambdas from initial genlasso call for gamma value 0"
    [1] "Done!"
    [1] "Fitting fold 1"
    [1] "Fitting fold 2"
    [1] "Fitting fold 3"
    [1] "Fitting fold 4"
    [1] "Fitting fold 5"
    [1] 13770 0
    [1] "Not enough variants achieve marginal significance threshold at p=5e-07."
    [1] 13770 0
    [1] "Not enough variants achieve marginal significance threshold at p=5e-08."
    There were 30 warnings (use warnings() to see them)
    [1] 0.0006396262
    best model CV R2: 0.00063962623337007
    best model p-value threshold: 5e-06

Is there anything wrong or unusual?

Additionally, my PRS distribution plot shows three peaks for cases and six peaks for controls. Could this be due to only 2 or 4 non-zero posterior effect sizes being used to calculate the PRS?

I also observed that when I increase the number of variants or samples in both model training and testing, the best-model p-value threshold tends to become stricter. I'm not sure if this is expected; in any case, the increased N does not improve PRS performance and can even produce odd distributions. I just wanted to check whether I did anything wrong or whether this is just the way it is. Thank you.

quansun98 commented 1 month ago

There may be multiple reasons for the low R². Please provide more information if possible: what differs between your last two runs and the previous ones? Were they different phenotypes? I noticed you may have a binary trait labeled "is_case". Please be advised that it would be more appropriate to first adjust for covariate effects (e.g., age, sex) with a logistic regression model and use the residuals as the input phenotype for GAUDI.
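Concretely, a rough sketch of that adjustment (the covariate names age and sex are assumptions about your phenotype file; is_case and person_id mirror your command above):

    # Regress the binary trait on covariates only, no genotypes.
    pheno <- read.delim("./data/pheno.tsv")
    fit <- glm(is_case ~ age + sex, data = pheno, family = binomial)

    # Carry the (deviance) residuals forward as a continuous phenotype.
    pheno$is_case_resid <- residuals(fit, type = "deviance")

    write.table(pheno[, c("person_id", "is_case_resid")],
                "./data/pheno_resid.tsv", sep = "\t",
                row.names = FALSE, quote = FALSE)

You would then point --pheno-name at the residual column in your next run.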

From the log file and the output RDS file, yes, it seems there were only 4 predictors (2 variants with 2 ancestries each) in the model. With such a small number, it's not surprising to get a peaked distribution, since only a few combinations of genotypes are possible. So the odd distributions with increased N may arise from the reduced number of predictors. But the relationship between sample size and the best-model p-value threshold should not be unidirectional; we did not observe any clear pattern in our experiments, and it may be phenotype-dependent. Again, please avoid directly using binary phenotypes.

An ad-hoc strategy to avoid having so few predictors is to set --end-p-exp to -5. More stringent p-values may leave too few variants and thus unstable models.
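For your command above, that would mean changing the grid flags to, for example:

    --start-p-exp -3 --end-p-exp -5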

Not sure if these help. Feel free to send more details!

JasonTan-code commented 1 month ago

Thanks, Quan, that's an interesting point! I would imagine we first regress out the covariates from both the dependent variable and the genotypes, following the Frisch–Waugh–Lovell theorem. This operation makes the original genotypes and phenotype continuous and controls for the known covariates. However, residualizing ancestry-specific genotype dosages seems like a non-trivial problem for two reasons: 1) how do we store the genotype residuals; and 2) the genotype is two-dimensional (one dosage for AFR, one for EUR), so how do we residualize a 2-D matrix?
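To make the second point concrete, a naive column-wise sketch (purely illustrative, not a GAUDI feature) might be:

    # Residualize each ancestry-specific dosage column on the
    # covariates, FWL-style.
    residualize <- function(dosage, covariates) {
        X <- cbind(1, as.matrix(covariates))  # intercept + covariates
        # lm.fit accepts a matrix response: one residual column per
        # AFR/EUR dosage column.
        lm.fit(X, as.matrix(dosage))$residuals
    }

But note the residuals are dense even when the dosage matrix is sparse, which is exactly the storage problem in point 1.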

quansun98 commented 1 month ago

Yes, that's a good point. You are completely right that the most rigorous approach is to regress out the covariates from both the dependent variable and the genotypes. But in the rich GWAS literature, a common practice is to regress out covariates from the dependent variable only, due to the high computational cost of repeating the procedure for every genetic variant genome-wide. This assumes there are no associations between the genetic variants and the covariates. For that reason we didn't consider regressing out covariates from the ancestry-specific dosages at all. Hope it makes sense!

HelenYSLin commented 1 month ago

I ran GAUDI genome-wide on the same binary phenotype three times. The first run used ~900 cases and ~900 controls, but I realized my MSP files were incomplete. In the second run I fixed the MSP files, so I had a few more variants, and I used the same samples. In the third run I performed further QC on the samples and included all controls (~800 cases and ~13,000 controls); the number of variants remained the same as in the second run. Otherwise, the pipeline was identical across all three runs.

I intended to ask about the inclusion of covariates earlier but forgot, so thank you for bringing it up. I didn't regress out the covariates before running GAUDI, but I will do so in the next run.

Thank you for your diagnosis of the poor PRS performance in my data. I am focusing on a psychiatric phenotype, which is known to have a large non-genetic component, so I'm not surprised that my best pseudo-R² so far is under 1%. I just want to make sure every mistake is fixed before running GAUDI again, since it usually takes a long time to run and we want to be budget-efficient.

Thanks again for your advice, both on preprocessing the binary phenotype and on setting less stringent p-value thresholds. I hope the results improve after these adjustments!