qlu-lab / PUMAS

Fine-tuning polygenic risk score models using GWAS summary statistics
MIT License
41 stars 11 forks source link

Issues and suggestions pumas.main #2

Closed rkarlssonlinner closed 3 years ago

rkarlssonlinner commented 5 years ago

Hi,

First an issue:

  1. The following piece of code in the pumas.main() function breaks the function as no P-value column is given as an input parameter (unless the column by accident is already called Pval): p.value=data.real$Pval p.value=p.value[order(p.value,decreasing=F)]

Overall, this input is not well-described in the wiki:

Then, we need to select a p-value. In our approach, we first order all the p-value in ascending order and select those p-values that is less than 0.0001

Then some suggestions:

  1. Currently, it is unclear whether the function requires a column with per-SNP sample sizes or simply a single sample size value. It appears that a single value is required. But theoretically, per-SNP should be preferred. Alternatively, the script could use max(N) of a user-supplied column of per-SNP sample-sizes so that the script does not throw an error when a per-SNP sample-size column is specified as input by the user.

  2. You could implement fread() instead of read.table() to speed things up.

  3. Unclear if the script requires MAF or effect-coded allele frequency (EAF)? Often irrelevant because of MAF(1-MAF)=EAF(1-EAF), but it is sometimes important with respect to adjustment of effect sizes etc. Please explain more clearly to the user what is actually requested.

  4. The function should allow gzipped files as input.

  5. The title of the plot is hard-coded in the function FunI.Plot(): main=paste0("T0030_pruned")

qlu-lab commented 5 years ago

Thank you very much for your feedback and suggestions!

We just fixed the issue of the function not being able to read header for p-value and adjusted the main title for the plot. Still, we will revise the function and wiki page soon and refine some details as well.

Regarding your suggestions:

  1. This is a great suggestion. We can confirm that so far the function only recognizes a number N and we planned to include the 'per-SNP sample-size' feature very soon. Instead of using max(N) for the user-supplied column of per-SNP sample-sizes, our framework can actually take into account per-SNP sample-sizes. Our previous analysis showed that either using max(N) or per-SNP sample-size as input will give robust estimates in terms of optimal p-value cutoff. If you are interested, the detailed results are under 'Some technical considerations' in our preprint.

  2. Yes we'll do that too. That would speed up our function.

  3. You are right, both MAF and EAF are acceptable, because what the method needs is per-SNP variance. We'll make changes in the tutorial/wiki to make this clear to users.

  4. Both read.table() and fread() should be able to read '.gz' files.

  5. Now we changed the title to be user-specified input_path. We will further refine the title soon.

Overall, these are great suggestions. We appreciate all of them and we'll make sure to implement all these features WITHIN THIS WEEK. Feel free to let us know if our comments make sense and any other feedbacks are welcome too!


Zijie Zhao