rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

SNP selection in step 1 #116

Closed AnnaFurtjes closed 3 years ago

AnnaFurtjes commented 3 years ago

Thank you for creating this great tool!

I have a question regarding SNP selection in step 1. I pass a SNP list containing 587,583 SNPs to regenie through the --extract option to indicate which SNPs to keep. It runs fine, but the log file indicates that it only considers 571,257 SNPs. Do you know why regenie excludes ~16,000 SNPs that survived my quality control? Does it perform checks on SNP IDs in the background?

Thanks so much for taking the time to read this!

joellembatchou commented 3 years ago

Hi,

Regenie does not perform additional QC checks for step 1; it only looks in the genotype file for variants whose IDs match those in the --extract file.

One way to check what is going on is to run the following (example for PLINK BED, where the 2nd column of the .bim file is the variant ID):

grep -wFf extract.file  <( cut -f2 geno.bim ) | wc -l

Here 'geno.bim' belongs to the set of PLINK files you pass to Regenie, and 'extract.file' is the file you pass to --extract in Regenie. For PGEN you would use the 3rd column of the .pvar file, and for BGEN you could use a .bgi index file to get the variant IDs.

Does that show "587,583" or "571,257"?
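
If the count comes back lower than the number of lines in the --extract file, it can help to list the IDs that are not found rather than only counting matches. The following is a minimal sketch, not part of Regenie: `missing_ids` and `duplicate_ids` are hypothetical helper names, and it assumes one variant ID per line in 'extract.file' and a whitespace-delimited 'geno.bim' with the variant ID in the 2nd column.

```shell
# Hypothetical helper: print IDs from the extract file that do not appear
# in column 2 of the .bim file (exact, whole-ID comparison).
missing_ids() {
    extract_file=$1
    bim_file=$2
    # While reading the .bim (first file argument), remember every variant ID.
    # While reading the extract file, print each non-empty ID never seen.
    awk 'FILENAME == ARGV[1] { seen[$2] = 1; next }
         NF && !($1 in seen) { print $1 }' "$bim_file" "$extract_file"
}

# Hypothetical helper: print IDs that occur more than once in the extract
# file, since duplicates can also make the two counts disagree.
duplicate_ids() {
    sort "$1" | uniq -d
}
```

For example, `missing_ids extract.file geno.bim | wc -l` gives the number of unmatched IDs, and `missing_ids extract.file geno.bim | head` shows a few of them. With the numbers in this thread, and assuming no duplicates, the gap would be 587,583 - 571,257 = 16,326 IDs. Looking at the unmatched IDs directly often reveals the cause, such as a different ID naming scheme or variants dropped in an earlier filtering step.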

AnnaFurtjes commented 3 years ago

Hi, thanks so much for getting back to me about this!

It does indeed show 571,257. I will look into why my geno.bim file is missing those SNPs. Thanks!