pjgreer / ukb-rap-tools

Scripts and workflows for use analyzing UK Biobank data from the DNANexus Research Analysis Platform

Some question in WES #7

Closed wojiaolianer closed 1 year ago

wojiaolianer commented 1 year ago

Hello Phil Greer, I have a question about your pipeline. In regenie step 1, the input is a PLINK file filtered to high-quality variants, while step 2 takes the unprocessed PLINK file. Does this mean the two steps use different variants, and does that have any impact on the results? Many papers do not describe their methods in detail, so I am unsure and would appreciate your help.

Also, I get an error when running the burden test on UKB RAP: "Detected 1 masks with unknown annotations". I found someone with the same problem on GitHub. Following the instructions there, I have already converted the masks file to LF (Unix) line breaks and confirmed that all masks are present in the anno file, but I still get the same error. I have tried dozens of times, and it fails every time. What's more, I can run the test data on Linux servers successfully, so I don't know what went wrong.
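The two checks described above (line endings and annotation labels) can be scripted before re-running the job. The sketch below is only an illustration: it assumes a whitespace-delimited annotation file with three columns (variant, gene, annotation label) and a mask file with two columns (mask name, comma-separated labels). The function names are hypothetical and are not part of regenie or of this repository.

```python
from pathlib import Path


def has_crlf(path):
    """Return True if the file contains any carriage return (Windows line endings)."""
    return b"\r" in Path(path).read_bytes()


def read_annotation_labels(anno_path):
    """Collect the labels found in the third column of the annotation file."""
    labels = set()
    for line in Path(anno_path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) >= 3:
            labels.add(fields[2])
    return labels


def find_unknown_annotations(anno_path, mask_path):
    """Map each mask name to the labels it requests that the annotation file lacks."""
    known = read_annotation_labels(anno_path)
    unknown = {}
    for line in Path(mask_path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mask_name, requested = fields[0], fields[1].split(",")
        missing = [label for label in requested if label not in known]
        if missing:
            unknown[mask_name] = missing
    return unknown
```

An empty dictionary from `find_unknown_annotations` and `False` from `has_crlf` on both files would suggest the problem lies elsewhere, for example in how the files were uploaded to the platform.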

pjgreer commented 1 year ago

For a Regenie analysis, the first step is to create a smaller (~500K SNPs) genome-wide dataset from the hard genotype data. This file will generate a "block structure" that will be used in the larger genome-wide analysis. These blocks are fed as a covariate of the linear model in stage 2. The fact that the markers differ between the two stages is intentional. Specifically, the data from stage 1 is a subset of the entire imputed dataset, while a portion of the stage 1 dataset intersects with the WES dataset.

I will look into the rare variant analysis.
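To make the two-stage layout described above concrete, here is a minimal sketch that only assembles the two regenie command lines as argument lists, without running anything. All file names are hypothetical placeholders, `--bt` assumes a binary trait, and the block sizes and the `--aaf-bins` value are illustrative choices rather than the settings used in this repository. The point is that step 1 reads the QC-filtered array genotypes, while step 2 reads the WES data and picks up the step 1 output through `--pred`.

```python
def step1_command(array_prefix, qc_snplist, pheno, covar, out_prefix):
    """Step 1: fit the whole-genome model on a QC-filtered subset of array SNPs."""
    return [
        "regenie", "--step", "1",
        "--bed", array_prefix,        # hard-called genotype data (PLINK bed/bim/fam prefix)
        "--extract", qc_snplist,      # high-quality SNP list produced by QC
        "--phenoFile", pheno,
        "--covarFile", covar,
        "--bt",                       # assumption: binary trait
        "--bsize", "1000",
        "--lowmem",
        "--out", out_prefix,
    ]


def step2_burden_command(wes_prefix, step1_prefix, pheno, covar,
                         anno, set_list, masks, out_prefix):
    """Step 2: gene-based tests on WES variants, using the step 1 predictions."""
    return [
        "regenie", "--step", "2",
        "--bed", wes_prefix,                    # WES data, no MAF floor applied
        "--pred", step1_prefix + "_pred.list",  # prediction list written by step 1
        "--phenoFile", pheno,
        "--covarFile", covar,
        "--bt",
        "--anno-file", anno,
        "--set-list", set_list,
        "--mask-def", masks,
        "--aaf-bins", "0.01",                   # illustrative upper allele-frequency bin
        "--bsize", "200",
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    cmd1 = step1_command("array_qc", "qc_pass.snplist", "pheno.txt", "covar.txt", "step1_out")
    cmd2 = step2_burden_command("wes_chr1", "step1_out", "pheno.txt", "covar.txt",
                                "wes.anno", "wes.setlist", "wes.masks", "step2_chr1")
    print(" ".join(cmd1))
    print(" ".join(cmd2))
```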

pjgreer commented 1 year ago

Did you happen to just upload the reflat38.zip file, or did you uncompress the zip file and upload all the internal gz files?

I rewrote the instructions to say that you have to uncompress the .zip file and upload the 24 .gz files. rvtest cannot load the .zip file on its own.

I ask because it is running just fine for me, and I realized that I skipped a step in the readme file.
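The uncompress step mentioned above could be done locally along these lines. This is a minimal sketch using only the Python standard library; the function name and the output directory are hypothetical, and the archive layout is assumed to be a plain .zip containing .gz members. The extracted files would then still need to be uploaded to the project, for example with the DNAnexus `dx upload` command.

```python
import zipfile
from pathlib import Path


def extract_gz_members(zip_path, out_dir):
    """Extract every .gz member of a zip archive and return the extracted paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".gz"):
                # extract() returns the path of the file it wrote
                extracted.append(Path(archive.extract(name, path=out_dir)))
    return sorted(extracted)


if __name__ == "__main__":
    for path in extract_gz_members("reflat38.zip", "reflat38_unzipped"):
        print(path)
```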

wojiaolianer commented 1 year ago

Thanks. I recently tested several step 1 datasets filtered with different parameters to see how they affect the step 2 results. I found that the rare variant association results vary widely between the filter parameters --maf 0.01 and --maf 0.0001. I wonder whether there are appropriate parameters for rare variant analysis.

pjgreer commented 1 year ago

I ran my rvtest scripts and everything is working on my end, so I think I may just have written poor instructions. Please let me know if uploading the 23 *.gz files fixed that error.

For rare variant analysis, you do not want a MAF cutoff when prepping your WES data. In a rare variant analysis, you are counting the number of rare variants in a gene in cases vs controls. Filtering your analysis data with a MAF cutoff will reduce the total number of rare variants that you can analyze. You only need to filter WES with a MAF (and maf-max) cutoff when performing GWAS.

You do not use a MAF filter for QC on the dataset used in the STEP 2 section of REGENIE when you plan to run rare variant analysis. You only filter the Step2 data based on MAF if you are running a GWAS.
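The effect of a MAF floor on a burden count can be seen in a toy example. The sketch below is an illustration only: the data are made up, the function names are hypothetical, and the "burden" is simply the number of samples carrying at least one qualifying rare allele in a gene, which is a simplification of what regenie or rvtest actually compute. The `prefilter_min_maf` argument stands in for a GWAS-style minimum-MAF filter applied while preparing the data.

```python
def minor_allele_freq(genotypes):
    """MAF of one variant, given alt-allele counts (0, 1 or 2) per sample."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def burden_counts(variants, is_case, max_maf=0.01, prefilter_min_maf=0.0):
    """Count (case, control) samples carrying at least one qualifying rare allele.

    variants: list of per-variant genotype lists, all for the same gene.
    prefilter_min_maf: MAF floor applied during data prep (0.0 means no floor).
    """
    kept = [
        g for g in variants
        if prefilter_min_maf <= minor_allele_freq(g) <= max_maf
        and minor_allele_freq(g) > 0
    ]
    cases = controls = 0
    for sample, case in enumerate(is_case):
        if any(g[sample] > 0 for g in kept):
            if case:
                cases += 1
            else:
                controls += 1
    return cases, controls


def make_variant(n_samples, carriers):
    """Build a genotype list with heterozygous carriers at the given sample indices."""
    genotypes = [0] * n_samples
    for index in carriers:
        genotypes[index] = 1
    return genotypes


if __name__ == "__main__":
    n = 200
    is_case = [i < 100 for i in range(n)]       # first 100 samples are cases
    gene = [
        make_variant(n, [0]),                   # singleton in a case
        make_variant(n, [1, 2]),                # doubleton in cases
        make_variant(n, [150]),                 # singleton in a control
        make_variant(n, range(60, 100)),        # common variant, MAF 0.1
    ]
    print(burden_counts(gene, is_case))                          # no MAF floor
    print(burden_counts(gene, is_case, prefilter_min_maf=0.01))  # with a MAF floor
```

With these toy data, the MAF floor discards all three rare variants before the burden is computed, and the common variant that survives the floor is then excluded by the rare-variant upper bound, so no carriers are left to count.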

pjgreer commented 1 year ago

I am closing this thread