opain / eQTL_to_TWAS

Methods for converting eQTL summary statistics into TWAS SNP-weights
GNU General Public License v3.0
13 stars 4 forks source link

Error: No variants remaining after --extract. Error in fread(paste0(opt$output, "/", gene_i, "/ref/ref_chr", chr_i, : #10

Closed 1667857557 closed 1 year ago

1667857557 commented 1 year ago

Hi Oliver,

We encountered an unexpected issue when using the local file and the original code from your post. We split the full summary cis-eQTL dataset by the gene names and ran it one by one. It works well in most genes, but in some genes, it reports an error like below, Can you please provide me with some advice on how to address this issue? Thanks in advance.

Huang

_Analysis started at 2023-09-22 10:07:37 Sumstats contain 3894 variant-gene associations. Sumstats contain 3894 unique variants. Sumstats contain 1 unique genes After extracting variants, sumstats contain 669 variant-gene associations. After extracting variants, sumstats contain 669 unique variants. After filtering variants with MAF < 1%, sumstats contain 669 variant-gene associations. ref_keep used to subset reference genotype data. PLINK v1.90b7 64-bit (16 Jan 2023) www.cog-genomics.org/plink/1.9/ (C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to ./eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/ref/ref_chr12.log. Options in effect: --bfile ./Trans-Phar/1000G_EUR_Phase3_plink/1000G.EUR.QC.12 --extract ./eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/ref/ref.snplist --keep ./EUR_1KG_phase3_samples.tsv --make-bed --out ./eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/ref/ref_chr12

7944 MB RAM detected; reserving 3972 MB for main workspace. 480110 variants loaded from .bim file. 489 people (0 males, 0 females, 489 ambiguous) loaded from .fam. Ambiguous sex IDs written to ./eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/ref/ref_chr12.nosex . Loading required package: bigstatsr Error: No variants remaining after --extract. Error in fread(paste0(opt$output, "/", gene_i, "/ref/ref_chr", chr_i, : File './eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/ref/ref_chr12.bim' does not exist or is non-readable. getwd()=='/mnt/g/linux' In addition: Warning messages: 1: In dir.create(paste0(opt$output, "/", gene_i, "/ref"), recursive = T) : './eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/ref' already exists 2: In dir.create(paste0(opt$output, "/", gene_i, "/SBayesR")) : './eQTL_to_TWAS/eQTL_to_TWAS/ENSG00000166535.txt/ENSG00000166535/SBayesR' already exists 3: In fread(paste0(opt$output, "/", gene_i, "/SBayesR/", genei, ".SBayesR.parRes")) : Detected 2 column names but the data has 3 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file. Execution halted

ENSG00000166535.txt

opain commented 1 year ago

Hello,

Thank you for reaching out and providing a clear description of the error.

This error occurs when there are no variants in common between the sumstats for a given gene and the reference genetic data. This perplexed me also - It appears there are chunks of genetic variants missing within the 1KG reference data, despite being within the hapmap3 snplist. The gene that you sent in your message appears to be within one of these missing chunks. This should not happen often as there are only ~5000 hapmap3 SNPs that are not present in the 1KG reference (in the version I am using at least).

In any case, I have just pushed an update (https://github.com/opain/eQTL_to_TWAS/commit/dff88a88d6bf030ff4bdac2ffac0fa281210e756) to the compute_weights script so that this error is caught, an appropriate message is printed to the screen, and then the gene is skipped.

Let me know if this doesn't resolve the issue for you.

Ollie

1667857557 commented 1 year ago

Hi Oliver,

Thank you for your reply. After correcting the code, it works well. However, we encountered a new unexpected error while computing the weights for some genes. The error seems to occur during the execution of the 'ldpred2' mode. Below is the error report and the relevant file. Could you please provide some suggestions on how to address it? Thanks in advance!

Huang

_Reading PLINK FAM file from [./eQTL_to_TWAS/ENSG00000172007/ref/ref_chr4.fam] 489 individuals to be included from reference FAM file. Reading PLINK BIM file from [./eQTL_to_TWAS/ENSG00000172007/ref/ref_chr4.bim] Calculating MAF of reference panel ... 770 SNPs to be included from reference BIM file. Reading summary data of small effect SNPs from [./eQTL_to_TWAS/ENSG00000172007/DBSLMM/s_ENSG00000172007.txt] Number of allele discrepency: 0 Number of maf discrepency: 0 After filtering, 768 small effect SNPs are selected. Reading summary data of large effect SNPs from [./eQTL_to_TWAS/ENSG00000172007/DBSLMM/l_ENSG00000172007.txt] Number of allele discrepency: 0 Number of maf discrepency: 0 After filtering, 1 large effect SNPs are selected. Fitting model... Fitting time: 0.332623 seconds. 777 variants to be matched. 0 ambiguous SNPs have been removed. 698 variants have been matched; 0 were flipped and 202 were reversed. Loading required package: foreach Loading required package: rngtools 690 variants to be matched. 0 ambiguous SNPs have been removed. 689 variants have been matched; 0 were flipped and 490 were reversed. Error: Incompatibility between dimensions. 'ind.col' and 'rows_along(A.col)' should have the same length. In addition: There were 32 warnings (use warnings() to see them) Execution halted_ ENSG00000172007.txt

opain commented 1 year ago

Hello Huang,

Sorry there are more errors!

Unfortunately I am unable to replicate the error - I have run the analysis using your sumstats and it worked as expected.

The numbers of SNPs in the output are slightly different, so I think the reference data we are using is slightly different. Can you send me the plink files you are specifying for the --plink_ref_chr parameter. I just need the .bed/.bim/.fam for chromosome 4?

Alternatively, you can open R, create the opt list, input your command line parameters, and then run the code interactively (section by section) to help understand the source of the error. The error appears to be occurring due to this line https://github.com/opain/eQTL_to_TWAS/blob/3f9a903126278391325c922277eb4c8c7632c2f3/compute_weights.R#L674

Best wishes,

Ollie

opain commented 1 year ago

Actually, don't worry. I have just found another copy of the 1KG reference data that reproduces the error. I will look into the cause now.

opain commented 1 year ago

Hello Huang,

The error was being caused by variants in the eQTL sumstats that were not in your plink reference files, but were in the ldpred2 reference. Sorry for that!

I have just pushed an update to the main branch which fixes the issues, by restricting the eQTL sumstats to variants that are present in the plink reference files.

Note. while working on this issue, I updated my version of bigsnpr, and realised they have introduced an update to their ldpred2 auto function that was leading to convergence issues when restricted to a single gene. I have also fixed this, but you should rerun the code for any genes that completed previously.

Let me know if you have any other issues and I will try to resolve quickly. Thank you for your feedback.

Ollie

1667857557 commented 1 year ago

Sorry for the late reply! Thanks for your advice, and it does help me!