jianyangqt / gcta

GCTA software
GNU General Public License v3.0

Error in mtCOJO: there are too many SNPs that have large difference in allele frequency #53

Open statarat opened 1 year ago

statarat commented 1 year ago

Dear developers,

When running mtCOJO, I get an error that I have not been able to resolve.

My code:

Accepted options:
--bfile /home/all_snps
--mtcojo-file /home/mtCOJO/mtcojo_summary_data.list.txt
--ref-ld-chr /home/ldsc/eur_w_ld_chr/
--w-ld-chr /homeldsc/eur_w_ld_chr/
--out /home/CT_householdincome_mtcojo_results

Reading PLINK FAM file from [/home/all_snps.fam].
362132 individuals to be included from [/home/all_snps.fam].
Reading PLINK BIM file from [/home/all_snps.bim].
8213384 SNPs to be included from [/home/all_snps.bim].

Reading GWAS summary data from [/home/mtCOJO/mtcojo_summary_data.list.txt] ...
13312905 SNPs in common between the target trait and the covariate trait(s).
Filtering out SNPs with multiple alleles or missing value ...
5185860 SNPs have missing value or mismatched alleles. These SNPs have been saved in [/home/mtCOJO/CT_householdincome_mtcojo_results.badsnps].
8127045 SNPs are retained after filtering.
There are 8946 genome-wide significant SNPs with p < 5.0e-08.

Reading PLINK BED file from [/home/all_snps.bed] in SNP-major format ...
Genotype data for 362132 individuals and 8946 SNPs to be included from [/home/all_snps.bed].
Calculating allele frequencies ...
Checking the difference in allele frequency between the GWAS summary datasets and the LD reference sample...
3630142 SNP(s) have large difference of allele frequency between the GWAS summary data and the reference sample. These SNPs have been saved in [/home/mtCOJO/CT_householdincome_mtcojo_results.freq.badsnps].
Error: there are too many SNPs that have large difference in allele frequency. Please check the GWAS summary data.
An error occurs, please check the options or data

How should I proceed to solve the issue?

anglixue commented 1 year ago

Hi, it seems that about 3.6 million SNPs have a large difference in allele frequency between your GWAS input and your LD reference. The most likely reason is that the allele order (A1 and A2) in your GWAS summary data is flipped. Please double-check that the 'freq' column corresponds to your 'A1' column.
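One quick way to tell a flipped allele coding from a genuine frequency difference is to compare each SNP's GWAS frequency f with the reference frequency, and also 1 - f with the reference frequency. The sketch below is only an illustration, not GCTA's own check: the function names are hypothetical, the frequency pairs are made-up values, and the 0.2 tolerance is an assumption (it matches the default of GCTA's --diff-freq option as far as I know).

```python
def classify_freq(gwas_freq, ref_freq, tol=0.2):
    """Label one SNP from its A1 frequency in the GWAS file and in the reference.

    'ok'      : the two frequencies agree within tol
    'flipped' : 1 - gwas_freq agrees with ref_freq, which suggests that freq
                was reported for A2 instead of A1 (or A1/A2 are swapped)
    'other'   : neither agrees, so flipping alone does not explain the gap
    """
    if abs(gwas_freq - ref_freq) <= tol:
        return "ok"
    if abs((1.0 - gwas_freq) - ref_freq) <= tol:
        return "flipped"
    return "other"


def summarize(pairs, tol=0.2):
    """Count the labels over an iterable of (gwas_freq, ref_freq) pairs."""
    counts = {"ok": 0, "flipped": 0, "other": 0}
    for gwas_freq, ref_freq in pairs:
        counts[classify_freq(gwas_freq, ref_freq, tol)] += 1
    return counts


# Made-up (gwas_freq, ref_freq) pairs, for illustration only.
example = [(0.10, 0.12), (0.90, 0.11), (0.85, 0.20), (0.50, 0.95)]
print(summarize(example))
```

If most of the SNPs listed in the *.freq.badsnps file come out as 'flipped', the frequency column very likely refers to the wrong allele in one of the summary files.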

statarat commented 1 year ago

Thanks a lot for your reply @anglixue !

I have two sets of GWAS summary statistics.

For dataset 1:
ref --> reference allele
alt --> alternative allele and effect allele

Here freq corresponds to the alt allele.

For dataset 2:
A1: effect allele
A2: non-effect allele

I calculated the allele frequencies in PLINK and then merged the SNPs with these frequencies: --bfile /home/all_snps --freq --out /homemtCOJO/allele_freq

Then I did the following in R:

anglixue commented 1 year ago

Hi, mtCOJO accepts GWAS summary data in the GCTA-COJO format, so the 'freq' column should always be the allele frequency of A1.

I would suggest you reformat your two GWAS summary stats to strictly follow the GCTA-COJO format and re-run the program.
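For reference, the GCTA-COJO format has the columns SNP, A1, A2, freq, b, se, p and N, where A1 is the effect allele and freq is the frequency of A1. Below is a minimal sketch of the conversion for a single record. The input keys (snp, effect_allele, other_allele, and so on) and the example values are hypothetical, so map them to whatever your own files actually contain.

```python
COJO_HEADER = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def to_cojo_row(rec, freq_is_for_effect_allele=True):
    """Convert one summary-statistics record (a dict of strings) to COJO order.

    rec keys are hypothetical: snp, effect_allele, other_allele, freq,
    beta, se, p, n. If freq was reported for the non-effect allele,
    pass freq_is_for_effect_allele=False so that it is converted to 1 - freq.
    """
    freq = float(rec["freq"])
    if not freq_is_for_effect_allele:
        freq = 1.0 - freq
    return [
        rec["snp"],
        rec["effect_allele"].upper(),  # A1 = effect allele
        rec["other_allele"].upper(),   # A2 = the other allele
        f"{freq:.6g}",                 # frequency of A1
        rec["beta"],
        rec["se"],
        rec["p"],
        rec["n"],
    ]


# Made-up record: alt is the effect allele and freq refers to alt,
# as described for dataset 1 above.
record = {
    "snp": "rs0000001", "effect_allele": "a", "other_allele": "G",
    "freq": "0.8", "beta": "0.01", "se": "0.002", "p": "1e-6", "n": "100000",
}
print(" ".join(COJO_HEADER))
print(" ".join(to_cojo_row(record)))
```

Note that the sign of b must refer to the same allele as A1. If you swap A1 and A2 instead of converting the frequency, the effect size has to change sign as well.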

statarat commented 1 year ago

Thanks again @anglixue. I followed your suggestion and reformatted the alleles again. However, I still get the same error message. Any suggestions?

anglixue commented 1 year ago

If both of your GWAS summary datasets follow the correct format, then something is wrong with the data itself: either your CT or your SES summary has a completely reversed allele order.

The most straightforward way to check is to look at the *.freq.badsnps file, take the first SNP in it, and extract that SNP from both the CT and the SES summary files. You can post these two lines here so we can diagnose the problem.
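A minimal sketch of that lookup is shown here. It assumes, without having checked the exact layout of the *.freq.badsnps file, that the SNP ID is the first whitespace-separated field on each line and that any header line starts with "SNP". The function names, file names and rows in the demo are made up.

```python
import os
import tempfile


def first_snp_id(badsnps_path):
    """Return the first SNP ID in the file (assumed to be the first field)."""
    with open(badsnps_path) as fh:
        for line in fh:
            fields = line.split()
            # Skip blank lines and an optional header starting with "SNP".
            if fields and fields[0].upper() != "SNP":
                return fields[0]
    return None


def find_snp_line(summary_path, snp_id):
    """Return the first line of a summary file whose first field is snp_id."""
    with open(summary_path) as fh:
        for line in fh:
            fields = line.split()
            if fields and fields[0] == snp_id:
                return line.rstrip("\n")
    return None


# Demo on tiny made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        "demo.freq.badsnps": "rs0000002 A G\n",
        "ct.txt": "SNP A1 A2 freq b se p N\nrs0000002 A G 0.85 0.01 0.002 1e-6 100000\n",
        "ses.txt": "SNP A1 A2 freq b se p N\nrs0000002 A G 0.15 0.02 0.004 1e-3 90000\n",
    }
    for name, text in contents.items():
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(text)

    snp = first_snp_id(os.path.join(tmp, "demo.freq.badsnps"))
    ct_line = find_snp_line(os.path.join(tmp, "ct.txt"), snp)
    ses_line = find_snp_line(os.path.join(tmp, "ses.txt"), snp)

print(snp)
print(ct_line)
print(ses_line)
```

In the made-up rows, the same A1 allele has a frequency of 0.85 in one file and 0.15 in the other, which is the pattern to look for when one dataset has reversed alleles.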

Zhangzzzzzy commented 3 months ago

Hi, I have a question: where did you get the reference files required for --bfile? I don't know how to set up this step. I hope to get your answer, thanks!

anglixue commented 3 months ago

@Zhangzzzzzy You can download the genotype data from the 1000 Genomes (1000G) reference website.