statgen / raremetal

A flexible tool for meta-analysis
8 stars 9 forks source link

Raremetal filter out all the variants #31

Open stefanucci-luca opened 3 years ago

stefanucci-luca commented 3 years ago

Dear raremetal developers,

I really appreciate the work you are doing to develop and maintain raremetal. I am reaching out because I need some help to understand raremetal behaviour.

Recently, I am trying to run rare metal on a set of variants (~50) on 3 genes. The cohort has about 130,000 individuals. I am using the commands I pasted below. The single variants association runs ok, and I get the score and cov files. When I run raremetal burden and skat (but I also tried the other weighted approaches), all the variants are filtered out; see log file below.

Do you have any suggestion on how to fix this behaviour?

Any suggestion is much appreciated,

Best,

Luca

PED

1000150 1000150 0 0 0 0 1 30 67 1
1000184 1000184 0 0 0 0 1 39 42 1
1000200 1000200 0 0 0 0 0 24 42 0
1000261 1000261 0 0 0 0 1 25 49 2

DAT

T Trait
C SEX
C BMI
C AGE
C SMOKE

VCF

chr3    93874276        chr3:93874276:G:A       G       A       50      .       AF=2e-06;AQ=50;AC=1;AN=262044   GT:DP:AD:GQ:PL:RNC      0/0:16:16 .........
chr3    93874282        chr3:93874282:G:A       G       A       47      .       AF=1.2e-05;AQ=47;AC=4;AN=262044 GT:DP:AD:GQ:PL:RNC      0/0:16:16 ..........

Commands

raremetalworker --ped VTE.ped --dat ${out_dir}/${gn_name}_variopath_burden.dat --vcf ${out_dir}/${gn_name}_variopath_variants.vcf.gz \
        --traitName Trait --inverseNormal --makeResiduals --prefix ${out_dir}/${gn_name}_rare_metalworker \
        --dominant --geneMap ${reflat}

and

raremetal --summaryFiles ${out_dir}/score_file --covFiles ${out_dir}/cov_file \
--geneMap ${reflat} --groupFile ${group} \
--SKAT  --burden --longOutput --altMAF --tabulateHits --prefix ${out_dir}/burden_test_cumulative 

Raremetalworker output score

##ProgramName=RareMetalWorker
##Version=4.15.1
##Samples=131022
##AnalyzedSamples=130458
##Families=131022
##AnalyzedFamilies=130458
##Founders=131022
##MakeResiduals=True
##AnalyzedFounders=130458
##Covariates=SEX,BMI,AGE,SMOKE
##CovariateSummaries    min 25th    median  75th    max mean    variance
##SEX   0   0   0   1   1   0.455488    0.248021
##BMI   12  24  26  29  68  26.8007 22.4515
##AGE   38  50  58  63  72  56.5656 64.4416
##SMOKE 0   0   0   1   2   0.546168    0.436638
##InverseNormal=ON
##TraitSummaries    min 25th    median  75th    max mean    variance
##Trait 0   0   0   0   1   0.0440832   0.0421402
## - NullModelEstimates
## - Name   BetaHat SE(BetaHat)
## - Intercept  -7.04406e-08    0.00276861
##AnalyzedTrait -4.47433    -0.674924   0.000758956 0.674972    4.47433 -7.04406e-08    0.999996
##Sigma_e2_Hat  0.999988
#CHROM  POS REF ALT N_INFORMATIVE   FOUNDER_AF  ALL_AF  INFORMATIVE_ALT_AC  CALL_RATE   HWE_PVALUE  N_REF   N_HET   N_ALT   U_STAT  SQRT_V_STAT ALT_EFFSIZE PVALUE
chr3    93874276    G   A   130458  0.0000038   0.0000038   1   1   1   130457  1   0   0.0556638   10.0556635  0.95561

Raremetalworker output cov

##ProgramName=RareMetalWorker
##Version=4.15.1
#CHROM  CURRENT_POS MARKERS_IN_WINDOW   COV_MATRICES
chr3    93874276    93874276,93874282,93874387,93877004,93877074,93877089,93877159,93879213,93879279,93879306,93879315,93884755,93884868,93884869,93884886,93884889,93884899,93884905,93886407,93892993,93893024,93893025,93893082,93893108,93893139,93896594,93896595,93898500,93900875,93905828,93905889,93906016,93906059,93906104,93910618,93910664,93910672,93910681,93910696,93910697,93910706,93927245,93927251,93927284,93927336,93927362,93927363,93973667,93973753,   7.66533e-06,-2.3503e-10,-4.11303e-10,-2.9379e-10,-3.4668e-09,-4.40692e-09,-9.40265e-10,-1.35143e-09,-1.17518e-10,-1.17517e-10,-5.87589e-11,-5.8758e-11,-2.35036e-10,-1.17519e-10,-7.05101e-10,-2.87934e-09,-1.5865e-09,-1.17519e-10,-1.17515e-10,-8.8137e-10,-5.87585e-11,-5.87594e-11,-5.87585e-11,-4.70075e-10,-1.76294e-10,-2.93813e-10,-4.70107e-10,-5.87589e-11,-5.8758e-11,-6.46378e-10,-8.81498e-10,-1.17515e-10,-8.1673e-09,-5.87576e-11,-5.87765e-11,-2.93801e-10,-2.76169e-09,-1.28687e-08,-1.17539e-10,-5.87639e-11,-5.87657e-11,-4.17179e-09,-1.05765e-09,-3.52564e-10,-2.23279e-09,-2.3503e-10,-5.87576e-11,-3.93685e-09,-5.87576e-11,

Group file

SERPINC1    chr1:173903903:G:T chr1:173903969:G:T chr1:173903973:G:C chr1:173903996:T:C chr1:173904038:C:A chr1:173904044:C:T chr1:173909648:G:A chr1:173909665:A:T chr1:173909791:G:T chr1:173909819:C:G chr1:173909858:T:C chr1:173909900:C:T chr1:173910797:T:C chr1:173910831:G:A chr1:173910861:T:C chr1:173911851:T:C chr1:173911852:G:A chr1:173911918:A:G chr1:173911941:C:T chr1:173911942:G:A chr1:173911974:T:G chr1:173912015:C:A chr1:173914611:G:A chr1:173914659:G:C chr1:173914668:A:G chr1:173914696:G:A chr1:173914725:C:T chr1:173914726:G:A chr1:173914743:G:A chr1:173914788:G:A chr1:173914795:G:A chr1:173914818:G:T chr1:173914872:A:T 

log output raremetal

The following parameters are in effect:

List of Studies:
============================
--summaryFiles [./test_dir/score_file]
--covFiles [./test_dir/cov_file]

Grouping Methods:
============================
--groupFile [./test_dir/variant_group_file.txt]
--annotatedVcf []
--annotation []
--writeVcf [OFF]

QC Options:
============================
--hwe [1e-05]
--callRate [0.95]

Association Methods:
============================
--burden [true]
--MB [false]
--SKAT [true]
--VT [false]
--condition []

Other Options:
============================
--tabix [false]
--correctGC [false]
--prefix [./test_dir/burden_test_cumulative]
--maf [0.05]
--longOutput [true]
--tabulateHits [true]
--dosage [OFF]
--hitsCutoff [1e-06]
--altMAF [ON]
Analysis started at: Thu Apr  8 12:13:51 2021

Warning: variant 2:127426090 from study ./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz is skipped because of duplicate records in the same study.

Warning: 1 variants are skipped from analysis due to duplicate records in study ./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz. Please check log for details.

Warning: variant chr1:173903903:G:T is excluded from group SERPINC1 for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173903969:G:T is excluded from group SERPINC1 for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173903973:G:C is excluded from group SERPINC1 for the following reasons:monomorphic or failed QC.
...
Warning: variant chr1:173914668:A:G is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914696:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914725:C:T is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914726:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914743:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914788:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914795:G:A is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914818:G:T is excluded from group Triplet for the following reasons:monomorphic or failed QC.
Warning: variant chr1:173914872:A:T is excluded from group Triplet for the following reasons:monomorphic or failed QC.

Checking if all groups are analyzed...
Warning: the following groups has no qualifed variants to group and are skipped:
    SERPINC1, PROC, PROS1, Triplet, 
    done!
jonathonl commented 3 years ago

Is it possible that you are mixing b37 and b38 datasets when meta-analyzing? Or mixing chromosome names (chr1 vs 1) between datasets?

stefanucci-luca commented 3 years ago

Hi Jonathon,

Thanks for your quick reply.

All my data are in b38 and use chr* format for chromosomes. Is there any part of the script that sources the chromosome information somewhere else? I am not using the --geneMap option and the standard path (i.e. ../reflath19) won't work.

If this is the problem won't also reflect in the single variant association?

Thanks

Luca

jonathonl commented 3 years ago

So to clarify, when you say that "single variant association" works and gene-level does not, do you mean that raremetalworker works and raremetal does not? Or are you saying that running raremetal with --useExact works and running raremetal with --burden, --SKAT, etc. does not? I'm assuming you mean the former.

My earlier suspicion was that one of the studies in ./test_dir/score_file and ./test_dir/cov_file was in b38 and another was in b37, which would prevent raremetal from finding overlapping variants between the two studies.

What are the contents of ./test_dir/score_file? Can you verify that there are overlapping variants between the files in ./test_dir/score_file and that those variants overlap with the those in your group file?

stefanucci-luca commented 3 years ago

Yes, that raremetalworker works and raremetal does not.

score file has the path to the files

./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.score.txt.gz
./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz
./test_dir/PROS1_rare_metalworker.Trait.singlevar.score.txt.gz

similarly cov_file

./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.cov.txt.gz
./test_dir/PROC_rare_metalworker.Trait.singlevar.cov.txt.gz
./test_dir/PROS1_rare_metalworker.Trait.singlevar.cov.txt.gz

The variants for /test_dir/score_file overlap with the one in group file.

for instance, the variant in ./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.score.txt.gz are:

#CHROM  POS REF ALT
chr1    173903903   G   T
chr1    173903969   G   T
chr1    173903973   G   C
chr1    173903996   T   C
chr1    173904038   C   A
chr1    173904044   C   T
chr1    173909648   G   A
chr1    173909665   A   T
chr1    173909791   G   T
chr1    173909819   C   G
chr1    173909858   T   C
chr1    173909900   C   T
chr1    173910797   T   C
chr1    173910831   G   A
chr1    173910861   T   C
chr1    173911851   T   C
chr1    173911852   G   A
chr1    173911918   A   G
chr1    173911941   C   T
chr1    173911942   G   A
chr1    173911974   T   G
chr1    173912015   C   A
chr1    173914611   G   A
chr1    173914659   G   C
chr1    173914668   A   G
chr1    173914696   G   A
chr1    173914725   C   T
chr1    173914726   G   A
chr1    173914743   G   A
chr1    173914788   G   A
chr1    173914795   G   A
chr1    173914818   G   T
chr1    173914872   A   T

the variants in the group file for that gene are:

SERPINC1    chr1:173903903:G:T chr1:173903969:G:T chr1:173903973:G:C chr1:173903996:T:C chr1:173904038:C:A chr1:173904044:C:T chr1:173909648:G:A chr1:173909665:A:T chr1:173909791:G:T chr1:173909819:C:G chr1:173909858:T:C chr1:173909900:C:T chr1:173910797:T:C chr1:173910831:G:A chr1:173910861:T:C chr1:173911851:T:C chr1:173911852:G:A chr1:173911918:A:G chr1:173911941:C:T chr1:173911942:G:A chr1:173911974:T:G chr1:173912015:C:A chr1:173914611:G:A chr1:173914659:G:C chr1:173914668:A:G chr1:173914696:G:A chr1:173914725:C:T chr1:173914726:G:A chr1:173914743:G:A chr1:173914788:G:A chr1:173914795:G:A chr1:173914818:G:T chr1:173914872:A:T 
jonathonl commented 3 years ago

I see. The names of those score and cov files suggest that you have run raremetalworker on three different genes (presumably using the same samples/individuals). Raremetal expects that you run raremetalworker on two or more different sets of individuals for the same region of the genome (or entire genome). It then meta-analyzes the results of each sample set.

Does this make sense? Or am I misunderstanding what's going on?

stefanucci-luca commented 3 years ago

Thanks for pointing that out! I changed the workflow. Now, raremetalworker is performing one association with all my variants of interest producing only one score and cov file. However, raremetal still have the same problem. Any further suggestion?

jonathonl commented 3 years ago

I'm not sure that Raremetal will run for a single study, i.e., you would need more than one score file and more than one cov file. I suspect that Raremetal is not the tool you should be using. Raremetal is meant for meta-analysis of multiple studies that either have study-specific covariates and/or consent constraints that prevent the studies from being analyzed together. I'm not sure exactly what you are trying to achieve (polygenic analysis, gene-wise analysis accounting for hidden relatedness, etc.), but there are likely other tools better suited for your needs.

stefanucci-luca commented 3 years ago

I'd like to perform a burden test. Collapse all the effect sizes of variants that are within the same gene and have a 'cumulative' effect size. Isn't that possible with raremetal?

The problem is that is filtering out all the variants. What are the reasons for raremetal to filter out variants?

jonathonl commented 3 years ago

Raremetal will do burden tests if you have multiple datasets (sets of individuals), but it sounds like you only have one. You should look into EPACTS (https://genome.sph.umich.edu/wiki/EPACTS#Gene-wise_or_group-wise_tests) if your sample size is modest or if your sample size is large but you don't want to account for hidden relatedness. Otherwise, SAIGE-GENE (https://github.com/weizhouUMICH/SAIGE) will do burden tests for sample sizes in the hundreds of thousands and account for hidden relatedness.

stefanucci-luca commented 3 years ago

I'll look into these other softwares, thank you. Do you know why variants are filtered out then?

welchr commented 3 years ago

What happens if you run without --dominant?

stefanucci-luca commented 3 years ago

Dear Ryan,

Thanks for your suggestion. without --dominant in raremetalworker, raremetal return this error:

FATAL ERROR - 
No p-value available for calculating Genomic Control. Check if markers were filtered!

The log file is similar, all the variants are filtered out and it ends with:

Checking if all groups are analyzed...
Warning: the following groups has no qualifed variants to group and are skipped:
    SERPINC1, PROC, PROS1, Triplet,
welchr commented 3 years ago

For the variants that are filtered out when running the burden test (such as chr1:173903903:G:T), do they have a non-missing score stat and p-value in the single variant file produced by raremetalworker?

Also, as Jonathon mentions, there are other options for single study rare variant tests, such as SAIGE-GENE, GMMAT, and GENESIS. I think SAIGE-GENE might be the only one that supports very unbalanced case/control ratios (?)

stefanucci-luca commented 3 years ago

Hi Ryan,

Yes, majority of the variants get score and p-value in raremetalworker, then they are filtered anyway by raremetal. If possible, I'd prefer to fix this problem because for a collaboration and other analysis we are using raremetal and I'd like to keep it consistent.

For instance:

output raremetalworker score

chr3    93927245    C   T   131022  0.000270947 0.000270947 71  1   1   130951  71  0   21.1172 19.7996 0.0538671   0.286177
chr3    93927251    G   A   131022  6.86918e-05 6.86918e-05 18  0.999985    1   131002  18  0   2.29883 9.97128 0.0231209   0.817668
chr3    93927284    T   G   131022  2.28981e-05 2.28981e-05 6   0.999947    1   131009  6   0   -3.03879    5.75719 -0.0916813  0.59762
chr3    93927329    C   T   131022  0   0   0   1   1   131022  0   0   NA  NA  NA  NA
chr3    93927336    T   C   131022  0.000145014 0.000145014 38  1   1   130984  38  0   37.8297 14.4868 0.180254    0.00901951
chr3    93927362    C   T   131022  1.52646e-05 1.52646e-05 4   1   1   131018  4   0   9.38911 4.70076 0.424902    0.045786
chr3    93927363    G   A   131022  3.81615e-06 3.81615e-06 1   1   1   131021  1   0   10.9084 2.35041 1.97459 3.46589e-06
chr3    93927365    C   A   131022  3.81615e-06 3.81615e-06 1   1   1   131021  1   0   -0.506439   2.35041 -0.0916729  0.829402
chr3    93973667    T   C   131022  0.000255688 0.000255688 67  0.999977    1   130952  67  0   57.3926 19.2341 0.155137    0.00284595
chr3    93973753    G   C   131022  3.81615e-06 3.81615e-06 1   1   1   131021  1   0   -0.506439   2.35041 -0.0916729  0.829402

Raremetal log file

Warning: variant chr3:93927245:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927251:G:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927284:T:G is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927329:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927336:T:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927362:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927363:G:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927365:C:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93973667:T:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93973753:G:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
welchr commented 3 years ago

Huh, I'm at a loss there. Is your data confidential? Could you maybe send me the data for one of those genes? That way we can run it in the debugger and see what's going on.

stefanucci-luca commented 3 years ago

Hi Ryan,

It looks like that removing chr from the #CHROM field solved the problem. Now the majority of variants is included in the analysis.

I have a few another question, possibly, easier to solve now that we know where is the problem:

When I use --binary option I get this error:

ERROR: 3rd phenotype 2.0133 exists. Is that binary trait?
MESSAGE: This script is to calcualte odds ratio of variants based on parameter estimates from LMM generated by raremetalworker4.13.4 or rvtests.
    --Shuang Feng, September 10, 2014

however, the 6th columns of the ped file is:

cat VTE.ped | cut -d' ' -f 6 | sort | uniq
0
1

I can use calculateOR.pl, but then would these values be used in the burden test? Can I swap the columns to let raremetal use the OR rather than effect sizes?

Best,

Luca

welchr commented 3 years ago

Does 2.0133 appear in your phenotype file anywhere if you grep for it? Does the command line still include --inverseNormal or --makeResiduals? Could be your binary phenotype is being transformed.

I would be cautious using --binary as it doesn't seem to be documented anywhere on the wiki, and the code contains:

https://github.com/statgen/raremetal/blob/2c82cfc5710dbd9fd56ef67a7ca5f74772d4e70d/raremetalworker/src/FastFit.h#L71-L73

So my guess is the original authors didn't intend for it to be used.

My understanding of calculateOR.pl is that it takes the single variant effect size estimates previously generated by assuming your trait is quantitative (linear model), and converts them roughly to what they would have been if you had fit a logistic model instead. From what I can tell, it appears to be doing the following:

image

But unfortunately there is no written explanation that I can find on the wiki or in the script.

I believe the burden test only looks at the score statistic, covariance, and possibly allele frequencies (weights). So it will likely not take into account your updated effect size estimates. Actually the effects are just calculated from the score stat and variance, and written out to the file.

At the end of the day, for a binary trait, probably best to use software designed (and tested) to correctly handle them.