Open stefanucci-luca opened 3 years ago
Is it possible that you are mixing b37 and b38 datasets when meta-analyzing? Or mixing chromosome names (chr1 vs 1) between datasets?
Hi Jonathon,
Thanks for your quick reply.
All my data are in b38 and use chr* format for chromosomes.
Is there any part of the script that sources the chromosome information somewhere else? I am not using the --geneMap
option and the standard path (i.e. ../reflath19
) won't work.
If this is the problem won't also reflect in the single variant association?
Thanks
Luca
So to clarify, when you say that "single variant association" works and gene-level does not, do you mean that raremetalworker
works and raremetal
does not? Or are you saying that running raremetal
with --useExact
works and running raremetal
with --burden
, --SKAT
, etc. does not? I'm assuming you mean the former.
My earlier suspicion was that one of the studies in ./test_dir/score_file
and ./test_dir/cov_file
was in b38 and another was in b37, which would prevent raremetal
from finding overlapping variants between the two studies.
What are the contents of ./test_dir/score_file
? Can you verify that there are overlapping variants between the files in ./test_dir/score_file
and that those variants overlap with the those in your group file?
Yes, that raremetalworker
works and raremetal
does not.
score file has the path to the files
./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.score.txt.gz
./test_dir/PROC_rare_metalworker.Trait.singlevar.score.txt.gz
./test_dir/PROS1_rare_metalworker.Trait.singlevar.score.txt.gz
similarly cov_file
./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.cov.txt.gz
./test_dir/PROC_rare_metalworker.Trait.singlevar.cov.txt.gz
./test_dir/PROS1_rare_metalworker.Trait.singlevar.cov.txt.gz
The variants for /test_dir/score_file overlap with the one in group file.
for instance, the variant in ./test_dir/SERPINC1_rare_metalworker.Trait.singlevar.score.txt.gz
are:
#CHROM POS REF ALT
chr1 173903903 G T
chr1 173903969 G T
chr1 173903973 G C
chr1 173903996 T C
chr1 173904038 C A
chr1 173904044 C T
chr1 173909648 G A
chr1 173909665 A T
chr1 173909791 G T
chr1 173909819 C G
chr1 173909858 T C
chr1 173909900 C T
chr1 173910797 T C
chr1 173910831 G A
chr1 173910861 T C
chr1 173911851 T C
chr1 173911852 G A
chr1 173911918 A G
chr1 173911941 C T
chr1 173911942 G A
chr1 173911974 T G
chr1 173912015 C A
chr1 173914611 G A
chr1 173914659 G C
chr1 173914668 A G
chr1 173914696 G A
chr1 173914725 C T
chr1 173914726 G A
chr1 173914743 G A
chr1 173914788 G A
chr1 173914795 G A
chr1 173914818 G T
chr1 173914872 A T
the variants in the group file for that gene are:
SERPINC1 chr1:173903903:G:T chr1:173903969:G:T chr1:173903973:G:C chr1:173903996:T:C chr1:173904038:C:A chr1:173904044:C:T chr1:173909648:G:A chr1:173909665:A:T chr1:173909791:G:T chr1:173909819:C:G chr1:173909858:T:C chr1:173909900:C:T chr1:173910797:T:C chr1:173910831:G:A chr1:173910861:T:C chr1:173911851:T:C chr1:173911852:G:A chr1:173911918:A:G chr1:173911941:C:T chr1:173911942:G:A chr1:173911974:T:G chr1:173912015:C:A chr1:173914611:G:A chr1:173914659:G:C chr1:173914668:A:G chr1:173914696:G:A chr1:173914725:C:T chr1:173914726:G:A chr1:173914743:G:A chr1:173914788:G:A chr1:173914795:G:A chr1:173914818:G:T chr1:173914872:A:T
I see. The names of those score and cov files suggest that you have run raremetalworker
on three different genes (presumably using the same samples/individuals). Raremetal expects that you run raremetalworker
on two or more different sets of individuals for the same region of the genome (or entire genome). It then meta-analyzes the results of each sample set.
Does this make sense? Or am I misunderstanding what's going on?
Thanks for pointing that out! I changed the workflow. Now, raremetalworker is performing one association with all my variants of interest producing only one score and cov file. However, raremetal still have the same problem. Any further suggestion?
I'm not sure that Raremetal will run for a single study, i.e., you would need more than one score file and more than one cov file. I suspect that Raremetal is not the tool you should be using. Raremetal is meant for meta-analysis of multiple studies that either have study-specific covariates and/or consent constraints that prevent the studies from being analyzed together. I'm not sure exactly what you are trying to achieve (polygenic analysis, gene-wise analysis accounting for hidden relatedness, etc.), but there are likely other tools better suited for your needs.
I'd like to perform a burden test. Collapse all the effect sizes of variants that are within the same gene and have a 'cumulative' effect size. Isn't that possible with raremetal?
The problem is that is filtering out all the variants. What are the reasons for raremetal to filter out variants?
Raremetal will do burden tests if you have multiple datasets (sets of individuals), but it sounds like you only have one. You should look into EPACTS (https://genome.sph.umich.edu/wiki/EPACTS#Gene-wise_or_group-wise_tests) if your sample size is modest or if your sample size is large but you don't want to account for hidden relatedness. Otherwise, SAIGE-GENE (https://github.com/weizhouUMICH/SAIGE) will do burden tests for sample sizes in the hundreds of thousands and account for hidden relatedness.
I'll look into these other softwares, thank you. Do you know why variants are filtered out then?
What happens if you run without --dominant
?
Dear Ryan,
Thanks for your suggestion. without --dominant
in raremetalworker, raremetal return this error:
FATAL ERROR -
No p-value available for calculating Genomic Control. Check if markers were filtered!
The log file is similar, all the variants are filtered out and it ends with:
Checking if all groups are analyzed...
Warning: the following groups has no qualifed variants to group and are skipped:
SERPINC1, PROC, PROS1, Triplet,
For the variants that are filtered out when running the burden test (such as chr1:173903903:G:T), do they have a non-missing score stat and p-value in the single variant file produced by raremetalworker?
Also, as Jonathon mentions, there are other options for single study rare variant tests, such as SAIGE-GENE, GMMAT, and GENESIS. I think SAIGE-GENE might be the only one that supports very unbalanced case/control ratios (?)
Hi Ryan,
Yes, majority of the variants get score and p-value in raremetalworker, then they are filtered anyway by raremetal. If possible, I'd prefer to fix this problem because for a collaboration and other analysis we are using raremetal and I'd like to keep it consistent.
For instance:
chr3 93927245 C T 131022 0.000270947 0.000270947 71 1 1 130951 71 0 21.1172 19.7996 0.0538671 0.286177
chr3 93927251 G A 131022 6.86918e-05 6.86918e-05 18 0.999985 1 131002 18 0 2.29883 9.97128 0.0231209 0.817668
chr3 93927284 T G 131022 2.28981e-05 2.28981e-05 6 0.999947 1 131009 6 0 -3.03879 5.75719 -0.0916813 0.59762
chr3 93927329 C T 131022 0 0 0 1 1 131022 0 0 NA NA NA NA
chr3 93927336 T C 131022 0.000145014 0.000145014 38 1 1 130984 38 0 37.8297 14.4868 0.180254 0.00901951
chr3 93927362 C T 131022 1.52646e-05 1.52646e-05 4 1 1 131018 4 0 9.38911 4.70076 0.424902 0.045786
chr3 93927363 G A 131022 3.81615e-06 3.81615e-06 1 1 1 131021 1 0 10.9084 2.35041 1.97459 3.46589e-06
chr3 93927365 C A 131022 3.81615e-06 3.81615e-06 1 1 1 131021 1 0 -0.506439 2.35041 -0.0916729 0.829402
chr3 93973667 T C 131022 0.000255688 0.000255688 67 0.999977 1 130952 67 0 57.3926 19.2341 0.155137 0.00284595
chr3 93973753 G C 131022 3.81615e-06 3.81615e-06 1 1 1 131021 1 0 -0.506439 2.35041 -0.0916729 0.829402
Warning: variant chr3:93927245:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927251:G:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927284:T:G is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927329:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927336:T:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927362:C:T is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927363:G:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93927365:C:A is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93973667:T:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Warning: variant chr3:93973753:G:C is excluded from group PROS1 for the following reasons:monomorphic or failed QC.
Huh, I'm at a loss there. Is your data confidential? Could you maybe send me the data for one of those genes? That way we can run it in the debugger and see what's going on.
Hi Ryan,
It looks like that removing chr
from the #CHROM
field solved the problem. Now the majority of variants is included in the analysis.
I have a few another question, possibly, easier to solve now that we know where is the problem:
When I use --binary
option I get this error:
ERROR: 3rd phenotype 2.0133 exists. Is that binary trait?
MESSAGE: This script is to calcualte odds ratio of variants based on parameter estimates from LMM generated by raremetalworker4.13.4 or rvtests.
--Shuang Feng, September 10, 2014
however, the 6th columns of the ped file is:
cat VTE.ped | cut -d' ' -f 6 | sort | uniq
0
1
I can use calculateOR.pl, but then would these values be used in the burden test? Can I swap the columns to let raremetal use the OR rather than effect sizes?
Best,
Luca
Does 2.0133
appear in your phenotype file anywhere if you grep for it? Does the command line still include --inverseNormal
or --makeResiduals
? Could be your binary phenotype is being transformed.
I would be cautious using --binary
as it doesn't seem to be documented anywhere on the wiki, and the code contains:
So my guess is the original authors didn't intend for it to be used.
My understanding of calculateOR.pl
is that it takes the single variant effect size estimates previously generated by assuming your trait is quantitative (linear model), and converts them roughly to what they would have been if you had fit a logistic model instead. From what I can tell, it appears to be doing the following:
But unfortunately there is no written explanation that I can find on the wiki or in the script.
I believe the burden test only looks at the score statistic, covariance, and possibly allele frequencies (weights). So it will likely not take into account your updated effect size estimates. Actually the effects are just calculated from the score stat and variance, and written out to the file.
At the end of the day, for a binary trait, probably best to use software designed (and tested) to correctly handle them.
Dear raremetal developers,
I really appreciate the work you are doing to develop and maintain raremetal. I am reaching out because I need some help to understand raremetal behaviour.
Recently, I am trying to run rare metal on a set of variants (~50) on 3 genes. The cohort has about 130,000 individuals. I am using the commands I pasted below. The single variants association runs ok, and I get the score and cov files. When I run raremetal burden and skat (but I also tried the other weighted approaches), all the variants are filtered out; see log file below.
Do you have any suggestion on how to fix this behaviour?
Any suggestion is much appreciated,
Best,
Luca
PED
DAT
VCF
Commands
and
Raremetalworker output score
Raremetalworker output cov
Group file
log output raremetal