kircherlab / ReMM

ReMM score snakemake workflow
https://remm.bihealth.org
MIT License

Feature correlation #7

Closed visze closed 2 years ago

visze commented 2 years ago

Correlate features for ReMM v1.4 on

visze commented 2 years ago

The rule variant_generation_getFinalHg38 generates wrong positions for regions, because the region is stored in the identifier and not in the position:

see results/variant_generation/region_PRDM9/hg19/region_PRDM9.variants.liftover.success.pos.gz

visze commented 2 years ago

This works now! We get good correlations on the "identical" regions and also good correlations on the random 120K variants. Again, conservation (especially GERP) and FANTOM5 correlate less well (or poorly). The same holds for rareVar, but this might again be a calling problem... It will definitely be better when using gnomAD.

The rest seems to be pretty good and expected:

Here for HBB:

Feature_A   Spearman
CpGperGC    1
CpGperCpG   1
CpGobsExp   1
GCContent   1
priPhyloP   0.822592339
priPhastCons    0.805674731
verPhyloP   0.608850434
verPhastCons    0.749126154
mamPhyloP   0.544464214
mamPhastCons    0.660173394
EncH3K27Ac_v1_4 1
EncH3K4Me1_v1_4 1
EncH3K4Me3_v1_4 1
DnaseClusteredHyp   1
DnaseClusteredScore 1
Fantom5Perm 0.053155841
Fantom5Robust   1
GerpRS  0.049642469
GerpRSpv    0.052720806
rareVar 0.240687681
commonVar   0.950301924
fracRareCommon  0.802808521
dbVARCount_20211020 1
ISCApath_20211103   1
DGVCount_20200225   1
encRegTfbsClustered 0.916230407
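
For reference, a Spearman value like the ones in the table above is a Pearson correlation computed on ranks. This is only a minimal sketch in plain Python, assuming each feature is compared between the hg19 and hg38 annotation of the same variants. The function names and the example values are hypothetical and not taken from the workflow.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation as Pearson on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up feature values for the same five variants in both builds.
feature_hg19 = [0.1, 0.4, 0.4, 0.9, 0.2]
feature_hg38 = [0.2, 0.5, 0.3, 0.8, 0.1]
print(spearman(feature_hg19, feature_hg38))
```

In this sketch a feature that is constant in either build yields NaN, because the correlation is undefined in that case.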

Here on the random 120K:

Feature_A   Spearman
CpGperGC    0.545840058
CpGperCpG   0.700061713
CpGobsExp   0.826825632
GCContent   0.99943652
priPhyloP   0.784104142
priPhastCons    0.771535183
verPhyloP   0.625563755
verPhastCons    0.738359037
mamPhyloP   0.514096774
mamPhastCons    0.587509997
EncH3K27Ac_v1_4 0.986969132
EncH3K4Me1_v1_4 0.180135829
EncH3K4Me3_v1_4 0.086980665
DnaseClusteredHyp   0.575566493
DnaseClusteredScore 0.5713179
Fantom5Perm 0.054060585
Fantom5Robust   0.024932799
GerpRS  0.036443533
GerpRSpv    0.040130109
rareVar 0.300691947
commonVar   0.796447473
fracRareCommon  0.58842919
dbVARCount_20211020 0.995075961
ISCApath_20211103   0.992750371
DGVCount_20200225   0.985945513
encRegTfbsClustered 0.881248498

visze commented 2 years ago

I have reconsidered which data we should use for this. At first I thought of the 120K random variants and the 3 regions, but I think those are much better suited to score comparison than to feature comparison.

I think using all training data (hg38) and lifting this to hg19 might be the best option.

visze commented 2 years ago

Correlation on training v0.4, hg38 lifted over to hg19

Pearson:

[Figure: feature correlation, Pearson]

Spearman:

[Figure: feature correlation, Spearman]