Open jerome-f opened 9 months ago
Hi Jerome,
Thank you for your request. CADD was trained with MMSplice 1.0.1, and the scores are generated with this specific version. We therefore don't want to upgrade MMSplice, because the predictions, and with them the scores, might change. That would lead to disagreement with the precomputed whole-genome files. We also see that scores are sometimes totally off when a feature has different values due to versions/environments. We can keep your request open and update MMSplice with the next CADD release. Of course, you can use the newer version in your own code base, but you really have to ensure that the predictions on the whole genome are equal.
Best, Max
Max @visze
Thanks for the reply; I totally agree with that. I canceled the PR I had opened for this after realizing that any small change can affect the scores overall. But it would be great if the new CADD version could pick up the most recent MMSplice. Legacy dependencies such as concise and cython=0.29 (current Cython is 3.x) are going to affect code efficiency. Plus, the improvements in the new model would probably benefit the CADD score (I can't gauge this at my end). I am having trouble installing the scripts locally due to dependency conflicts (Snakemake with mamba fails at multiple points).
@visze also micromamba is able to solve the environment better.
@visze and @makirc The core models used in mmsplice don't look like they have changed between v1.0.1 and v2.4.0. Check here. What has changed is all the infrastructure around them. I am still having trouble installing the mmsplice environment on my cluster. I get a TypeError about a metaclass conflict, which is very cryptic to solve given that everything is outdated:
Using TensorFlow backend.
Traceback (most recent call last):
File "/CADD/src/scripts/lib/tools/MMSplice.py", line 6, in <module>
from mmsplice.vcf_dataloader import SplicingVCFDataloader
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/mmsplice/__init__.py", line 7, in <module>
from keras.models import load_model
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/keras/__init__.py", line 3, in <module>
from . import utils
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/keras/utils/__init__.py", line 6, in <module>
from . import conv_utils
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9, in <module>
from .. import backend as K
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/keras/backend/__init__.py", line 89, in <module>
from .tensorflow_backend import *
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 5, in <module>
import tensorflow as tf
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 63, in <module>
from tensorflow.python.framework.framework_lib import * # pylint: disable=redefined-builtin
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/tensorflow/python/framework/framework_lib.py", line 25, in <module>
from tensorflow.python.framework.ops import Graph
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 53, in <module>
from tensorflow.python.platform import app
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 24, in <module>
from tensorflow.python.platform import flags
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/tensorflow/python/platform/flags.py", line 25, in <module>
from absl.flags import * # pylint: disable=wildcard-import
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/absl/flags/__init__.py", line 35, in <module>
from absl.flags import _argument_parser
File "/CADD/envs/conda/d7f147b1cd6acba35d2bf7623abb9dd4_/lib/python3.6/site-packages/absl/flags/_argument_parser.py", line 82, in <module>
class ArgumentParser(Generic[_T], metaclass=_ArgumentParserCache):
TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases
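For anyone reading this traceback: as far as I can tell, the error arises because on Python 3.6 `typing.Generic` still has its own metaclass (`GenericMeta`, removed in Python 3.7), so a class that combines `Generic[_T]` with an unrelated custom metaclass cannot be created. That would mean the installed absl release targets a newer Python than the 3.6 in this environment. A minimal sketch of the mechanism, using hypothetical stand-in classes rather than the actual absl or typing code:

```python
class CacheMeta(type):
    """Stand-in for a custom metaclass such as absl's _ArgumentParserCache."""


class GenericLikeMeta(type):
    """Stand-in for the metaclass that typing.Generic had on Python 3.6."""


class GenericLike(metaclass=GenericLikeMeta):
    """Stand-in for Generic[_T]."""


def define_conflicting_class():
    """Try to build a class whose metaclass is unrelated to its base's metaclass."""
    try:
        class ArgumentParser(GenericLike, metaclass=CacheMeta):
            pass
    except TypeError as exc:
        return str(exc)
    return None


message = define_conflicting_class()
print(message)


# The conflict disappears once the metaclass derives from both metaclasses.
class CombinedMeta(CacheMeta, GenericLikeMeta):
    pass


class WorkingParser(GenericLike, metaclass=CombinedMeta):
    pass


print(type(WorkingParser).__name__)
```

In practice the fix is on the environment side (an absl version that still supports Python 3.6, or a newer Python), not in user code.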
Hi,
I randomly picked a set of splice/indel variants (n=10,000) genome-wide, ran CADD with mmsplice 2.4.0, and concurrently submitted the same set of variants for prediction on the server. Here are the plots comparing RawScore and PHRED_score between the two versions: _x used mmsplice 2.4.0, and _y is from the CADD prediction on the server.
The outliers above are not splice variants in the new version of MMSplice, hence they have NaN in their MMSp_ columns. So this essentially shows that the model's PHRED scale (which is relative) still holds for the updated version of MMSplice wherever variants are concordantly identified as splice variants. Recalibrating the genome-wide predictions to remove variants that are no longer splice variants should actually be a good thing? 1% of the variants had np.linalg.norm(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 1, i.e. an absolute PHRED difference > 1 between the two versions, and 0.57% (57 variants) had an absolute PHRED difference > 2. Stats below:
abs(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 0 : 766 (7.66%)
abs(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 1 : 100 (1.0%)
abs(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 2 : 57 (0.57%)
abs(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 3 : 39 (0.39%)
abs(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 4 : 30 (0.3%)
abs(PHRED_mmsplice_2.4.0 - PHRED_mmsplice_1.0.1) > 5 : 24 (0.24%)
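For reference, thresholded counts like the ones above can be computed along these lines. This is only a sketch: the function name `deviation_summary` and the toy PHRED values are made up for illustration, and the real comparison would read the two CADD output files instead.

```python
def deviation_summary(phred_new, phred_old, thresholds=(0, 1, 2, 3, 4, 5)):
    """Count variants whose absolute PHRED difference exceeds each threshold.

    Returns {threshold: (count, percent_of_all_variants)}.
    """
    if len(phred_new) != len(phred_old):
        raise ValueError("both score lists must cover the same variants")
    if not phred_new:
        raise ValueError("no variants to compare")

    # Element-wise absolute difference, one value per variant.
    deviations = [abs(new - old) for new, old in zip(phred_new, phred_old)]
    total = len(deviations)

    summary = {}
    for threshold in thresholds:
        count = sum(d > threshold for d in deviations)
        summary[threshold] = (count, 100.0 * count / total)
    return summary


# Made-up toy values, NOT taken from the real 10,000-variant comparison.
phred_new = [28.1, 25.6, 10.0, 15.2, 3.3]
phred_old = [34.0, 31.0, 10.0, 14.7, 3.3]

summary = deviation_summary(phred_new, phred_old)
for threshold, (count, percent) in summary.items():
    print(f"abs(diff) > {threshold} : {count} ({percent:.2f}%)")
```

Each variant is compared only with itself, so a plain element-wise absolute difference is enough here; no vector norm is needed.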
I focused on the 24 variants where the difference is > 5 and found that some of them are not even splice variants (which may be due to a hardware change? This variation is unlikely to be due to the mmsplice version, as both versions report NaN in the MMSp_ columns).
The remaining variants, their corresponding MMSplice scores in the two versions, and their deviations are shown below.
varid | mamPhyloP_x | MMSp_acceptorIntron_x | MMSp_acceptor_x | MMSp_exon_x | MMSp_donor_x | MMSp_donorIntron_x | RawScore_x | PHRED_x | mamPhyloP_y | MMSp_acceptorIntron_y | MMSp_acceptor_y | MMSp_exon_y | MMSp_donor_y | MMSp_donorIntron_y | RawScore_y | PHRED_y | deviation
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
10_28583504_CAGTA_C | 4.494 | NaN | NaN | NaN | NaN | NaN | 5.025839 | 28.100 | 4.494 | 0.000 | 0.000 | -0.090 | -9.066 | -0.111 | 6.152494 | 34.000 | 5.900
10_73692042_C_CTA | 2.166 | NaN | NaN | NaN | NaN | NaN | 4.525003 | 25.600 | 2.166 | 0.000 | 0.000 | 0.000 | -6.906 | -0.030 | 5.404265 | 31.000 | 5.400
10_95966268_ACTGC_TTGTTAATA | 1.406 | NaN | NaN | NaN | NaN | NaN | 4.801243 | 26.800 | 1.406 | 0.000 | 0.000 | -0.582 | 0.000 | 0.000 | 5.554649 | 32.000 | 5.200
17_6758911_AGAGATGGGGTCTGAGAGTTGGGGGACGAGGGTCC... | 0.393 | NaN | NaN | NaN | NaN | NaN | -0.099667 | 0.726 | 0.393 | 0.000 | 0.000 | -0.479 | -7.457 | 0.093 | 1.467171 | 13.990 | 13.264
18_53530568_ATGAACAGGATGATC_A | 4.494 | NaN | NaN | NaN | NaN | NaN | 6.600207 | 35.000 | 4.494 | 0.000 | 0.000 | -2.438 | 0.000 | 0.000 | 9.756225 | 43.000 | 8.000
20_20605411_GTCTAAAACCAACCAAC_TAG | 3.370 | NaN | NaN | NaN | NaN | NaN | 4.761063 | 26.600 | 3.370 | 0.897 | -6.046 | -0.121 | 0.000 | 0.000 | 5.698128 | 33.000 | 6.400
21_45282947_ACAT_TCCCATGTGCCCATCC | -0.801 | NaN | NaN | NaN | NaN | NaN | -0.364168 | 0.255 | -0.801 | -0.540 | -4.341 | 0.000 | 0.000 | 0.000 | 0.706265 | 7.439 | 7.184
22_21970405_ACCTGGGGGTGGGGAGTGG_A | 2.919 | NaN | NaN | NaN | NaN | NaN | 4.357136 | 25.000 | 2.919 | 0.000 | 0.000 | 0.002 | 0.000 | 0.000 | 5.766426 | 33.000 | 8.000
4_152628560_CACTGTAAAAAAAAA_C | 3.352 | NaN | NaN | NaN | NaN | NaN | 4.162059 | 24.500 | 3.352 | -0.552 | -7.442 | 0.000 | 0.000 | 0.000 | 5.747699 | 33.000 | 8.500
4_153350128_CAGTA_C | 4.437 | NaN | NaN | NaN | NaN | NaN | 4.593066 | 25.900 | 4.437 | 0.000 | 0.000 | 0.031 | -6.862 | -0.057 | 5.386978 | 31.000 | 5.100
7_6692598_TTACC_T | 0.866 | NaN | NaN | NaN | NaN | NaN | 3.457845 | 22.600 | 0.866 | 0.000 | 0.000 | -0.500 | -8.223 | 0.016 | 5.124712 | 28.700 | 6.100
7_17310107_CAAGAGCTTCTTTGATGGT_C | 4.481 | NaN | NaN | NaN | NaN | NaN | 5.062152 | 28.300 | 4.481 | 0.000 | 0.000 | -0.407 | -7.152 | -0.016 | 6.086781 | 34.000 | 5.700
7_103083064_AGG_A | 3.130 | NaN | NaN | NaN | NaN | NaN | 0.594483 | 6.402 | 3.130 | -0.213 | -7.557 | 0.018 | 0.000 | 0.000 | 2.110585 | 17.260 | 10.858
7_114942367_GGAAATCCTTCGGATGGTGAACTCATTAGAAGTA... | 4.396 | NaN | NaN | NaN | NaN | NaN | 4.452831 | 25.300 | 4.396 | 0.000 | 0.000 | 0.110 | -9.075 | -0.135 | 5.547729 | 32.000 | 6.700
8_10497971_T_TCTAAAGAACGAATATAAAAA | 2.558 | NaN | NaN | NaN | NaN | NaN | 5.038503 | 28.200 | 2.558 | 0.000 | -1.381 | -0.911 | 0.000 | 0.000 | 6.390746 | 34.000 | 5.800
8_73024812_TTTTCTTTAGCCTTTCAAAAGCAAA_T | 4.494 | NaN | NaN | NaN | NaN | NaN | 4.877491 | 27.200 | 4.494 | -0.023 | -6.939 | -0.019 | 0.000 | 0.000 | 5.906766 | 33.000 | 5.800
8_138720725_GGCAAAGTCCATACC_CG | 3.217 | NaN | NaN | NaN | NaN | NaN | 4.550096 | 25.700 | 3.217 | 0.000 | 0.000 | -0.032 | -7.067 | -0.177 | 5.474991 | 32.000 | 6.300
Most of the variants are novel, and all of them have a higher PHRED when using mmsplice==1.0.1. Variant "17_6758911_AGAGATGGGGTCTGAGAGTTGGGGGACGAGGGTCCAGTCCTCCCTGCAGGT_A" has the highest deviation, 13.2. It is a splice_donor variant according to VEP, but I am not sure why MMSplice 2.4.0 didn't score it. Another example is "7_103083064_AGG_A", a splice variant with a deviation > 10 that MMSplice 2.4.0 didn't score either. I am not sure why this is; it could be due to the version changes. But among the variants scored by both versions the deviation in PHRED is insignificant, showing that the upgrade to MMSplice 2.4.0 doesn't impact the CADD PHRED significantly.
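A small sketch of how variants that lost their MMSplice annotation could be flagged automatically. It assumes the merged table uses the _x (mmsplice 2.4.0) and _y (server, mmsplice 1.0.1) suffixes from the table above; the helper names and the two "hypothetical" rows are made up for illustration, while the first row reuses the values reported for 7_103083064_AGG_A.

```python
import math

MMSP_FEATURES = [
    "MMSp_acceptorIntron",
    "MMSp_acceptor",
    "MMSp_exon",
    "MMSp_donor",
    "MMSp_donorIntron",
]


def is_scored(row, suffix):
    """True if at least one MMSp_ column with this suffix is not NaN."""
    return any(not math.isnan(row[name + suffix]) for name in MMSP_FEATURES)


def lost_scores(rows, new_suffix="_x", old_suffix="_y"):
    """Return variant IDs scored by the old version but not by the new one."""
    return [
        row["varid"]
        for row in rows
        if is_scored(row, old_suffix) and not is_scored(row, new_suffix)
    ]


def make_row(varid, new_values, old_values):
    """Build one merged row from per-version MMSp_ values (illustrative helper)."""
    row = {"varid": varid}
    for name, new, old in zip(MMSP_FEATURES, new_values, old_values):
        row[name + "_x"] = new
        row[name + "_y"] = old
    return row


nan = float("nan")
rows = [
    # Values as reported in the table above.
    make_row("7_103083064_AGG_A", [nan] * 5, [-0.213, -7.557, 0.018, 0.0, 0.0]),
    # Hypothetical rows, only to exercise the other two cases.
    make_row("hypothetical_concordant", [0.0, 0.0, -0.5, -7.0, 0.0], [0.0, 0.0, -0.5, -7.0, 0.0]),
    make_row("hypothetical_non_splice", [nan] * 5, [nan] * 5),
]

lost = lost_scores(rows)
print(lost)
```

Separating these variants from the concordantly scored ones makes it easier to see how much of the PHRED deviation comes from missing MMSplice features rather than from changed predictions.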
I released a new CADD-scripts version, v1.7.1. Maybe you can try that one. It is now recommended to use apptainer/singularity: all environments are packed within a container, so no conda builds are needed (the container is 17 GB). You also now need snakemake 8.
I also updated the environments, so if you use mamba/conda instead, I hope you will not face the issues you had above.
Hi,
I am trying to install CADD-scripts in my local environment, and the legacy dependency of mmsplice 1.0.1 on concise is causing installation problems. Since mmsplice 2.x, the concise dependency has been integrated into the core API, and most of the predictions are 1:1 with the legacy API. Would it be possible to bump mmsplice to the most recent version?
I am trying to have local installation of CADD v1.7.