ipw012 / RIVER

R package for RIVER (RNA-Informed Variant Effect on Regulation)

NAs in feature matrix #2

Open jeinson opened 6 years ago

jeinson commented 6 years ago

When generating the feature matrix for RIVER, how do the developers handle cases where no variant near a particular gene has a CADD annotation for features such as TFBS or EncOCCombPVal? glmnet cannot handle NAs, but in my dataset 95% of genes have at least one missing feature annotation, so removing those rows would waste most of the data.

Ex:
                           cHmmTx cHmmTssBiv cHmmHet cHmmBivFlnk cHmmTxFlnk TFBS EncOCCombPVal
GTEX-111YS:ENSG00000007923  0.016          0       0           0      0.000   NA            NA
GTEX-117YW:ENSG00000007923  0.000          0       0           0      0.000   NA            NA
GTEX-1192X:ENSG00000007923  0.000          0       0           0      0.000   NA            NA
GTEX-11EM3:ENSG00000007923  0.000          0       0           0      0.008   NA            NA
GTEX-11EQ8:ENSG00000007923  0.000          0       0           0      0.000   NA            NA
GTEX-11EQ9:ENSG00000007923  0.016          0       0           0      0.000   NA            NA

ipw012 commented 6 years ago

NAs are especially common for ENCODE-derived annotations such as chromatin states and TFBS. In those cases we replaced each NA with the minimum value (0), which represents background. CADD does the same when imputing missing variant features.
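
For anyone landing here with the same problem, the following is a minimal base-R sketch of that replacement, not the code the RIVER authors actually ran. The helper name `impute_background` and the toy matrix (two rows copied from the example above) are illustrative assumptions. Whether 0 is the right background value for every column in your own matrix is also an assumption to check per feature.

```r
# Toy feature matrix copied from two rows of the example in this issue.
# Rows are individual:gene pairs, columns are CADD-derived features.
feat <- matrix(
  c(0.016, 0, 0, 0, 0.000, NA, NA,
    0.000, 0, 0, 0, 0.008, NA, NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("GTEX-111YS:ENSG00000007923", "GTEX-11EM3:ENSG00000007923"),
    c("cHmmTx", "cHmmTssBiv", "cHmmHet", "cHmmBivFlnk",
      "cHmmTxFlnk", "TFBS", "EncOCCombPVal")
  )
)

# Hypothetical helper (not part of RIVER): replace every NA with a fixed
# background value, leaving observed values untouched.
impute_background <- function(x, background = 0) {
  x[is.na(x)] <- background
  x
}

feat_imputed <- impute_background(feat)

# The matrix now has no NAs, so it can be passed on to glmnet-based fitting.
stopifnot(!anyNA(feat_imputed))
```

If some features need a different default, the same helper can be applied column by column with a per-feature `background` value instead of a single 0.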