'there shouldn't be any nulls' and 'feature from the model not in data' log messages

molgenis / capice

GNU Lesser General Public License v3.0

'there shouldn't be any nulls' and 'feature from the model not in data' log messages #24

Closed dennishendriksen closed 3 years ago

dennishendriksen commented 4 years ago

Running CAPICE on trio-filtered.vcf.gz using the CAPICE easybuild module on gearshift results in the log: cadd_capice.log.

The log file contains entries such as:

After imputation, there shouldn't be any nulls, but check below:

motifEName 46 0.96
GeneID 4 0.08
GeneName 4 0.08
CCDS 9 0.19
Intron 42 0.88
Exon 18 0.38

 False
Categorical variables 10
In total, there are 48 samples

Feature from the model not in data:  Alt_AA
Feature from the model not in data:  Alt_other
Feature from the model not in data:  PolyPhenCat_possibly_damaging
Feature from the model not in data:  Ref_CT
Feature from the model not in data:  Segway_R5
Feature from the model not in data:  Segway_TF0
Feature from the model not in data:  Type_INS
Feature from the model not in data:  nAA_V
(48, 131)

What do these messages mean? Do they indicate a problem?

dennishendriksen commented 4 years ago

@shuang1330 @joerivandervelde @SietsmaRJ do you have any thoughts on this?

shuang1330 commented 4 years ago

Hi,

I put a lot of print statements in the preprocessing steps... The "there shouldn't be any nulls" message is about imputation. I used to just look at the lines printed after it to see whether any columns that should have been imputed still contained nulls, so there is no automatic check. The "feature from the model not in data" message appears because, for this test dataset, certain levels of the categorical features do not occur, which is normal.

Best regards, Shuang
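
To illustrate the second message: a minimal sketch, assuming the preprocessing one-hot encodes categorical columns so that each level becomes a `<column>_<level>` feature. The thread does not show the actual CAPICE code; the `one_hot` helper, the example rows, and the `Type_SNV` and `Ref_A` names are hypothetical (only `Type_INS` and `Ref_CT` appear in the log above). Filling absent features with zeros is likewise an assumption about how such columns could be aligned with the model.

```python
def one_hot(rows, categorical_columns):
    """One-hot encode the given columns; only levels present in `rows` get a column."""
    encoded = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if column in categorical_columns:
                out[f"{column}_{value}"] = 1
            else:
                out[column] = value
        encoded.append(out)
    # Give every row the same set of keys, filling absent levels with 0.
    columns = sorted({name for row in encoded for name in row})
    return [{name: row.get(name, 0) for name in columns} for row in encoded]


# Hypothetical feature list of a trained model.
model_features = ["Ref_A", "Ref_CT", "Type_INS", "Type_SNV"]

# Hypothetical small dataset in which the levels "CT" and "INS" never occur.
rows = [{"Type": "SNV", "Ref": "A"}, {"Type": "SNV", "Ref": "A"}]

encoded = one_hot(rows, {"Type", "Ref"})

# Levels the model knows but the data lacks produce no column at all.
missing = [f for f in model_features if f not in encoded[0]]
for feature in missing:
    print("Feature from the model not in data: ", feature)

# Assumption: absent features are filled with 0 so the matrix matches the model.
aligned = [[row.get(f, 0) for f in model_features] for row in encoded]
```

With a small input file, many levels seen during training are simply absent, so a list of such messages is expected rather than alarming.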

SietsmaRJ commented 4 years ago

I can further confirm that the variables reported with null ratios are not used any further when predicting variants. The features listed under "Feature from the model not in data:" also remain unused. In my experience, neither message should indicate a problem.
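
Since the null report is currently checked by eye, an automatic check could look roughly like the following. This is only a sketch under stated assumptions: `null_report`, `nulls_in_used_features`, the example rows, and the `score` column are hypothetical and not taken from CAPICE. It reproduces the "column, null count, null ratio" layout of the log and then flags only nulls in columns the model actually uses.

```python
def null_report(rows, columns):
    """Return {column: (null_count, null_ratio)} for columns that contain None."""
    total = len(rows)
    report = {}
    for column in columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:
            report[column] = (nulls, round(nulls / total, 2))
    return report


def nulls_in_used_features(report, used_features):
    """Return the columns that still contain nulls AND are used by the model."""
    return sorted(set(report) & set(used_features))


# Made-up rows: annotation columns with nulls plus one numeric feature.
rows = [
    {"GeneName": None, "Exon": "3/10", "score": 0.1},
    {"GeneName": "GENE_A", "Exon": None, "score": 0.4},
    {"GeneName": "GENE_B", "Exon": None, "score": 0.2},
    {"GeneName": "GENE_C", "Exon": "1/2", "score": 0.9},
]

report = null_report(rows, ["GeneName", "Exon", "score"])
for column, (count, ratio) in report.items():
    print(column, count, ratio)

# Hypothetical: only "score" is used for prediction, so the remaining nulls are harmless.
problems = nulls_in_used_features(report, used_features=["score"])
print("Nulls in used features:", problems)
```

An empty `problems` list corresponds to the situation described in this thread: nulls remain, but only in columns that are not used for prediction.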