molgenis / capice

GNU Lesser General Public License v3.0

Preprocessor processes irrelevant training features #139

Closed SietsmaRJ closed 1 year ago

SietsmaRJ commented 1 year ago

Describe the bug

The preprocessor ignores train_features.json. During both train and predict it processes every "object" type feature in the supplied dataset, including features the model never uses, which costs unnecessary processing time and resources.

System information (command line)

Not applicable.

System information (web service)

Not applicable.

How to Reproduce

Steps to reproduce the behavior:

capice -v train -i /path/to/train.tsv.gz -m /path/to/train_features.json -o /path/to/out.pickle.dat
capice -v predict -i /path/to/predict.tsv.gz -m /path/to/model.pickle.dat -o /path/to/out.tsv.gz

Expected behavior

The preprocessor uses train_features.json to skip any feature that is present in the input dataset but not listed in train_features.json.
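
The expected filtering could look roughly like the sketch below. It is not CAPICE's actual code: the function names, the example columns and dtypes, and the assumed layout of train_features.json (a JSON list of names, or a JSON object keyed by feature name) are all hypothetical and only illustrate the idea of intersecting the "object" columns with the train features.

```python
import json


def load_train_features(json_text):
    """Return the set of feature names from train_features.json content.

    Assumption: the file is either a JSON list of names or a JSON object
    keyed by feature name; iterating either one yields the names.
    """
    return set(json.loads(json_text))


def features_to_preprocess(column_dtypes, train_features):
    """Keep only "object" columns that are also listed as train features."""
    return [
        name
        for name, dtype in column_dtypes.items()
        if dtype == "object" and name in train_features
    ]


# Hypothetical dataset description: column name -> dtype name.
column_dtypes = {
    "ref": "object",
    "alt": "object",
    "gnomAD": "object",       # in the input, but not a train feature
    "Exon": "object",         # in the input, but not a train feature
    "PolyPhenCat": "object",
    "cDNApos": "float64",     # train feature, but not categorical
}

train_features = load_train_features(
    '["ref", "alt", "PolyPhenCat", "cDNApos"]'
)
selected = features_to_preprocess(column_dtypes, train_features)
print(selected)  # ['ref', 'alt', 'PolyPhenCat']
```

With a pandas DataFrame `df`, the dtype mapping could be built with `df.dtypes.astype(str).to_dict()`. In this sketch, columns such as gnomAD and Exon would no longer be processed unless they are listed in train_features.json.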

Logs

2022-10-05 14:07:30     INFO: Preprocessor started.
2022-10-05 14:07:37     INFO: Training protocol, creating new categorical conversion identifiers.
2022-10-05 14:07:37     INFO: For feature: ref saved the following values: C, G, T, A, CT
2022-10-05 14:07:38     INFO: For feature: alt saved the following values: T, A, C, G, CT
2022-10-05 14:07:38     INFO: For feature: Allele saved the following values: A, T, G, C, -
2022-10-05 14:07:38     INFO: For feature: IMPACT saved the following values: LOW, MODIFIER, MODERATE, HIGH
2022-10-05 14:07:38     INFO: For feature: BIOTYPE saved the following values: protein_coding, lncRNA, RNase_MRP_RNA, misc_RNA, transcribed_pseudogene
2022-10-05 14:07:38     INFO: For feature: Exon saved the following values: 3/3, 2/2, 4/4, 1/1, 10/10
2022-10-05 14:07:38     INFO: For feature: Intron saved the following values: 2/2, 9/9, 8/9, 8/8, 3/5
2022-10-05 14:07:38     INFO: For feature: Codons saved the following values: gaC/gaT, gcC/gcT, gaG/gaA, ccC/ccT, aaC/aaT
2022-10-05 14:07:39     INFO: For feature: FLAGS saved the following values: cds_start_NF, cds_end_NF
2022-10-05 14:07:39     INFO: For feature: SpliceAI_pred_SYMBOL saved the following values: TTN, BRCA2, NF1, ATM, BRCA1
2022-10-05 14:07:39     INFO: For feature: gnomAD saved the following values: 18:55335787-55335787, 20:62046497-62046497, 12:21375307-21375307, 3:37067121-37067121, 7:82763559-82763559
2022-10-05 14:07:39     INFO: For feature: oAA saved the following values: L, A, R, P, S
2022-10-05 14:07:39     INFO: For feature: nAA saved the following values: L, S, X, A, T
2022-10-05 14:07:40     INFO: For feature: Type saved the following values: SNV, DEL, INS, DELINS
2022-10-05 14:07:40     INFO: For feature: PolyPhenCat saved the following values: benign, probably_damaging, possibly_damaging
2022-10-05 14:07:40     INFO: For feature: SIFTcat saved the following values: deleterious, tolerated
2022-10-05 14:07:41     INFO: Successfully preprocessed data.