The preprocessor does not take into account the train_features.json, so it processes all "object" type features within a supplied dataset for both train and predict, adding unnecessary processing time and resources.
The preprocessor should take train_features.json into account and skip features that are in an input dataset but not in train_features.json.
Logs
2022-10-05 14:07:30 INFO: Preprocessor started.
2022-10-05 14:07:37 INFO: Training protocol, creating new categorical conversion identifiers.
2022-10-05 14:07:37 INFO: For feature: ref saved the following values: C, G, T, A, CT
2022-10-05 14:07:38 INFO: For feature: alt saved the following values: T, A, C, G, CT
2022-10-05 14:07:38 INFO: For feature: Allele saved the following values: A, T, G, C, -
2022-10-05 14:07:38 INFO: For feature: IMPACT saved the following values: LOW, MODIFIER, MODERATE, HIGH
2022-10-05 14:07:38 INFO: For feature: BIOTYPE saved the following values: protein_coding, lncRNA, RNase_MRP_RNA, misc_RNA, transcribed_pseudogene
2022-10-05 14:07:38 INFO: For feature: Exon saved the following values: 3/3, 2/2, 4/4, 1/1, 10/10
2022-10-05 14:07:38 INFO: For feature: Intron saved the following values: 2/2, 9/9, 8/9, 8/8, 3/5
2022-10-05 14:07:38 INFO: For feature: Codons saved the following values: gaC/gaT, gcC/gcT, gaG/gaA, ccC/ccT, aaC/aaT
2022-10-05 14:07:39 INFO: For feature: FLAGS saved the following values: cds_start_NF, cds_end_NF
2022-10-05 14:07:39 INFO: For feature: SpliceAI_pred_SYMBOL saved the following values: TTN, BRCA2, NF1, ATM, BRCA1
2022-10-05 14:07:39 INFO: For feature: gnomAD saved the following values: 18:55335787-55335787, 20:62046497-62046497, 12:21375307-21375307, 3:37067121-37067121, 7:82763559-82763559
2022-10-05 14:07:39 INFO: For feature: oAA saved the following values: L, A, R, P, S
2022-10-05 14:07:39 INFO: For feature: nAA saved the following values: L, S, X, A, T
2022-10-05 14:07:40 INFO: For feature: Type saved the following values: SNV, DEL, INS, DELINS
2022-10-05 14:07:40 INFO: For feature: PolyPhenCat saved the following values: benign, probably_damaging, possibly_damaging
2022-10-05 14:07:40 INFO: For feature: SIFTcat saved the following values: deleterious, tolerated
2022-10-05 14:07:41 INFO: Successfully preprocessed data.
Describe the bug
The preprocessor does not take into account the train_features.json, so it processes all "object" type features within a supplied dataset for both train and predict, adding unnecessary processing time and resources.
System information (command line)
Not applicable.
System information (web service)
Not applicable.
How to Reproduce
Steps to reproduce the behavior:
capice -v train -i /path/to/train.tsv.gz -m /path/to/train_features.json -o /path/to/out.pickle.dat
capice -v predict -i /path/to/predict.tsv.gz -m /path/to/model.pickle.dat -o /path/to/out.tsv.gz
Expected behavior
The preprocessor takes into account the train_features.json to skip the features that are in an input dataset, but not in the train_features.json.
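A minimal sketch of the expected filtering, not the actual CAPICE implementation. It assumes that train_features.json is either a JSON object keyed by feature name or a JSON array of feature names, and that those names match the dataset column names exactly. The function names `feature_names_from_json` and `features_to_process` are hypothetical.

```python
import json
from typing import Dict, Iterable, List, Set


def feature_names_from_json(text: str) -> Set[str]:
    """Extract feature names from the JSON text of a train features file.

    Assumption: the top level is an object keyed by feature name,
    or an array of feature names.
    """
    content = json.loads(text)
    # Iterating a dict yields its keys; iterating a list yields its items.
    return set(content)


def features_to_process(
    column_dtypes: Dict[str, str], train_features: Iterable[str]
) -> List[str]:
    """Return the "object" type columns that are also train features.

    Columns that are absent from train_features are skipped, so they are
    never handed to the categorical conversion step.
    """
    wanted = set(train_features)
    return [
        name
        for name, dtype in column_dtypes.items()
        if dtype == "object" and name in wanted
    ]


if __name__ == "__main__":
    # Hypothetical train features and dataset columns, for illustration only.
    train_features = feature_names_from_json('{"ref": null, "alt": null}')
    dtypes = {"ref": "object", "alt": "object", "gnomAD": "object", "Exon": "object"}
    print(features_to_process(dtypes, train_features))
```

With a pandas DataFrame `df`, a mapping like `dtypes` could be built with `df.dtypes.astype(str).to_dict()`. In the illustration, gnomAD and Exon are "object" type columns but are not train features, so they would be skipped rather than processed and logged.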