Open Nahyeon203 opened 4 months ago
Hi,
I am sorry for the delay in my response.
1.
After investigation I could not reproduce the first issue.
When testing the same command line:
python arbock.py predict --model ./models/ds_model_no_pheno --input ~/test --gene_id_format Ensembl --gene_id_delim=, --prediction_output_folder ~/output --analysis_name test
with a file containing only:
ENSG00000165280,ENSG00000144136
I obtain this output without any error:
gene_name_A gene_name_B ens_id_A ens_id_B predicted_proba
VCP SLC20A1 ENSG00000165280 ENSG00000144136 0.9083598963667406
I also tried running the following explain command before the predict one, and it caused no problem either:
python arbock.py explain --model ./models/ds_model_no_pheno --input ENSG00000165280,ENSG00000144136 --gene_id_format Ensembl --gene_id_delim=, --prediction_output_folder ~/output --analysis_name test
Maybe the cache is corrupted somehow?
You could try adding the following parameter: --update_step_caches.
This will probably solve this issue. If not, could you please provide a minimal input to reproduce the error?
2.
The probability of 0.12 for this model is returned whenever no rules are matched in the decision set.
Quoting the Methods section of the paper:
" The model decision process was set up according to these criteria: (1) if a gene pair matches multiple rules in the decision set, the rule with the highest probability estimate is chosen; (2) If a gene pair does not match any of the rules, it is predicted as neutral with a probability estimate based on uncovered training instances. "
The probability estimate for the case (2) is based on this formula:
|uncovered_positive_instances| / (|uncovered_positive_instances| + |uncovered_negative_instances|)
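As a rough illustration of that formula, here is a minimal Python sketch. The function name and the counts are hypothetical: they are not taken from the arbock code base, and the actual numbers of uncovered training instances are not stated in this thread. The counts below were picked only so that the ratio comes out at 0.12.

```python
def default_probability(n_uncovered_pos: int, n_uncovered_neg: int) -> float:
    """Fallback probability for gene pairs that match no rule.

    Fraction of positive instances among the training instances
    that are not covered by any rule in the decision set.
    """
    total = n_uncovered_pos + n_uncovered_neg
    if total == 0:
        raise ValueError("no uncovered training instances to estimate from")
    return n_uncovered_pos / total


# Made-up counts, for illustration only (not the real training data).
print(default_probability(n_uncovered_pos=12, n_uncovered_neg=88))
```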
Of course, since there is an optimal threshold, a gene pair can match a rule (and receive the probability estimate associated with that rule) and still be predicted as neutral, if that probability is lower than the threshold. If there is no rule match at all, the pair is always predicted as neutral.
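To make the three possible outcomes concrete, here is a small Python sketch of the decision process as described above. It is not the actual arbock implementation: the `Rule` type, the `predict` function, the toy feature `x` and the rule probabilities are all hypothetical. Only the 0.12 fallback and the 0.788 threshold are values mentioned in this thread.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class Rule(NamedTuple):
    """Hypothetical rule: a match predicate plus a probability estimate."""
    name: str
    matches: Callable[[dict], bool]
    proba: float


def predict(
    features: dict,
    rules: Sequence[Rule],
    default_proba: float,
    threshold: float,
) -> Tuple[float, bool, Optional[str]]:
    """Return (probability, is_neutral, name of the rule used or None)."""
    matched = [rule for rule in rules if rule.matches(features)]
    if matched:
        # (1) Several rules may match: keep the one with the highest estimate.
        best = max(matched, key=lambda rule: rule.proba)
        proba, used = best.proba, best.name
    else:
        # (2) No rule matches: fall back to the default estimate.
        proba, used = default_proba, None
    # A pair is neutral when its probability is lower than the threshold.
    return proba, proba < threshold, used


# Toy rules on a single made-up feature "x" (illustration only).
rules = [
    Rule("strong_rule", lambda f: f["x"] > 1.0, 0.90),
    Rule("weak_rule", lambda f: f["x"] > 0.0, 0.50),
]

for x in (2.0, 0.5, -1.0):
    print(x, predict({"x": x}, rules, default_proba=0.12, threshold=0.788))
```

With these toy rules, the second input matches only `weak_rule`, so it gets that rule's probability but is still predicted as neutral, while the third input matches nothing and receives the fallback value.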
Hi, Thank you for your response.
I implemented the solution you suggested, and it worked perfectly! Using the --update_step_caches parameter resolved the problem.
Regarding your explanation, you mentioned that if there is no matching rule for the input gene pairs, the result would be 0.12. Does this imply that even if there are matching rules, the result could still be below 0.788? Could you perhaps provide an example of a gene pair that matches a rule but results in a value below 0.788?
Hi,
I am encountering two issues while using your tool and would appreciate your assistance:
The gene pair I used is VCP-SLC20A1. Could you please help me understand why this is happening?