Issues with --predict Option and Predicted Probability Values

Nahyeon203 commented 4 months ago

Hi,

I am encountering two issues while using your tool and would appreciate your assistance:

First, when I use the --explain option on a gene pair, I receive results as expected. However, when I use the --predict option on the same pair, an error occurs. I tried both HGNC and Ensembl for gene_id_format, but I got the following error either way:

(ARBOCK) nh4658@ubuntu-a7:~/tools/ARBOCK$ python arbock.py predict --model ./models/ds_model_no_pheno --input test --gene_id_format Ensembl --gene_id_delim=, --prediction_output_folder ~/ALS/outputs/ARBOCK/test --analysis_name test
INFO:arbock.utils.cache_utils:Getting data from cache at: /home/nh4658/tools/ARBOCK/caches/bock_graph
INFO:arbock.utils.cache_utils:Getting data from cache at: /home/nh4658/tools/ARBOCK/caches/bock_index
INFO:arbock.utils.cache_utils:Getting data from cache at: /home/nh4658/tools/ARBOCK/caches/bock_nomenclature
INFO:arbock.utils.parallelizer:Initializing multiprocessing parallelizer with 40 cores
INFO:arbock.path_traversal.metapath_extracter:Running metapath extracter on 2 entity pairs. [path_cutoff=3, excl_node_types={'OligogenicCombination', 'Disease'}]
INFO:arbock.utils.cache_utils:Getting data from cache at: /home/nh4658/tools/ARBOCK/caches/test_metapath_extracter_path_cutoff_3_excl_node_types_Disease-OligogenicCombination
Traceback (most recent call last):
  File "/home/nh4658/tools/ARBOCK/arbock.py", line 316, in <module>
    main()
  File "/home/nh4658/tools/ARBOCK/arbock.py", line 83, in main
    predict(**vars(args))
  File "/home/nh4658/tools/ARBOCK/arbock.py", line 180, in predict
    ordered_samples, predict_probas, explanations = decision_set_classifier.predict_and_explain(gene_pairs_w, metapath_dict)
  File "/home/nh4658/tools/ARBOCK/arbock/model/decision_set_classifier.py", line 148, in predict_and_explain
    X_test_list.append([sample_id, metapath_dict[sample_id]])
KeyError: ('ENSG00000165280', 'ENSG00000144136')

The gene pair I used is VCP-SLC20A1. Could you please help me understand why this is happening?

Second, regarding the predicted probabilities: every pair that does not exceed the threshold of 0.788 mentioned in the paper gets exactly the same value, 0.1209821. I checked this across a total of 180 pairs and observed the same outcome. Is the predicted probability for these pairs genuinely 0.1209821, or is this a fixed value assigned to pairs that do not exceed the threshold?

arenaux commented 3 months ago

Hi,

I am sorry for the delay in my response.

1.

After investigating, I could not reproduce the first issue.

When testing the same command line:

python arbock.py predict --model ./models/ds_model_no_pheno --input ~/test --gene_id_format Ensembl --gene_id_delim=, --prediction_output_folder ~/output --analysis_name test

with a file containing only:

ENSG00000165280,ENSG00000144136

I obtain this output without any error:

gene_name_A  gene_name_B  ens_id_A         ens_id_B         predicted_proba
VCP          SLC20A1      ENSG00000165280  ENSG00000144136  0.9083598963667406

I also tried running the following explain command before the predict one, and it caused no problem either:

python arbock.py explain --model ./models/ds_model_no_pheno --input ENSG00000165280,ENSG00000144136 --gene_id_format Ensembl --gene_id_delim=, --prediction_output_folder ~/output --analysis_name test

Maybe the cache is corrupted somehow? You could try adding the following parameter: --update_step_caches.

This will probably solve this issue. If not, could you please provide a minimal input to reproduce the error?
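To illustrate how a stale cache could produce this kind of KeyError, here is a minimal sketch. It assumes only what the traceback shows, namely that the metapath dictionary is indexed by gene-pair tuples. The helper `missing_pairs` and the placeholder IDs in `stale_cache` are hypothetical and not part of ARBOCK.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def missing_pairs(requested: List[GenePair],
                  metapath_dict: Dict[GenePair, list]) -> List[GenePair]:
    """Return the requested pairs that have no entry in the cached dictionary."""
    return [pair for pair in requested if pair not in metapath_dict]


# Hypothetical cache content left over from an earlier run with other inputs.
stale_cache = {("ENSG_PLACEHOLDER_A", "ENSG_PLACEHOLDER_B"): ["placeholder metapath"]}

# The pair from the traceback above.
requested = [("ENSG00000165280", "ENSG00000144136")]

missing = missing_pairs(requested, stale_cache)
print(missing)  # a direct stale_cache[pair] lookup on these would raise KeyError
```

If a check like this reports missing pairs, the cached result does not cover the current input, which is the situation that recomputing the caches is meant to fix.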

2.

The probability of 0.12 for this model is returned whenever no rules are matched in the decision set.

Quoting the paper Methods section:

"The model decision process was set up according to these criteria: (1) if a gene pair matches multiple rules in the decision set, the rule with the highest probability estimate is chosen; (2) If a gene pair does not match any of the rules, it is predicted as neutral with a probability estimate based on uncovered training instances."

The probability estimate for case (2) is based on this formula:

|uncovered_positive_instances| / (|uncovered_positive_instances| + |uncovered_negative_instances|)
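As a quick illustration of that formula, here is a small sketch. The function name and the counts are made up for the example; they are not the actual training counts behind the 0.12 value.

```python
def default_probability(n_uncovered_pos: int, n_uncovered_neg: int) -> float:
    """Fraction of positives among training instances not covered by any rule."""
    total = n_uncovered_pos + n_uncovered_neg
    if total == 0:
        raise ValueError("no uncovered training instances to estimate from")
    return n_uncovered_pos / total


# Illustrative counts only: 3 uncovered positives and 22 uncovered negatives.
p_default = default_probability(3, 22)
print(p_default)
```

Because the estimate depends only on the training set, every pair that matches no rule receives the same value, which is why it is identical across all of those pairs.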

Of course, since there is an optimal threshold, a gene pair can match a rule (and get the probability estimate associated with that rule) and still be predicted as neutral, if that probability is lower than the threshold. But if there is no rule match, the pair will always be predicted as neutral.
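The decision process described above can be sketched as follows. This is a simplified illustration, not the ARBOCK implementation: the rule representation, the rule probabilities (0.91 and 0.5), the label names, and the use of `>=` for the threshold comparison are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, bool]
Rule = Tuple[Callable[[Features], bool], float]  # (match test, probability estimate)


def predict_pair(features: Features, rules: List[Rule],
                 default_proba: float, threshold: float) -> Tuple[float, str]:
    """Apply criteria (1) and (2), then compare the result to the threshold."""
    matched = [proba for matches, proba in rules if matches(features)]
    # (1) highest-probability matching rule wins; (2) otherwise use the default.
    proba = max(matched) if matched else default_proba
    label = "positive" if proba >= threshold else "neutral"
    return proba, label


# Hypothetical decision set with two rules and made-up probability estimates.
rules: List[Rule] = [
    (lambda f: f["pattern_a"], 0.91),
    (lambda f: f["pattern_b"], 0.5),
]

both = predict_pair({"pattern_a": True, "pattern_b": True}, rules, 0.12, 0.788)
weak_only = predict_pair({"pattern_a": False, "pattern_b": True}, rules, 0.12, 0.788)
no_match = predict_pair({"pattern_a": False, "pattern_b": False}, rules, 0.12, 0.788)
print(both, weak_only, no_match)
```

In this sketch, `weak_only` matches a rule but still ends up neutral because its estimate is below the threshold, while `no_match` falls back to the default probability.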

Nahyeon203 commented 3 months ago

Hi, thank you for your response.

I implemented the solution you suggested, and it worked perfectly! Using the --update_step_caches parameter resolved the problem.
Regarding your explanation, you mentioned that if there is no matching rule for the input gene pairs, the result would be 0.12. Does this imply that even if there are matching rules, the result could still be below 0.788? Could you perhaps provide an example of a gene pair that matches a rule but results in a value below 0.788?
