mims-harvard / TxGNN

TxGNN: Zero-shot prediction of therapeutic use with geometric deep learning and clinician centered design
https://zitniklab.hms.harvard.edu/projects/TxGNN
MIT License
124 stars · 29 forks

KeyError: 'DB00001' in get_scores_disease function #13

Closed l4b4r4b4b4 closed 2 weeks ago

l4b4r4b4b4 commented 2 weeks ago

I was able to successfully load the model to the GPU and kick off inference.

However, when I try to get back the scores for the prediction, I get an error, no matter which disease_id I use...


File "/opt/conda/lib/python3.11/site-packages/txgnn/TxEval.py", line 21, in eval_disease_centric
txgnn  |     self.out = disease_centric_evaluation(self.df, self.df_train, self.df_valid, self.df_test, self.data_folder, self.G, self.best_model,self.device, disease_idxs, relation, self.weight_bias_track, self.wandb, show_plot, verbose, return_raw, simulate_random, only_prediction)
txgnn  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
txgnn  |   File "/opt/conda/lib/python3.11/site-packages/txgnn/utils.py", line 2018, in disease_centric_evaluation
txgnn  |     preds_, labels_, drug_idxs, drug_names = get_scores_disease(
txgnn  |                                              ^^^^^^^^^^^^^^^^^^^
txgnn  |   File "/opt/conda/lib/python3.11/site-packages/txgnn/utils.py", line 1959, in get_scores_disease
txgnn  |     [id2name_drug[idx2id_drug[i]] for i in drug_nodes],
txgnn  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
txgnn  |   File "/opt/conda/lib/python3.11/site-packages/txgnn/utils.py", line 1959, in <listcomp>
txgnn  |     [id2name_drug[idx2id_drug[i]] for i in drug_nodes],
txgnn  |      ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
txgnn  | KeyError: 'DB00001'
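The traceback shows a plain dictionary lookup failing: the node-index-to-DrugBank-ID map resolves fine, but the ID-to-name map has no entry for 'DB00001'. A minimal sketch of that failure mode, with illustrative data rather than TxGNN's actual mappings:

```python
# Illustrative stand-ins for TxGNN's internal mappings (not the real data):
# idx2id_drug maps graph node indices to DrugBank IDs; id2name_drug maps
# DrugBank IDs to drug names. If id2name_drug was built from a misparsed
# nodes.csv, the lookup raises KeyError even though the index map is fine.
idx2id_drug = {0: "DB00001", 1: "DB00002"}
id2name_drug = {"DB00002": "Bivalirudin"}  # "DB00001" missing, e.g. bad CSV parse

drug_nodes = [0, 1]
try:
    names = [id2name_drug[idx2id_drug[i]] for i in drug_nodes]
except KeyError as e:
    print(f"missing drug id: {e}")  # → missing drug id: 'DB00001'

# Defensive alternative that surfaces the gap without crashing:
names = [id2name_drug.get(idx2id_drug[i], f"<unknown {idx2id_drug[i]}>")
         for i in drug_nodes]
print(names)  # → ['<unknown DB00001>', 'Bivalirudin']
```

A KeyError here therefore points at the mapping dictionaries being incomplete, not at the disease ID that was passed in.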
kexinhuang12345 commented 2 weeks ago

Hey which disease id did you use?

l4b4r4b4b4 commented 2 weeks ago

> Hey which disease id did you use?

5819.0.

I'm currently trying to understand what dataset split I need to load in order to be able to make predictions on a given disease ID or a list of given IDs.

I don't really understand the section in the README on loading a disease_eval split and defining an ID before passing a list of disease IDs to eval_disease_centric.

kexinhuang12345 commented 2 weeks ago

Have you looked at this demo notebook? https://github.com/mims-harvard/TxGNN/blob/main/TxGNN_Demo.ipynb This one runs.

So the idea is that during training, the model only fine-tunes on a small set of known indication drug-disease pairs. During inference, we want to infer on all the drugs for a given disease, like a small virtual screening. That is why, no matter what the data split is, it is always useful to get the evaluation output.

l4b4r4b4b4 commented 2 weeks ago

> have you looked at this demo notebook? https://github.com/mims-harvard/TxGNN/blob/main/TxGNN_Demo.ipynb this runs
>
> So the idea is that during training, the model will only fine-tune on a small set of known indications drug-disease pairs. During inference, we want to infer on all the drugs given a disease, like a small virtual screening. that is why no matter what is the data split, it is always useful to get the evaluation output.

Yes, I have. I have a feeling this might be connected to the underlying CSV files not being the right ones.

The download links in TxData are no longer up to date. I updated them to the following:

 data_download_wrapper(
            "https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/IXA7BM/1805e679c4c-72137dbedbf1?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27kg.csv&response-content-type=text%2Fcsv&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241007T075549Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20241007%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=0e04af054b75fd6928054d2209e3c1826d47fbf68f5a4898e783166684582cd8",
            os.path.join(self.data_folder, "kg.csv"),
        )
        data_download_wrapper(
            "https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/IXA7BM/1805e69f00e-fcf0acc588bb.orig?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27nodes.csv&response-content-type=text%2Fcsv&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241007T081502Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20241007%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=fb42acf98df1950a329a3555a9f13a10157ea2746c6e29ee87ff538e4ca2a1c5",
            os.path.join(self.data_folder, "nodes.csv"),
        )
        data_download_wrapper(
            "https://dvn-cloud.s3.amazonaws.com/10.7910/DVN/IXA7BM/1805e69de19-31377b621f41?response-content-disposition=attachment%3B%20filename%2A%3DUTF-8%27%27edges.csv&response-content-type=text%2Fcsv&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241007T081358Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=AKIAIEJ3NV7UYCSRJC7A%2F20241007%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=36e39ecbb4e885d09a9b2a6945f8311af55cd72972b09fb7458ae3de95d35488",
            os.path.join(self.data_folder, "edges.csv"),
        )

Also, why does it load and read nodes.csv as tab-delimited and not as comma-separated?
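The delimiter question matters here, because reading a tab-separated file with pandas' default comma separator silently fuses each row into a single column, which would leave the ID/name mappings empty or malformed. A standalone illustration (not TxGNN code):

```python
# Reading a tab-separated file with the default sep="," collapses each row
# into one column; sep="\t" parses it correctly.
import io
import pandas as pd

tsv = "node_id\tnode_name\nDB00001\tLepirudin\n"

wrong = pd.read_csv(io.StringIO(tsv))            # default sep=","
right = pd.read_csv(io.StringIO(tsv), sep="\t")  # explicit tab delimiter

print(wrong.shape)  # (1, 1): id and name fused into one column
print(right.shape)  # (1, 2): id and name parsed into separate columns
```

So if the freshly downloaded nodes.csv is actually comma-separated while the loader passes a tab separator (or vice versa), the resulting dictionaries would miss entries like 'DB00001'.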

l4b4r4b4b4 commented 2 weeks ago

OK, after some debugging, here is the likely solution:

  1. Pandas removed DataFrame.append in version 2.0 (it was deprecated in 1.4).
  2. I refactored random_fold and complex_disease_fold to use pd.concat instead.

Now inference is successful.
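For reference, the refactor pattern looks like the following sketch (variable names are illustrative, not the actual txgnn code):

```python
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# the replacement is pd.concat over a list of frames.
import pandas as pd

rows = pd.DataFrame({"x_id": ["DB00001"], "y_id": ["5819.0"]})
new_row = pd.DataFrame({"x_id": ["DB00002"], "y_id": ["5819.0"]})

# pandas < 2.0:
# rows = rows.append(new_row, ignore_index=True)

# pandas >= 2.0:
rows = pd.concat([rows, new_row], ignore_index=True)
print(len(rows))  # → 2
```

When appending many rows in a loop, collecting the frames in a list and calling pd.concat once at the end is also considerably faster than repeated concatenation.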

I will refactor the other DataFrame .append instances in the codebase and check what effect torch.compile has on inference time over the test set.