A few questions - Githubissues

seoanezonjic commented 5 years ago

Hi gzampieri This is a great job and I'm interested to use this code in some projects but I have a few problems:

The documentation makes reference to the script compute_kernel.py to perform the kernel computation but this script is missing.
The documentation describes briefly the input files, but a working sample (with the input and the correponding output files) would be very useful to check if the configuration is correct in my computer.
Inspecting the validation code, the unbiased testing is defined but the validation with the diseasome data is missing. I would like to know how you developed the ROC curves. I don't fully understand the process reading the original paper and I´m interested in replicate it.
Respecting the validation, is the unbiased way used in the benchmarkink with the HPO-gene relations? Thank you in advance Pedro Seoane

gzampieri commented 5 years ago

Dear Pedro,

thank you for your interest.

You are right, that script is missing. I'm going to upload it onto the repository along with some example as you suggest.

As for the tests with the OMIM annotations, I think you can ask a former collaborator of mine https://scholar.google.com/citations?hl=it&user=905E_HoAAAAJ.

The HPO annotations were used as additional test gene-disease associations in the "unbiased evaluation" presented in the paper.

Kind regards, Guido

seoanezonjic commented 5 years ago

Hi gzampieri Thank you for your quick response and the repository update. I'll use the kernel script soon to get the results in my project. The last commit do not include the sample files ¿Do you think to include them soon? If not, I'll toy with my data to adapt it to your tools. Also, I'll contact soon to Van in order to ask him about the diseasome validation. Thank you very much Pedro Seoane

gzampieri commented 5 years ago

Hi Pedro,

yes, I think probably in one or two days I should be able to add the rest.

Cheers, Guido

Il giorno lun 25 mar 2019 alle ore 14:19 seoanezonjic < notifications@github.com> ha scritto:

Hi gzampieri Thank you for your quick response and the repository update. I'll use the kernel script soon to get the results in my project. The last commit do not include the sample files ¿Do you think to include them soon? If not, I'll toy with my data to adapt it to your tools. Also, I'll contact soon to Van in order to ask him about the diseasome validation. Thank you very much Pedro Seoane

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/gzampieri/Scuba/issues/1#issuecomment-476217560, or mute the thread https://github.com/notifications/unsubscribe-auth/ATLVntTyGAVXXgZofLDTNvAQ-uB-xTGgks5vaNsJgaJpZM4cB8jW .

seoanezonjic commented 5 years ago

Hi Guido No problem, I'll wait to your new commit then. Again, thank you very much for your effort Pedro Seoane

gzampieri commented 5 years ago

Hi Pedro,

sorry for the delay. I took the chance to update the code so that kernels are allowed to cover non-fully overlapping gene sets.

Best, Guido

seoanezonjic commented 5 years ago

Hi Guido No problem with the delay. I understand that takes time to check the code and to update the documentation. I'll try to use it and if in case of any problem I'll write you. Thanks a lot by your effort. Pedro Seoane

seoanezonjic commented 5 years ago

Hi Guido Some minor issues with the last commit (take into account that I work in a cluster all the time):

The Toy sample folder with a space character in the name is a pain in the commandline.
It would be great that the scripts (compute_kernel and launcher) would include the python shebang as #! /usr/bin/env python
The save_details method in input_output.py has identation issues. These details are the bunch that I have found when I have used your toy data. Also, it would be useful an example of the evaluation mode. With the toy data it is possible to do this? Thank you Pedro Seoane

gzampieri commented 5 years ago

Hi Pedro,

thanks for the comments.

I added all my gene lists for the validation, including 42 test cases. Each one focuses on the retrieval of a single gene (e.g. ENSG00000004961), which identifies the corresponding files in Validation_data/Training_lists_Ensembl and Validation_data/Candidate_lists_Ensembl. I hope it helps.

Best, Guido

seoanezonjic commented 5 years ago

Hi Guido Thank you very much for the last changes. The validation data will be very useful and this week I'll do several attempts with it. By the way, I would like to ask the following question: How do you compute the ROC curve in the unbiased way? I understand that the unlabelled genes are you True positives but, Which genes are you True negatives? Again, thank you very much by all your help Pedro Seoane

seoanezonjic commented 5 years ago

Hi Guido The last week, I have working with Scuba using the validation data of the last commit. I have tried to replicate some of your paper results, focused on the AUC per kernel of your supplemental files (table S2). I have done the following: 1) Download data from String 8.2 2) Using the aliases file from String 8.2 I have translated the ENSP ids to ENSG ids as training and candidate files are. 3) With the String data, I have build the matrix and the Index file (taking into account your file name conventions) 4) With the string matrix I have executed the following: compute_kernel.py --input-matrix validation_output/string_val_matrix --kernel-function MDK --output validation_output/string_val_Matrix_MDK2 -p 2 5) Once I have the kernel, I have executed the validation (candidates_unbiased_Ensembl.txt contains paths to _Candidate_listsEnsembl files and seeds_unbiased_Ensembl.txt contains paths to _Training_listsEnsembl files) as: launcher.py --kernel-list validation_output/kernel_list --disease-genes validation_output/seeds_unbiased_Ensembl.txt --candidate-genes validation_output/candidates_unbiased_Ensembl.txt --output validation_output/my_output -f 4 -i 11 --test unbiased In the output, Scuba using the kernel data complains about 6-7 training lists that miss 1 to 3 genes but in candidate lists, all the lists miss around 15-20 genes per list even 102 and 150 genes in two of them.

When I have the final results (my_output_results.txt) look as:

auc 0.57 auc30 0.203 candidates auc 0.52 candidates auc30 0.165 candidates lowest 100.0 candidates median 42.78 candidates std 30.91 candidates tpr01 0.0 candidates tpr05 2.4 candidates tpr10 11.9 candidates tpr15 14.3 candidates tpr20 16.7 candidates tpr30 38.1 lowest 98.92 median 38.74 std 28.57 tpr01 2.4 tpr05 9.5 tpr10 14.3 tpr15 21.4 tpr20 26.2 tpr30 35.7

The AUC scores are around ~0.2 less than the indicated in the S2 table ¿What is the setting that it is wrong in my execution? I took a look in the evaluation code in order to understand which parameter is affecting the results but calculation of the AUC and TPR are not very clear to me, I would like some clarification on how they are computed to understand the problem. Thank you in advance Pedro Seoane

gzampieri commented 5 years ago

Hi Pedro,

it's not totally clear what the cause might be from this information alone. Did you take the weighted version of the String network? A performance drop could arise when considering just binary links. Another, even concurrent, possibility is that some key genes are lost when processing the String data. How many genes do you have in the kernel matrix? It's normal to lose some seeds or candidates, but maybe a part of the network is missing somehow.

I have added my email on my profile, if you contact me I could share some of my data to help you solve this.

Best, Guido

gzampieri commented 5 years ago

P.S. For the AUC and TPR calculation, the 42 target genes are assumed to be the only positive examples (that we know of), while all other candidates are taken as negative. This differs from the training phase, during which known disease genes are the positive examples and all candidates are unlabelled.

seoanezonjic commented 5 years ago

Hi Guido Sorry for my late response. My input matrix has 17077 genes and the connections are weighted by their String score. One thing to note is that the main diagonal of the matrix is set to 0. Inspecting your kernel script, all the kernels set to 0 the main diagonal of the adjacency matrix and I think that this should not affect the results, but I mention this detail just in case. By the way, working with the kernels, the MDK (Markov diffusion kernel) gives different results using the adjacency matrix with the main diagonal set to 0 or not, whereas the other kernels are not affected. Is this the expected behavior? I'll send you and email in order to stay in touch and you can decide if close this issue or keep it open. Thank you Pedro Seoane

gzampieri / Scuba

A few questions #1