krishnanlab / geneplexus_app

BSD 3-Clause "New" or "Revised" License

add model parameter selection guidance to help page #192

Closed billspat closed 2 years ago

billspat commented 2 years ago

From Chris/Arjun:

Text to be added and changed on the Help page. Insert the following right under the choosing negative genes section.

Guidelines for Choosing the Best Job Settings

  1. The first step is choosing the network. If your gene set of interest comes from a curated database, or was generated while studying a specific process, pathway, or disease, the best network to choose is STRING, because it is a highly curated network built using prior knowledge from gene set databases. If you would like to consider only experimental interactions, use STRING-EXP; if you would further like to consider only physical interactions, choose BioGRID. The GIANT-TN network is best in two cases. First, since it offers the highest gene coverage, it lets the user see predictions for many more understudied genes. Second, as GIANT-TN is a very dense network that does not directly incorporate gene set database information, it performs well on larger gene sets that may be derived from high-throughput experiments.
  2. The next step is to choose how the network is represented as features in the machine learning model.
     a. For BioGRID, using Adjacency or Embedding usually results in very similar performance.
     b. For STRING-EXP, Adjacency is usually the best feature representation.
     c. For GIANT-TN, Influence is usually the best representation.
     d. For STRING, if your input gene set is smaller or is similar to a specific biological process, Adjacency is the best choice. If the gene set is larger or corresponds to a complex phenotype, use Influence.
  3. The next step is to choose the background used to determine which genes serve as negative examples in the machine learning model. If your gene set corresponds to a biological process or pathway, choose GO. If it corresponds more closely to a disease or a complex phenotype, choose DisGeNet.
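
The three guidelines above can be read as a small decision procedure. The following is a minimal sketch of that reading, not part of GenePlexus itself: the function name `suggest_settings` and its parameters are hypothetical, and the order in which the network rules are checked is an assumption. It also does not model choosing GIANT-TN purely for its coverage of understudied genes.

```python
def suggest_settings(from_high_throughput, disease_or_complex, large_set,
                     interactions="any"):
    """Suggest job settings following the help-page guidelines.

    All parameter names are illustrative assumptions:
      from_high_throughput: gene set derived from a high-throughput experiment
      disease_or_complex:   gene set corresponds to a disease or complex phenotype
      large_set:            gene set is on the larger side
      interactions:         "any", "experimental", or "physical"
    """
    # Step 1: network.
    if interactions == "physical":
        network = "BioGRID"
    elif interactions == "experimental":
        network = "STRING-EXP"
    elif interactions != "any":
        raise ValueError(f"unknown interactions value: {interactions!r}")
    elif from_high_throughput:
        network = "GIANT-TN"
    else:
        network = "STRING"

    # Step 2: feature representation.
    if network in ("BioGRID", "STRING-EXP"):
        # For BioGRID, Embedding usually performs about as well as Adjacency.
        features = "Adjacency"
    elif network == "GIANT-TN":
        features = "Influence"
    else:  # STRING
        features = "Influence" if (large_set or disease_or_complex) else "Adjacency"

    # Step 3: background for negative examples.
    negatives = "DisGeNet" if disease_or_complex else "GO"

    return {"network": network, "features": features, "negatives": negatives}
```

For example, a small curated pathway gene set, `suggest_settings(False, False, False)`, maps to STRING with Adjacency features and GO negatives. The suggestion is only a starting point; the cross-validation score remains the way to judge the result.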

The best way to determine whether the chosen job options worked well is to look at the cross-validation score at the top of the results page. It can be useful to compare the cross-validation score for a few different combinations of job options to find the optimal set. Additionally, the figure below shows a summary of results generated from our recent work benchmarking GenePlexus and could help a user pick the best job parameters.

< right here have the figure currently in help with the caption >
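
Comparing cross-validation scores across job options amounts to a small bookkeeping exercise. The sketch below is illustrative only: it lists the option combinations named in the guidelines and picks the best one from scores the user has copied from the results page. The function names are hypothetical, and it assumes that every feature type can be paired with every network and that a higher cross-validation score is better.

```python
from itertools import product

# Option names as they appear in the guidelines above.
NETWORKS = ["BioGRID", "STRING", "STRING-EXP", "GIANT-TN"]
FEATURES = ["Adjacency", "Embedding", "Influence"]
NEGATIVES = ["GO", "DisGeNet"]


def all_combinations():
    """Return every (network, features, negatives) combination."""
    return list(product(NETWORKS, FEATURES, NEGATIVES))


def best_combination(scores):
    """Pick the combination with the highest recorded score.

    scores: dict mapping (network, features, negatives) tuples to the
    cross-validation score noted by the user for that job.
    Assumes a higher score means a better model.
    """
    if not scores:
        raise ValueError("no scores recorded")
    return max(scores, key=scores.get)
```

In practice only a few combinations need to be run; the guidelines narrow down which ones are worth trying first.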

Effect of the user-supplied gene set on model performance

The GenePlexus method has been extensively benchmarked on models that used between 10 and ~400 genes in the training set. We have found that, while there is some decrease in performance as the number of genes in the gene set increases (far left panel in the figure below), the major driving force is how “connected” the genes are in the network (center and right panels in the figure below). This is understandable, as GenePlexus heavily leverages the underlying network when training the model. We have found that the GIANT-TN network shows the least decrease in performance as the gene set size increases, and thus we recommend this network for larger gene sets such as those generated directly from high-throughput experiments. As mentioned above, the cross-validation score is very useful in determining whether a given model worked well on the user-supplied gene list.

<additional figure ( to be sent or attached) >

This figure shows results from the paper Supervised learning is an accurate method that looks into how different properties of the gene set affect model performance. Here, edge density is a measure of how connected a gene set is to itself in the network, and segregation is how isolated the gene set is from the rest of the network. While there is a decrease in performance as the number of genes increases, the major driving force for the model is how connected the gene set is in the network. The results here are shown for the STRING network.
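
To make the two gene set properties more concrete, here is a minimal sketch of one plausible way to compute them on an unweighted network. These formulas are assumptions chosen for illustration and may differ from the definitions used in the paper: edge density is taken as the fraction of possible gene-gene pairs within the set that are connected, and segregation as the fraction of edges touching the set that stay inside it. The function names and the toy network are hypothetical.

```python
def edge_density(edges, gene_set):
    """Fraction of possible within-set pairs that are connected.

    edges: iterable of (gene_a, gene_b) pairs, assumed undirected,
    unique, and free of self-loops.
    """
    genes = set(gene_set)
    n = len(genes)
    if n < 2:
        return 0.0
    within = sum(1 for u, v in edges if u in genes and v in genes)
    return within / (n * (n - 1) / 2)


def segregation(edges, gene_set):
    """Fraction of edges touching the gene set that stay inside it."""
    genes = set(gene_set)
    within = sum(1 for u, v in edges if u in genes and v in genes)
    touching = sum(1 for u, v in edges if u in genes or v in genes)
    return within / touching if touching else 0.0


# Toy network: A, B, C form a triangle; C also links out to D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
toy_set = ["A", "B", "C"]
print(edge_density(toy_edges, toy_set))  # 1.0
print(segregation(toy_edges, toy_set))   # 0.75
```

In the toy example the gene set is fully connected internally (density 1.0), and three of the four edges that touch it stay within it (segregation 0.75).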

ChristopherMancuso commented 2 years ago

WebServer_Connectivity.pdf

ChristopherMancuso commented 2 years ago

that is the pdf, let me know if you need something else

billspat commented 2 years ago

It's a vector PDF so will work perfectly! Thank you!

jacobnewsted commented 2 years ago

Merged into dev