kdahlquist / GRNmap

Gene Regulatory Network modeling and parameter estimation
BSD 3-Clause "New" or "Revised" License

Compute graph statistics for db-derived networks with Gephi #290

Closed kdahlquist closed 7 years ago

kdahlquist commented 7 years ago

@khorstmann and @maggie-oneil will compute graph statistics for the 6 networks using Gephi.

bklein7 commented 7 years ago

Here is a direct link to the input (and output) workbooks for the five "15"-gene networks: https://github.com/kdahlquist/DahlquistLab/tree/master/data/GRNmap_input_workbooks. This includes two separate CIN5 networks.

kdahlquist commented 7 years ago

Comment from @maggie-oneil: I looked into the outputs we were getting for the strongly connected component statistic. The number varies widely because a strongly connected component only groups vertices that can all reach one another along directed paths, so a single missing return path splits a network into more components. The sources I looked at describe it as the "maximal set of vertices C ⊆ V such that for every pair of vertices u and v, there is a directed path from u to v and a directed path from v to u." https://goo.gl/Cvnv5K

Also looked more into the algorithm Gephi is using to calculate this statistic, and it's called Tarjan's strongly connected component algorithm. Here's the paper written by Tarjan on his algorithm - https://goo.gl/9SsU4V, and here's a link to the code Gephi uses, which can be found on GitHub https://goo.gl/lzWSLp
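For anyone who wants to check the component counts by hand, here is a minimal Python sketch of Tarjan's algorithm. It is not Gephi's code; the function name, the dictionary input format, and the toy network are assumptions made for illustration only.

```python
def strongly_connected_components(graph):
    """Return the strongly connected components of a directed graph.

    graph: dict mapping each node to a list of the nodes it points to
    (hypothetical input format, not Gephi's).
    """
    index_of = {}      # discovery order of each visited node
    lowlink = {}       # smallest discovery index reachable from the node
    stack = []         # nodes of the components still being built
    on_stack = set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for target in graph.get(node, []):
            if target not in index_of:
                visit(target)
                lowlink[node] = min(lowlink[node], lowlink[target])
            elif target in on_stack:
                lowlink[node] = min(lowlink[node], index_of[target])
        # A node whose lowlink equals its own index is the root of a component.
        if lowlink[node] == index_of[node]:
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index_of:
            visit(node)
    return components


# Hypothetical toy network: A -> B -> C -> A is a cycle; D only receives an edge.
toy = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
components = strongly_connected_components(toy)
```

In the toy network, A, B, and C end up in one component and D is a component by itself, which shows why a gene with no return path adds to the component count. The recursion is fine for networks of this size but would need an iterative version for very large graphs.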

bklein7 commented 7 years ago

The following article, discussed during our 1/19 meeting, assesses the use of different centrality graph statistics for isolating gene-disease associations in a data-mined GRN. The authors hypothesized that "central genes" in the network would be more likely to be associated with the diseases they studied.

http://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/24/13/10.1093/bioinformatics/btn182/2/btn182.pdf?Expires=1485219542&Signature=gGDiM3clu23ze4fdk0JUNgY6P7EYBqZW7L4b1-yCWHF8tZw1lAUnrU2MgOSKW0Zu6oxN3BSd94RMx7OLhOFZzLaffPPX2~277J65ylf1lyqhOU84AIBWYeOUNi~DIfX5il5pInjyuMoiA5~Ar9UblNpKPAWNZD6Ai5IXS47BI1px-wx9Y1G7jE0sC6fWJIN5AUU9ryX0ZbQa1dpkw8w5oJEr1B-zatWPYVK39MIAUWt5wvRqM0So8bsE07yqoXx0yJCKvFh5~9E98EKbNPGZdV49Q-Ix6LtxcHCfoGzxq4tfJijzzwOCb9bvVD3KQQ7PfS9NXz5EeA2wstQPT5ZStg__&Key-Pair-Id=APKAIUCZBIA4LVPAVW3Q

kdahlquist commented 7 years ago

Here is a link to the PowerPoint that has all the graph stats for each of the six networks. http://www.openwetware.org/images/e/e2/GephiOutputsFall16.ppt

kdahlquist commented 7 years ago

@khorstmann, do you have the Excel files with the Gephi results you ran on all the networks at the end of last semester? Can you post them somewhere and put a link in a comment on this issue? Thanks.

khorstmann commented 7 years ago

15-genes_28-edges_BK-dHAP4-fam_Sigmoid_estimation_GephiOutput.xlsx
14-genes_35-edges_BK-dGLN3-fam_Sigmoid_estimation_GephiOutput.xlsx
14-gene_25-edges_NW_dCIN5_fam_Sigmoid_estimation_GephiOutputs.xlsx
16-genes_27-edges_BK-KD-dZAP1-fam_Sigmoid_estimation_output_Gephi_Output.xlsx
16-genes_36-edges_NW-wt-fam_Sigmoid_estimation_output_Gephi_Output.xlsx

These should be all the Gephi outputs from last semester. Maggie and I will be focusing on HAP4.

maggie-oneil commented 7 years ago

Here is the completed distance matrix, along with sheets for the original adjacency matrix and the weighted network I based it on. I am 95% sure it's correct, but some of the arrows were difficult to identify, so it should be double-checked before being used. Distance-Matrix_From-Unweighted-Adjacency-15-genes_28-edges_GJ-dHAP4-fam_strains-added_Sigmoid_Estimation.xlsx
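One way to double-check a hand-built distance matrix is to recompute it from the unweighted adjacency matrix with a breadth-first search. The sketch below is only an illustration: the function name is hypothetical, and it assumes the adjacency matrix has regulators in the columns and targets in the rows, which should be confirmed against the workbook before use.

```python
from collections import deque


def distance_matrix(adjacency):
    """Shortest directed path lengths (counted in edges) between all gene pairs.

    Assumption: adjacency[target][source] is nonzero when source regulates
    target (columns are regulators, rows are targets).
    distances[s][t] is the path length from s to t, or None if t is unreachable.
    """
    n = len(adjacency)
    distances = [[None] * n for _ in range(n)]
    for start in range(n):
        distances[start][start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for target in range(n):
                # Follow each edge current -> target that has not been reached yet.
                if adjacency[target][current] and distances[start][target] is None:
                    distances[start][target] = distances[start][current] + 1
                    queue.append(target)
    return distances


# Hypothetical 3-gene chain: gene 0 -> gene 1 -> gene 2.
toy_adjacency = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
]
toy_distances = distance_matrix(toy_adjacency)
```

Comparing this output cell by cell with the spreadsheet would flag any arrow that was read in the wrong direction.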

kdahlquist commented 7 years ago

@khorstmann will do further work on #325; this issue has been renamed to refer to the db-derived networks only and assigned solely to @maggie-oneil as discussed at the meeting. We are going to give each person on the team his or her own issues to make it easier to track what is going on with more granularity.

Only do unweighted networks at this point because we are still unsure how Gephi is incorporating the weight information into the graph statistics.

Since you have already run the stats in Gephi, go ahead and start compiling some descriptive statistics.

@khorstmann and @maggie-oneil should make an Excel spreadsheet analogous to what @bklein7 did for the weight parameters. I.e., list all the 28 genes for all the networks and then make columns for each of the graph stats for each of the random networks.

A workbook with the list of genes already exists here: https://github.com/kdahlquist/DahlquistLab/blob/master/data/GRNmap_input_workbooks/GRN_Gene_Lists.xlsx

Then you can start doing some descriptive statistics, like mean, median, max, min, standard deviation and we can more easily compare the data across networks.

We will also think about ways to plot the data.
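As a cross-check on the spreadsheet, the descriptive statistics listed above can also be computed with Python's standard `statistics` module. This is a minimal sketch; the `describe` helper and the toy values are made up for illustration and are not taken from the Gephi outputs.

```python
import statistics


def describe(values):
    """Return the descriptive statistics named in the comment for one column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        # Sample standard deviation (n - 1 in the denominator).
        "stdev": statistics.stdev(values),
    }


# Hypothetical values for one graph statistic across four genes.
toy_degrees = [2, 4, 4, 6]
summary = describe(toy_degrees)
```

Running `describe` once per graph statistic and per network gives one summary row each, which makes the networks easy to compare side by side.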

kdahlquist commented 7 years ago

At the meeting @maggie-oneil said that she was essentially done with compiling the graph stats for the db1-6 networks. She will go first at the next meeting so we can review. Please post a link to the file on this issue.

maggie-oneil commented 7 years ago

Completed compilation of raw Gephi stats. Output can be found here https://github.com/kdahlquist/DahlquistLab/blob/master/data/15-gene_networks_analysis/Gephi_node_stats_all_6_db_MO_02232017.xlsx

kdahlquist commented 7 years ago

@maggie-oneil, as discussed in the meeting, we are interested in the per-gene statistics, not just the overall stats on the totality of the edge weights.

What we want to know is the sum and average of the in-degree weights (the rows) and the out-degree weights (the columns) for each gene. You basically have this information already for the sum; you just need to organize it. For computing the average, we need to be careful not to include the zeros, just the positive and negative numbers. You can probably tweak the Excel formula with an "if" statement to include just the non-zero weights in the computation.
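The per-gene sums and non-zero averages can be sketched in Python as follows. The function name and the toy weight matrix are hypothetical, and the code assumes the layout described above: each row holds a gene's incoming weights and each column holds a gene's outgoing weights. In Excel, an `AVERAGEIF` with the criterion `"<>0"` should give the same non-zero average.

```python
def per_gene_weight_stats(weights):
    """Sum and non-zero average of incoming (row) and outgoing (column) weights.

    Assumption: weights[i][j] is the weight of the edge from gene j to gene i.
    Returns two lists of (sum, average) tuples, one entry per gene; the average
    is None when a gene has no non-zero weights.
    """
    def summarize(values):
        nonzero = [v for v in values if v != 0]
        average = sum(nonzero) / len(nonzero) if nonzero else None
        return sum(values), average

    n = len(weights)
    incoming = [summarize(weights[i]) for i in range(n)]
    outgoing = [summarize([weights[i][j] for i in range(n)]) for j in range(n)]
    return incoming, outgoing


# Hypothetical 3-gene weight matrix.
toy_weights = [
    [0.0, 0.5, -1.5],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]
incoming, outgoing = per_gene_weight_stats(toy_weights)
```

Returning `None` for genes with no non-zero weights avoids a division by zero and keeps those genes distinguishable from genes whose weights happen to average to zero.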

Once you have that, please post the results to the repo and make a comment on the issue so @bklein7 knows where they are.

We are interested in this because we would hope to see a relationship where if a gene was up-regulated, the incoming weights would either sum or average (or both) to be positive, and conversely if the gene is down-regulated, the incoming weights would either sum or average to be negative.

We are also interested in knowing the relationship between these values and the MSEs for genes.

kdahlquist commented 7 years ago

I just realized that the comment above really belongs to issue #328; I'm going to copy it over there.

What I'm wondering now, @maggie-oneil, is about the stats you compiled for the db-derived networks--were these for weighted or unweighted networks?

kdahlquist commented 7 years ago

This is related to #328. I believe that this has been done and compiled by @maggie-oneil. She should comment on this issue with the link to both the raw and compiled data files and we can close this one.

kdahlquist commented 7 years ago

They are there, so closing.