Generate and vet input workbooks for 5 database-derived networks

kdahlquist commented 8 years ago

@bklein7 and @Nwilli31 will now work on generating 5 input workbooks for the 5 database-derived networks that the team started last year. The five networks are:

15-gene, 28-edge, dZAP1 family (originally from @khorstmann)
15-gene, 28-edge, dHAP4 family (originally from @GraceJohnson and @maggie-oneil)
14-gene, 35-edge, dGLN3 family (originally from @tessaam)
wt family network (originally from @Nwilli31)
dCIN5 family network (originally from @kjacks48)

Each of you will create 2 or 3 networks, then swap and double-check each other's work.

Instructions for how to format the input worksheets are found on the GRNmap wiki here.

We want to carefully check each part of the input workbook.

Go back to YEASTRACT and input the list of genes into their generate regulation matrix function and make sure that all the connections are right. If they are not, use the new ones and note the day/time it was generated.
Make sure that all genes are listed in alphabetical order (top to bottom, left to right) on each worksheet.
Degradation rates (and derived production rates) will be the new ones that @Nwilli31 got from the Neymotin paper.
The expression data will come from the processed dataset @kdahlquist submitted to NCBI; a copy of it is found in the DahlquistLab repository here (use the matrix worksheet).
- include expression data for all strains that are represented in the network (wt, dCIN5, dGLN3, dHAP4, dHMO1, dZAP1)
Right now, GRNmap cannot handle missing values in the expression sheets, so you will need to carefully go through and highlight cells with missing data in a different color (so we can see them later) and then put the average of the remaining values for that gene and timepoint in the cell. Paste values over the formula so that the formula is not there when the workbook is loaded into GRNmap. Preserve the cell coloring so that we can see later where the missing values were. Use the same number of decimal places (4) as the rest of the data.
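
The missing-value step can be sketched in a few lines of Python. This is only an illustration, assuming each gene's row has been read into a list aligned with the sheet's timepoint headers and that missing cells are represented as `None`; the function name `fill_missing` and the sample numbers are made up for the example and are not part of GRNmap or the lab dataset.

```python
from statistics import mean


def fill_missing(timepoints, values, decimals=4):
    """Replace None entries with the mean of the other replicates at the same timepoint.

    timepoints: column headers for one strain's sheet, e.g. [15, 15, 30, 30, 30]
    values: one gene's log2 expression values, aligned with timepoints
    Returns (filled_values, filled_indexes); the indexes mark the cells that
    should stay highlighted in the workbook.
    """
    filled = list(values)
    filled_indexes = []
    for i, (t, v) in enumerate(zip(timepoints, values)):
        if v is not None:
            continue
        # Only the remaining (non-missing) replicates at this timepoint count.
        replicates = [x for tp, x in zip(timepoints, values)
                      if tp == t and x is not None]
        if not replicates:
            raise ValueError(f"no replicate values left at timepoint {t}")
        filled[i] = round(mean(replicates), decimals)
        filled_indexes.append(i)
    return filled, filled_indexes


# Hypothetical row: made-up numbers, not values from the lab dataset.
timepoints = [15, 15, 15, 30, 30, 30]
row = [0.5, None, 1.5, -0.25, -0.75, None]
filled_row, highlight = fill_missing(timepoints, row)
print(filled_row)
print(highlight)
```

Because the function returns plain numbers rather than formulas, writing `filled_row` back to the sheet is equivalent to the paste-values step, and `highlight` lists the columns whose cell coloring needs to be preserved.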

kdahlquist commented 8 years ago

Put the workbooks (in progress) in the Dahlquist Lab repository here: https://github.com/kdahlquist/DahlquistLab/tree/master/data/GRNmap_input_workbooks

Some instructions can also be found on the Microarray Analysis Workflow on OWW

We want to make a "gold standard" set of instructions here on how to make the input workbooks, so please feel free to update that page.

kdahlquist commented 8 years ago

Note: when using the Neymotin data, use the systematic name (e.g., YKL134C) to be absolutely sure you have the right gene. Sometimes synonyms get used, and I can see that some gene names have problems due to datatype conversion issues in Excel.
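
A quick way to catch identifiers that are not systematic names (including ones Excel has mangled) is to test them against the standard pattern for nuclear yeast ORF names. The sketch below assumes the identifier column has already been read into a list of strings; the helper name and the sample list are hypothetical, and the pattern deliberately does not cover mitochondrial ORFs.

```python
import re

# Pattern for nuclear yeast systematic ORF names: Y, chromosome letter A-P,
# arm L or R, three digits, strand W or C, optional suffix such as -A.
SYSTEMATIC_NAME = re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$")


def invalid_systematic_names(names):
    """Return the entries that do not look like systematic ORF names."""
    return [n for n in names if not SYSTEMATIC_NAME.match(str(n).strip())]


# Hypothetical identifier column. "1-Oct" mimics a name Excel turned into a
# date; "CIN5" is a standard name rather than a systematic one.
ids = ["YKL134C", "YAL001C", "1-Oct", "CIN5"]
print(invalid_systematic_names(ids))  # ['1-Oct', 'CIN5']
```

Anything the check flags still needs to be resolved by hand against the source data; the pattern only says that an entry is not in systematic form.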

bklein7 commented 8 years ago

I have begun generating the input sheets for the dHAP4, dGLN3, and dZAP1 families of networks. The current versions of these input sheets can be found in the Dahlquist Lab Repository. My progress is summarized below:

- dHAP4: all sheets except for "dhap4_log2_expression" and "optimization_parameters" have been carefully generated.
- dGLN3: the "network" sheet has been generated and approved.
- dZAP1: the "network" sheet has been generated, but approval was not received. This network requires examination for validity before proceeding.

kdahlquist commented 8 years ago

I talked with @bklein7 about how to format the input sheets, recommending the following:

using the wiki here: https://github.com/kdahlquist/GRNmap/wiki/How-to-format-the-input-file-for-GRNmap-v1.4-and-above
using the following settings for the optimization_parameters sheet:
- fixb=0
- fixP=0
- estimate_params=1
- make_graphs=1
- for the simulation times, use 5 minute increments up to 60 minutes
do a database query to pull out the expression data instead of doing it by hand
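
As a rough illustration of the recommended settings, the values could be assembled like this before being written to the optimization_parameters sheet. The parameter names are the ones listed above; the start time of 0, the one-parameter-per-row layout, and the `simulation_timepoints` row label are assumptions made for the sketch, so check them against the wiki page before relying on them.

```python
# Parameter names and values as listed in the comment above.
optimization_parameters = {
    "fixb": 0,
    "fixP": 0,
    "estimate_params": 1,
    "make_graphs": 1,
}

# Assumption: simulation times start at 0. The comment only says
# "5 minute increments up to 60 minutes".
simulation_times = list(range(0, 61, 5))

# Assumption: one parameter per row, name in the first column, value(s) after it.
rows = [[name, value] for name, value in optimization_parameters.items()]
rows.append(["simulation_timepoints", *simulation_times])  # hypothetical row label

for r in rows:
    print(r)
```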

bklein7 commented 8 years ago

Completed input sheets for the dHAP4, dGLN3, and dZAP1 families of networks are now available in the Dahlquist Lab Repository. Upon finishing these input sheets, I did run into some lingering questions:

On the "optimization_parameters" sheet, should we input the values for parameters "alpha" (A1) through "TolX" (A6) from the OWW Microarray Data Analysis Workflow or from the GitHub Test Files? Further, should we include a row of column headers in this sheet, as is displayed in the GitHub Test Files? Finally, was I correct in inputting the value for "alpha" as 0.002 as per #172 ?
For the expression data sheets, the wild type strain exhibits 5 instances of the 30 min. time point, whereas all the dHAP4, dGLN3, and dZAP1 strains exhibit 4 instances of the 30 min. time point. Is it necessary to delete one of the wild type 30 min. time points so that the separate strains have symmetric time point data?
This may seem rather minor, but I am curious if there is a preferred font style/size we should be using?

kdahlquist commented 8 years ago

I'll answer number 2 from your previous comment: leave all 5 instances there; there is no need to have the same number of replicates per timepoint.

For number 3, I don't think there is a preferred font/size. I do think it looks better if it is consistent from sheet to sheet. Microsoft changed the default from Arial 10 to something else (Calibri?) with a recent version change; I usually make my workbooks Arial 10, but I think it's better to just use something consistent.

For number 1, I think it's worth reviewing in the meeting. Alpha = 0.002 should be right for these networks because that is what we determined from the L-curve analysis last semester.

kdahlquist commented 8 years ago

See issue #119 for a screenshot of what that should look like. Will @bklein7 make sure our documentation conforms to this?

kdahlquist commented 8 years ago

@kdahlquist needs to review the dZAP1 network.

kdahlquist commented 8 years ago

@bklein7 and @Nwilli31 will swap when ready for cross-check.

kdahlquist commented 8 years ago

I have worked on this issue and here are my notes:

I completely regenerated the dZAP1 network. I have put the new network in the file 16-genes_27-edges_BK-KD-dZAP1-fam_Sigmoid_estimation.xlsx. I pasted the network into the "network" and "network_weights" sheets, and pasted the list of genes into the other worksheets, but did not put in the data. This network has 16 genes and 27 edges; the next gene to delete would have been MSN4 and it seemed arbitrary to delete it and keep MSN2, so I kept both.
I edited the file GRN_Gene_Lists.xlsx so that what is listed in there matches what is in the input workbooks for each strain. There was a discrepancy between this file and the wt input workbook, so I changed it to make it match the input workbook. I also copied and re-pasted from the other input workbooks (I didn't specifically check for problems). I also made a new worksheet that has all the strain gene lists next to each other for comparison.
I noticed that the input workbooks are missing some expression data. They need to have the expression data from wt, dCIN5, dGLN3, dHAP4, dHMO1, and dZAP1 if those genes are present in the network for that input workbook, not just the wt and the particular strain from which the network was derived. Most will have all 6 strains, unless one of the strain/genes does not appear in the network.
I don't know why the filenames say "no-strains-added" because we did add the deletion strains to these families. Does this mean something else? I'm changing the filenames to get rid of this.

So, there's a little more work to do to get the expression data for the rest of the strains, I'm afraid.
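
The rule for which expression sheets a workbook needs can be written down as a small check. This sketch takes the strain list from the opening comment (wt, dCIN5, dGLN3, dHAP4, dHMO1, dZAP1) and assumes sheet names follow the "dhap4_log2_expression" pattern mentioned earlier in the thread; the "wt_log2_expression" name, the function name, and the sample gene list are assumptions for illustration.

```python
# Deletion strains named in this issue, mapped to the gene each one deletes.
STRAIN_DELETED_GENE = {
    "dCIN5": "CIN5",
    "dGLN3": "GLN3",
    "dHAP4": "HAP4",
    "dHMO1": "HMO1",
    "dZAP1": "ZAP1",
}


def required_expression_sheets(network_genes):
    """List the expression sheets a workbook should contain for this network."""
    genes = {g.upper() for g in network_genes}
    sheets = ["wt_log2_expression"]  # assumed name; wt data is always included
    for strain, gene in STRAIN_DELETED_GENE.items():
        # A deletion strain's data is needed only if its gene is in the network.
        if gene in genes:
            sheets.append(f"{strain.lower()}_log2_expression")
    return sheets


# Hypothetical gene list, not one of the actual networks.
example_genes = ["CIN5", "GLN3", "HAP4", "MSN2", "ZAP1"]
print(required_expression_sheets(example_genes))
```

In this example HMO1 is not in the gene list, so no dHMO1 sheet is expected; comparing the returned list with the workbook's actual sheet names shows which strains still need data.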

bklein7 commented 8 years ago

I have completed the regenerated dZAP1 network input sheet and added all missing expression data to the dHAP4 family input sheet. These updated input sheets have been uploaded to the Dahlquist Lab Repository. I have yet to add the missing expression data to the dGLN3 family input sheet or update the input sheet creation protocol. These tasks will be completed during the week of 10/17.

While cross-checking input sheets this week, I misadvised @Nwilli31 to delete expression data for strains other than wt or the particular strain from which the network was derived. She should have the missing data available in previous versions of the input sheets for wt and dCIN5.

Nwilli31 commented 8 years ago

I've re-uploaded the input sheets with the newly calculated degradation rates and the additional strains' expression data.

kdahlquist commented 8 years ago

Besides completing these workbooks, please make the unweighted GRNsight graphs for each, laying them out on one consistent grid. You will also do this after the first model runs from #265.

bklein7 commented 8 years ago

I have uploaded a new version of the dGLN3 family input sheet that includes the proper expression data to the Dahlquist Lab Repository. Thus, all five input sheets now include the previously missing expression data.

I briefly looked over @Nwilli31's input sheets and noticed her wt input sheet is missing expression data for dHMO1. Also, the gene names in the individual worksheet tabs should not be capitalized in the wt input sheet.

kdahlquist commented 7 years ago

I finally had a chance to re-create the dCIN5-family network. I am attaching a workbook that has the entire family of networks. Of interest are the last two sheets, the 17-gene_32-edge and 14-gene_25-edge networks. It turns out that if you remove MCM1 from the 17-gene network, you also lose ACE2 and ZAP1 (the disconnected ones from before). I'm a little torn between using the 17- or 14-gene network; maybe we can run both? There are other minor differences between this and what Kayla had, so double-check everything when constructing the new network(s).
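
To see this kind of knock-on effect before committing to a network, one option is to remove a gene from the edge list and report which other genes are left with no edges at all. The sketch below assumes edges are stored as (regulator, target) pairs; the function name and the toy edge list use placeholder gene names and are invented for illustration, not taken from the attached workbook.

```python
def remove_gene(edges, gene):
    """Drop a gene and its edges, then report genes left with no edges at all."""
    before = {g for edge in edges for g in edge}
    kept_edges = [(src, dst) for src, dst in edges if gene not in (src, dst)]
    after = {g for edge in kept_edges for g in edge}
    # Genes that were connected before but appear in no remaining edge.
    orphaned = sorted(before - after - {gene})
    return kept_edges, orphaned


# Toy edge list with placeholder names; NOT the real dCIN5-family network.
toy_edges = [
    ("HUB", "LEAF1"),
    ("LEAF2", "HUB"),
    ("G1", "G2"),
    ("G2", "G1"),
]
kept, orphaned = remove_gene(toy_edges, "HUB")
print(kept)      # [('G1', 'G2'), ('G2', 'G1')]
print(orphaned)  # ['LEAF1', 'LEAF2']
```

Running the same check on the 17-gene edge list would show whether removing one gene forces others out of the network, which is the situation described here for MCM1.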

The file with the gene lists for the various networks will also need to be updated based on this.

dCIN4_network_family_20161114_KD.xlsx

kdahlquist commented 7 years ago

@bklein7 has vetted the dCIN5 input workbooks, so this is closable.

kdahlquist / GRNmap

Generate and vet input workbooks for 5 database-derived networks #245