Closed dhimmel closed 4 years ago
Here are the current Zenodo files:
Below is a draft description. @zietzm and @kkloste: would love your review & feedback on this documentation:
Hetionet v1.0 is a hetnet (heterogeneous network) with 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. This record contains computed path counts for Hetionet v1.0 for all metapaths (types of paths) up to length 3. Three types of data are included:
Data Format: the .zip files are HetMat archive files. This simply means that the directory structure and file formats of the archived files conform to the HetMat data structure for storing hetnets on disk. Matrices are stored as scipy.sparse .npz files. .npz is a numpy array serialization format that scipy uses to write sparse matrices to disk.
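For readers unfamiliar with the format, here is a minimal sketch of writing and reading a scipy.sparse .npz file. The matrix values and the file name are made up for illustration and do not reflect the actual layout or naming inside the HetMat archives.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Toy 3 x 4 matrix standing in for a path-count matrix (values are made up).
dense = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
])
matrix = scipy.sparse.csc_matrix(dense)

with tempfile.TemporaryDirectory() as directory:
    # Hypothetical file name; real HetMat archives define their own layout.
    path = os.path.join(directory, "example.sparse.npz")
    scipy.sparse.save_npz(path, matrix)   # write the sparse matrix to disk
    loaded = scipy.sparse.load_npz(path)  # read it back as a sparse matrix

nonzero_count = loaded.nnz
round_trip_ok = bool((loaded.toarray() == dense).all())
```

Only the nonzero entries are stored, which is why this format suits large, mostly empty matrices.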
TSV files in this upload report information on the contents of the archives. The .zip-info.tsv files contain a list of all files included in the zip archives. metapath-dwpc-stats.tsv contains summary information on the unpermuted path counts and DWPCs. Note that results are archived by path length, such that all metapaths of length 1 are in a different archive than metapaths of length 2. Therefore, users who only need results for shorter metapaths do not need to download the large archives for longer metapaths. There are 24 metapaths of length 1, 242 metapaths of length 2, and 1939 metapaths of length 3.
Source code: These datasets were computed by the bulk.ipynb notebook from greenelab/hetmech@34e95b9.
This record contains computed path counts
Since path counts are just one of the features included, perhaps the sentence could more precisely read, "This record contains the computed connectivity metrics along all metapaths (types of paths) up to length 3."
serialization
This may just be my own lack of understanding, but I'm not familiar with this term. You mean the data format?
DGP summaries provide summary statistics of DWPCs computed on permuted hetnets.
I think this sentence is helpful. Could we add one before it that gives a higher-level description of DGP? For example, the following or something similar:
"Degree-grouped permutations (DGP) are used to compute the significance of DWPC values. Specifically, they are used to estimate a null distribution for each of the hetnet's DWPC values."
Permuted DWPCs are scaled by dividing by the unpermuted DPWC mean and then inverse-hyperbolic sine transformed.
Does this mean we are storing DWPCs in raw form, i.e. not arcsinh-transformed? Also, I noticed it's DPWC instead of DWPC there.
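To make the scaling step concrete, here is a minimal sketch of the transformation as the quoted sentence describes it: divide each permuted DWPC by the mean of the unpermuted DWPCs, then apply the inverse hyperbolic sine. The values are made up, and which DWPCs enter the mean (here, all unpermuted values for one metapath) is an assumption rather than something stated in this thread.

```python
import math
import statistics

# Made-up DWPC values; not taken from the Hetionet dataset.
unpermuted_dwpcs = [0.0, 0.5, 1.0, 2.5]
permuted_dwpcs = [0.0, 0.4, 2.0]

# Assumption: the scaling factor is the mean over all unpermuted DWPCs.
unpermuted_mean = statistics.mean(unpermuted_dwpcs)

# Scale by the unpermuted mean, then apply the inverse hyperbolic sine.
transformed = [math.asinh(value / unpermuted_mean) for value in permuted_dwpcs]
```

A zero DWPC stays zero under this transformation, since arcsinh(0) = 0.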
For each group of DWPCs (grouped across permutated hetnets and degree pairs), summary statistics were computed for the distribution.
Two things: 1) I think you mean permuted, and 2) I wish this sentence were a bit clearer. The indexing of interest in these files is by degree pair, while we aggregate across permuted hetnets. Could we say the following or something similar:
"Every degree pair along a given metapath has corresponding statistics that summarize its values across permuted hetnets."
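As an illustration of that grouping, here is a minimal sketch that pools values by degree pair while aggregating across permuted hetnets. The record layout and the particular statistics (count, nonzero count, mean, standard deviation) are hypothetical choices for this example, not the columns of the published DGP summary files.

```python
from collections import defaultdict
import statistics

# Hypothetical records: (permutation id, source degree, target degree, DWPC).
records = [
    (0, 1, 2, 0.10),
    (0, 1, 2, 0.30),
    (1, 1, 2, 0.20),
    (0, 3, 1, 0.00),
    (1, 3, 1, 0.40),
]

# Pool values by degree pair, ignoring which permuted hetnet they came from.
groups = defaultdict(list)
for _permutation, source_degree, target_degree, dwpc in records:
    groups[(source_degree, target_degree)].append(dwpc)

# One row of summary statistics per degree pair.
summary = {
    degree_pair: {
        "n": len(values),
        "nonzero": sum(value != 0 for value in values),
        "mean": statistics.mean(values),
        "sd": statistics.pstdev(values),
    }
    for degree_pair, values in groups.items()
}
```

The permutation id is deliberately dropped from the grouping key, so each degree pair ends up with a single summary covering all permuted hetnets.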
Otherwise no comments! Looks great!
Unrelated note: Do we use/have a JSON schema to validate metagraph.json files?
No. That could be a good way to codify our JSON metagraph and graph serialization formats.
@zietzm I applied your suggestions and rephrasing.
I am thinking of the following author list for the Zenodo dataset release:
Tagging @zietzm @kkloste @naglem @bdsullivan @cgreene: if you approve, give this comment a thumbs up. If there is any issue, comment here or email me at daniel.himmelstein@gmail.com. This authorship is just for the uploaded datasets, so I only included individuals who were instrumental in the implementation of or discussion around the methods to produce the data.
@naglem can you let us know your ORCID? It's easy to sign up if you don't have one.
I added the following section to the Zenodo description:
Funding: This work was supported through a research collaboration with Pfizer Worldwide Research and Development. This work is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grants GBMF4552 and GBMF4560.
Also let me know if there are any requested changes to this funding statement.
@dhimmel thanks, sorry for not seeing this earlier. My ORCID is https://orcid.org/0000-0002-4677-7582. The Zenodo description sounds fine.
Thanks everyone for approving!
Dataset is published at https://zenodo.org/record/1435834 / https://doi.org/10.5281/zenodo.1435834
We've now completed computing DWPCs and their corresponding DGP null distributions for all metapaths up to length 3 in Hetionet v1.0. We're working on building a database with this information in https://github.com/greenelab/hetmech-backend, so now is a good time to think about archival locations for our datasets.
HetMat-formatted hetnets
We invented the HetMat format for storing hetnets as on-disk matrices. I opened a PR to add these to the Hetionet GitHub repo at https://github.com/hetio/hetionet/pull/11. This repo is where we've stored Hetionet in the past, so it's the obvious place to put additional network files.
DWPC files
The DWPC files created in https://github.com/greenelab/hetmech/pull/142 are large (slightly under 200 GB for all of them). When we contacted Zenodo, they agreed to increase our quota for this upload on the condition that we cite it in a peer-reviewed publication. I will post a draft of the Zenodo upload in the next comment.