greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License

Archival deposits for HetMat archives and bulk DWPCs #148

Closed dhimmel closed 4 years ago

dhimmel commented 5 years ago

We've now completed computing DWPCs and their corresponding DGP null distributions for all metapaths up to length 3 in Hetionet v1.0. We're working on building a database with this information in https://github.com/greenelab/hetmech-backend, so now is a good time to think about archival locations for our datasets.

HetMat-formatted hetnets

We invented the hetmat format for storing hetnets as on-disk matrices. I opened a PR to add these to the Hetionet GitHub repo at https://github.com/hetio/hetionet/pull/11. This repo is where we've stored Hetionet in the past, so it's the obvious place to put additional network files.

DWPC files

The DWPC files created in https://github.com/greenelab/hetmech/pull/142 are large (slightly under 200 GB in total). I contacted Zenodo, and they were willing to increase our quota for this upload under the condition that we cite it in a peer-reviewed publication. I will post a draft of the Zenodo upload in the next comment.

dhimmel commented 5 years ago

Zenodo upload

Here are the current Zenodo files:

zenodo-files

Below is a draft description. @zietzm and @kkloste: would love your review & feedback on this documentation:


Hetionet v1.0 is a hetnet (heterogeneous network) with 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. This record contains computed path counts for Hetionet v1.0 for all metapaths (types of paths) up to length 3. Three types of data are included:

  1. Path counts: Path counts measure the number of paths from a source node to a target node along a specified metapath. The path count is a special case of the degree-weighted path count (DWPC) metric where the damping exponent parameter is set to 0.0. Path counts for all source–target node combinations of a given metapath are stored in a matrix with source nodes as rows and target nodes as columns.
  2. Degree-weighted path counts: DWPCs measure the abundance of paths from a source to target node along a given metapath (like path counts), but are adjusted for the degrees along the path such that paths through higher degree nodes are downweighted according to a damping parameter. The DWPCs here use a damping exponent of 0.5 and the same matrix serialization as the path count datasets.
  3. Degree-grouped permutation summaries: DGP summaries provide summary statistics of DWPCs computed on permuted hetnets. The permuted hetnets are derived from Hetionet v1.0 using the XSwap algorithm. This approach preserves node degree but randomizes edges to muddle their meaning. DWPCs were computed for 200 permuted networks and grouped by source–target node degree within each metapath. Permuted DWPCs are scaled by dividing by the unpermuted DPWC mean and then inverse-hyperbolic sine transformed. For each group of DWPCs (grouped across permutated hetnets and degree pairs), summary statistics were computed for the distribution. These statistics include the number of observed DWPCs, the number of nonzero DWPCs, the sum of the DWPCs, and the sum of squared DWPCs. These values are sufficient to calculate the parameters of a gamma-hurdle null DWPC distribution.
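To make the relationship between path counts and DWPCs concrete, here is a minimal sketch in NumPy. It chains degree-weighted adjacency matrices along a metapath; with a damping exponent of 0.0 it reduces to plain path counts. This is a simplification of the actual hetmech implementation, which additionally corrects for paths that revisit a node, and the function names are illustrative.

```python
import numpy as np

def degree_weight(adj, damping):
    """Scale each edge (i, j) by (d_i * d_j) ** -damping, where d_i is the
    out-degree of i and d_j is the in-degree of j for this metaedge.
    With damping = 0.0, the matrix is unchanged."""
    row_deg = adj.sum(axis=1, keepdims=True)  # source-node degrees
    col_deg = adj.sum(axis=0, keepdims=True)  # target-node degrees
    with np.errstate(divide="ignore"):
        row_w = np.where(row_deg > 0, row_deg ** -damping, 0.0)
        col_w = np.where(col_deg > 0, col_deg ** -damping, 0.0)
    return adj * row_w * col_w

def dwpc_walk(adjacencies, damping=0.5):
    """Chain degree-weighted adjacency matrices along a metapath.

    Simplified sketch: unlike real DWPC, this walk-based product does not
    exclude paths that revisit a node."""
    result = degree_weight(adjacencies[0], damping)
    for adj in adjacencies[1:]:
        result = result @ degree_weight(adj, damping)
    return result
```

Setting `damping=0.0` recovers the path-count matrix (rows are source nodes, columns are target nodes), matching the description of path counts as a special case of DWPC.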

Data Format: the .zip files are HetMat archive files. This simply means that the directory structure and file formats of the archived files conform to the HetMat data structure for storing hetnets on disk. Matrices are stored as scipy.sparse .npz files. .npz is a numpy array serialization format that scipy uses to write sparse matrices to disk.

TSV files in this upload report information on the contents of the archives. The .zip-info.tsv files contain a list of all files included in the zip archives. metapath-dwpc-stats.tsv contains summary information on the unpermuted path counts and DWPCs. Note that results are archived by path length, such that all metapaths of length 1 are in a different archive than metapaths of length 2. Therefore, users who only need results for shorter metapaths do not need to download the large archives for longer metapaths. There are 24 metapaths of length 1, 242 metapaths of length 2, and 1,939 metapaths of length 3.
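A manifest like a .zip-info.tsv can be reproduced from any of the archives with the standard library. The sketch below builds a small zip in memory and lists its members; the member names are illustrative, not actual archive paths.

```python
import io
import zipfile

# Build a small zip archive in memory and list its members, mirroring the
# kind of file listing that a .zip-info.tsv manifest records for each
# HetMat archive. Member names below are illustrative.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("metagraph.json", "{}")
    zf.writestr("path-counts/dwpc-0.5/CbG.sparse.npz", b"placeholder")

with zipfile.ZipFile(buffer) as zf:
    members = zf.namelist()
```

For a downloaded archive, the same `zipfile.ZipFile(...).namelist()` call lists the contents without extracting, which is useful for confirming a download against its .zip-info.tsv.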

Source code: These datasets were computed by the bulk.ipynb notebook from greenelab/hetmech@34e95b9.

zietzm commented 5 years ago

This record contains computed path counts

Since path counts are just one of the features included, perhaps the sentence could more precisely read, "This record contains the computed connectivity metrics along all metapaths (types of paths) up to length 3."

serialization

This may just be my own lack of understanding, but I'm not familiar with this term. You mean the data format?

DGP summaries provide summary statistics of DWPCs computed on permuted hetnets.

I think this sentence is helpful. Could we add one before it that gives a higher-level description of DGP? For example, the following or something similar:

"Degree-grouped permutations (DGP) are used to compute the significance of DWPC values. Specifically, they are used to estimate a null distribution for each of the hetnet's DWPC values."

Permuted DWPCs are scaled by dividing by the unpermuted DPWC mean and then inverse-hyperbolic sine transformed.

Does this mean we are storing DWPCs in raw form, i.e. not arcsinh scaled? Also, I noticed it's DPWC instead of DWPC there.

For each group of DWPCs (grouped across permutated hetnets and degree pairs), summary statistics were computed for the distribution.

Two things: 1) I think you mean permuted. 2) I wish this sentence were a bit clearer. The indexing of interest in these files is by degree pair, while we aggregate across permuted hetnets. Could we say the following or something similar:

"Every degree pair along a given metapath has corresponding statistics that summarize its values across permuted hetnets."

Otherwise no comments! Looks great!

Unrelated note: Do we use/have a JSON schema to validate metagraph.json files?

dhimmel commented 5 years ago

Unrelated note: Do we use/have a JSON schema to validate metagraph.json files?

No. That could be a good way to codify our JSON metagraph and graph serialization formats.
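Until a formal JSON Schema exists, a lightweight structural check is possible with the standard library alone. The key names below loosely follow the hetio metagraph serialization but are assumptions for illustration, not a definitive spec.

```python
import json

# A sketch of the structural checks a metagraph.json schema could codify.
# The key names follow the hetio metagraph serialization loosely and are
# assumptions, not a definitive spec.
metagraph_doc = json.loads("""
{
  "metanode_kinds": ["Compound", "Gene"],
  "metaedge_tuples": [["Compound", "Gene", "binds", "both"]],
  "kind_to_abbrev": {"Compound": "C", "Gene": "G", "binds": "b"}
}
""")

def check_metagraph(doc):
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    for key in ("metanode_kinds", "metaedge_tuples", "kind_to_abbrev"):
        if key not in doc:
            problems.append(f"missing key: {key}")
    for tup in doc.get("metaedge_tuples", []):
        if len(tup) != 4:
            problems.append(f"metaedge tuple should have 4 fields: {tup}")
    return problems
```

A real JSON Schema would express the same constraints declaratively (required keys, array item shapes) and could be validated with an off-the-shelf validator.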

dhimmel commented 5 years ago

@zietzm I applied your suggestions and rephrasing.

I am thinking of the following author list for the Zenodo dataset release:

zenodo-authorship

Tagging @zietzm @kkloste @naglem @bdsullivan @cgreene: if you approve, give this comment a thumbs up. If there is any issue, comment here or email me at daniel.himmelstein@gmail.com. This authorship is just for the uploaded datasets, so I only included individuals who were instrumental in the implementation of or discussion around the methods to produce the data.

@naglem can you let us know your ORCID? It's easy to sign up if you don't have one.

I added the following section to the Zenodo description:

Funding: This work was supported through a research collaboration with Pfizer Worldwide Research and Development. This work is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grants GBMF4552 and GBMF4560.

Also let me know if there are any requested changes to this funding statement.

naglem commented 5 years ago

@dhimmel thanks, sorry for not seeing this earlier. My ORCID is https://orcid.org/0000-0002-4677-7582. The Zenodo description sounds fine.

dhimmel commented 5 years ago

Thanks everyone for approving!


Dataset is published at https://zenodo.org/record/1435834 / https://doi.org/10.5281/zenodo.1435834