LucaCappelletti94 closed this issue 2 years ago.
Hi Luca. The data provided by all the various formats is Hetionet version 1.0. So while some of the formats are newer, the actual data in them corresponds to the data that is in the TSV files from 2018-11-02.
Perfect, can I consider the hosting of these files here on GitHub a good permanent URL for FAIR hosting purposes? Is there some other preferred hosting?
I am the author of an open source graph processing library (which I am not linking as its README page currently needs significant revamping), and one of its features is that I try to provide users graphs from the literature in a single line of Python, hence the question.
BTW, I will be sending you a report on the graph soon (another of the perks of the library is an extensive human readable report with all the weird topological stuff that tends to sneak into most graphs in the literature).
Here follows the automatically generated report for HetioNet. For now, I have created it for the undirected version of the graph. Hopefully, it may contain helpful information.
The undirected multigraph HetioNet has 47.03K heterogeneous nodes and 2.25M heterogeneous edges. The graph contains 1.87K connected components (of which 1.87K are disconnected nodes), with the largest one containing 45.16K nodes and the smallest one containing a single node. The RAM requirements for the nodes and edges data structures are 6.04MB and 6.46MB, respectively.
The minimum node degree is 0, the maximum node degree is 23.88K, the mode degree is 2, the mean degree is 95.62 and the node degree median is 28.
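For readers who want to reproduce this kind of summary on their own edge list, here is a minimal sketch in plain Python. It is not the code that produced the report: the function name `degree_summary` and the toy graph are made up for illustration, and it assumes each parallel edge counts once towards the degree and a self-loop counts twice.

```python
from collections import Counter
from statistics import mean, median, multimode


def degree_summary(nodes, edges):
    """Return min, max, mode, mean and median degree of an undirected multigraph."""
    # Start every node at zero so disconnected nodes are included in the statistics.
    degree = Counter({node: 0 for node in nodes})
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1  # assumption: a self-loop contributes 2 to its node
    values = sorted(degree.values())
    return {
        "min": values[0],
        "max": values[-1],
        "mode": min(multimode(values)),  # smallest value when several modes tie
        "mean": mean(values),
        "median": median(values),
    }


# Hypothetical toy graph, not taken from Hetionet.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
summary = degree_summary(toy_nodes, toy_edges)
```

On the toy graph the degrees are 3, 2, 2, 1 and 0, so the node `e` plays the role of a disconnected node with degree 0.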
The nodes with the highest degree centrality are Anatomy::UBERON:0000473 (degree 23.88K and node type Anatomy), Anatomy::UBERON:0000955 (degree 23.63K and node type Anatomy), Anatomy::UBERON:0002369 (degree 23.12K and node type Anatomy), Anatomy::UBERON:0000178 (degree 20.79K and node type Anatomy) and Anatomy::UBERON:0000948 (degree 19.70K and node type Anatomy).
The graph has 11 node types, of which the 10 most common are Gene (20.95K nodes, 44.53%), Biological Process (11.38K nodes, 24.20%), Side Effect (5.73K nodes, 12.19%), Molecular Function (2.88K nodes, 6.13%), Pathway (1.82K nodes, 3.87%), Compound (1.55K nodes, 3.30%), Cellular Component (1.39K nodes, 2.96%), Symptom (438 nodes, 0.93%), Anatomy (402 nodes, 0.85%) and Pharmacologic Class (345 nodes, 0.73%). The RAM requirement for the node types data structure is 3.01MB.
The graph has 24 edge types, of which the 10 most common are GpBP (1.12M edges, 24.88%), AeG (1.05M edges, 23.41%), Gr>G (527.97K edges, 11.74%), GiG (294.33K edges, 6.54%), CcSE (277.89K edges, 6.18%), AdG (204.48K edges, 4.55%), AuG (195.70K edges, 4.35%), GpMF (194.44K edges, 4.32%), GpPW (168.74K edges, 3.75%) and GpCC (147.13K edges, 3.27%). The RAM requirement for the edge types data structure is 35.98MB.
A topological oddity is a set of nodes in the graph that may be derived by an error during the generation of the edge list of the graph and, depending on the task, could bias the results of topology-based models. In the following paragraph, we will describe the detected topological oddities.
A singleton node is a node disconnected from all other nodes. We have detected 1.87K singleton nodes in the graph, involving a total of 1.87K nodes (3.98%). The detected singleton nodes are:
Gene::100128430 (node type Gene)
Gene::100129171 (node type Gene)
Gene::100129763 (node type Gene)
Gene::100130370 (node type Gene)
Gene::100130927 (node type Gene)
Gene::100131581 (node type Gene)
Gene::100133301 (node type Gene)
Gene::100287477 (node type Gene)
Gene::100288960 (node type Gene)
Gene::100505573 (node type Gene)
Gene::100528030 (node type Gene)
Gene::100996648 (node type Gene)
Gene::101060445 (node type Gene)
Gene::101927147 (node type Gene)
Gene::101927685 (node type Gene)
And other 1.86K singleton nodes.
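A check like the one described above can be sketched in a few lines. This is only an illustration under stated assumptions: `find_singletons` is a hypothetical helper, the identifiers are invented, and a node whose only edges are self-loops is treated as a singleton, which may differ from what the report does.

```python
def find_singletons(nodes, edges):
    """Return nodes that share no edge with any other node."""
    connected = set()
    for src, dst in edges:
        if src != dst:  # assumption: self-loops do not connect a node to the graph
            connected.update((src, dst))
    return sorted(node for node in nodes if node not in connected)


# Hypothetical identifiers for illustration only.
toy_nodes = ["Gene::1", "Gene::2", "Gene::3", "Compound::X"]
toy_edges = [("Gene::1", "Compound::X"), ("Gene::3", "Gene::3")]
singletons = find_singletons(toy_nodes, toy_edges)
```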
Isomorphic groups are nodes with exactly the same neighbours and node types (if present in the graph). Nodes in such groups are topologically indistinguishable, that is swapping their ID would not change the graph topology. We have detected 729 isomorphic node groups in the graph, involving a total of 1.70K nodes (3.60%) and 57.53K edges (1.28%), with the largest one involving 54 nodes and 1.72K edges. The detected isomorphic node groups, sorted by decreasing size, are:
Group with 2 nodes (degree 860 and node type Biological Process): Biological Process::GO:0048870 and Biological Process::GO:0051674.
Group with 2 nodes (degree 847 and node type Biological Process): Biological Process::GO:0043207 and Biological Process::GO:0051707.
Group with 2 nodes (degree 838 and node type Biological Process): Biological Process::GO:0072359 and Biological Process::GO:0072358.
Group with 3 nodes (degree 531 and node type Biological Process): Biological Process::GO:0099537, Biological Process::GO:0099536 and Biological Process::GO:0007268.
Group with 2 nodes (degree 795 and node type Biological Process): Biological Process::GO:0006935 and Biological Process::GO:0042330.
Group with 2 nodes (degree 744 and node type Biological Process): Biological Process::GO:0044419 and Biological Process::GO:0044403.
Group with 2 nodes (degree 700 and node type Cellular Component): Cellular Component::GO:1990904 and Cellular Component::GO:0030529.
Group with 2 nodes (degree 658 and node type Cellular Component): Cellular Component::GO:0099513 and Cellular Component::GO:0099512.
Group with 2 nodes (degree 544 and node type Biological Process): Biological Process::GO:0010563 and Biological Process::GO:0045936.
Group with 2 nodes (degree 516 and node type Cellular Component): Cellular Component::GO:0005764 and Cellular Component::GO:0000323.
Group with 2 nodes (degree 446 and node type Molecular Function): Molecular Function::GO:0015267 and Molecular Function::GO:0022803.
Group with 2 nodes (degree 405 and node type Biological Process): Biological Process::GO:0043413 and Biological Process::GO:0006486.
Group with 2 nodes (degree 378 and node type Biological Process): Biological Process::GO:0050911 and Molecular Function::GO:0004984.
Group with 54 nodes (degree 14 and node type Gene): Gene::81061, Gene::81392, Gene::390199, Gene::254973, Gene::26534 and other 49.
Group with 2 nodes (degree 323 and node type Gene): Gene::8360 and Gene::8367.
And other 714 isomorphic node groups.
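The definition above (same neighbours and same node type) suggests a simple grouping strategy, sketched below. This is an assumption about how such groups could be found, not the library's actual algorithm: `isomorphic_groups` is a hypothetical name, edge types and edge multiplicity are ignored, and singleton nodes are skipped so they do not all end up in one group.

```python
from collections import defaultdict


def isomorphic_groups(node_types, edges):
    """Group nodes that have the same node type and the same set of neighbours."""
    neighbours = {node: set() for node in node_types}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    groups = defaultdict(list)
    for node, adjacent in neighbours.items():
        if adjacent:  # skip singleton nodes
            groups[(node_types[node], frozenset(adjacent))].append(node)

    # Keep only groups with at least two members, largest first.
    found = [sorted(group) for group in groups.values() if len(group) > 1]
    return sorted(found, key=len, reverse=True)


# Hypothetical toy data: bp1 and bp2 share type and neighbours, mf1 does not.
toy_types = {"bp1": "BP", "bp2": "BP", "mf1": "MF", "g1": "Gene", "g2": "Gene"}
toy_edges = [("bp1", "g1"), ("bp1", "g2"), ("bp2", "g1"), ("bp2", "g2"), ("mf1", "g1")]
groups = isomorphic_groups(toy_types, toy_edges)
```

Hashing the pair (node type, frozen neighbour set) keeps the pass linear in the number of edges, at the cost of holding every neighbour set in memory.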
A dendritic star is a dendritic tree with a maximal depth of one, where nodes with maximal unique degree one are connected to a central root node with high degree and inside a strongly connected component. We have detected 366 dendritic stars in the graph, involving a total of 1.83K nodes (3.89%) and 1.83K edges (0.04%), with the largest one involving 45 nodes and 45 edges. The detected dendritic stars, sorted by decreasing size, are:
Dendritic star starting from the root node Compound::DB00882 (degree 399), and containing 45 nodes, with a maximal depth of 1, which are Side Effect::C0003492, Side Effect::C0023448, Side Effect::C0345907, Side Effect::C1696704 and Side Effect::C0033375. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Dendritic star starting from the root node Compound::DB01238 (degree 884), and containing 41 nodes, with a maximal depth of 1, which are Side Effect::C1504561, Side Effect::C1141884, Side Effect::C1536916, Side Effect::C0856050 and Side Effect::C0854171. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Dendritic star starting from the root node Compound::DB00188 (degree 1.14K), and containing 31 nodes, with a maximal depth of 1, which are Side Effect::C0024904, Side Effect::C0034544, Side Effect::C0221030, Side Effect::C0334634 and Side Effect::C0023860. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Dendritic star starting from the root node Compound::DB00193 (degree 663), and containing 29 nodes, with a maximal depth of 1, which are Side Effect::C0856100, Side Effect::C0549635, Side Effect::C0435632, Side Effect::C0855618 and Side Effect::C1096368. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Dendritic star starting from the root node Compound::DB01590 (degree 929), and containing 25 nodes, with a maximal depth of 1, which are Side Effect::C0151680, Side Effect::C0392175, Side Effect::C0473124, Side Effect::C0860987 and Side Effect::C2609269. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Dendritic star starting from the root node Compound::DB00624 (degree 529), and containing 24 nodes, with a maximal depth of 1, which are Side Effect::C0860851, Side Effect::C3160830, Side Effect::C0349464, Side Effect::C0877497 and Side Effect::C0856152. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
And other 360 dendritic stars.
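A loose reading of the dendritic star definition can be sketched as follows. The helper name `dendritic_stars` and the `min_leaves` threshold are assumptions made for this example, and the sketch does not verify that the root sits inside a strongly connected component, so it should not be taken as the detection logic used for the report.

```python
from collections import defaultdict


def dendritic_stars(edges, min_leaves=2):
    """Map each root to the nodes whose only distinct neighbour is that root."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-loops when counting distinct neighbours
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    stars = defaultdict(list)
    for node, adjacent in neighbours.items():
        if len(adjacent) == 1:
            (root,) = adjacent
            if len(neighbours[root]) > 1:  # skip isolated two-node components
                stars[root].append(node)

    # min_leaves is a hypothetical cut-off separating stars from single leaves.
    return {root: sorted(leaves) for root, leaves in stars.items() if len(leaves) >= min_leaves}


# Hypothetical toy graph: c1 has three leaf neighbours and sits on a triangle.
toy_edges = [
    ("c1", "se1"), ("c1", "se2"), ("c1", "se3"),
    ("c1", "g1"), ("g1", "g2"), ("g2", "c1"),
]
stars = dendritic_stars(toy_edges)
```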
A tendril is a path starting from a node of degree one, connected to a strongly connected component. We have detected 300 tendrils in the graph, involving a total of 302 nodes (0.64%) and 302 edges. The detected tendrils, sorted by decreasing size, are:
Tendril starting from the root node Gene::1576 (degree 641 and node type Gene), and containing 2 nodes, with a maximal depth of 2, which are Compound::DB01256 (node type Compound) and Pharmacologic Class::N0000175433 (node type Pharmacologic Class). Its nodes have 2 node types, which are Pharmacologic Class and Compound. Its edges have 2 edge types, which are PCiC (2 edges) and CbG.
Tendril starting from the root node Compound::DB00766 (degree 28 and node type Compound), and containing 2 nodes, with a maximal depth of 2, which are Pharmacologic Class::N0000000202 (node type Pharmacologic Class) and Compound::DB01606 (node type Compound). Its nodes have 2 node types, which are Compound and Pharmacologic Class. Its edges have a single edge type, which is PCiC.
Tendril starting from the root node Compound::DB00187 (degree 155), and containing a single other node, Side Effect::C0340305. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Tendril starting from the root node Compound::DB00285 (degree 602), and containing a single other node, Side Effect::C0029464. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
Tendril starting from the root node Compound::DB00388 (degree 105), and containing a single other node, Pharmacologic Class::N0000009917. Its nodes have a single node type, which is Pharmacologic Class. Its edges have a single edge type, which is PCiC.
Tendril starting from the root node Compound::DB00594 (degree 105), and containing a single other node, Side Effect::C0476289. Its nodes have a single node type, which is Side Effect. Its edges have a single edge type, which is CcSE.
And other 294 tendrils.
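One way to picture the tendril definition is to walk inward from every degree-one node until a branching node is reached. The sketch below does exactly that under stated assumptions: `tendrils` is a hypothetical helper, degrees are counted over distinct neighbours, and leaves hanging directly off a root are reported as length-one tendrils, so this sketch does not separate them from dendritic stars.

```python
from collections import defaultdict


def tendrils(edges):
    """Map each root to the paths that reach it from a degree-one node."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src != dst:
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    found = {}
    for leaf, adjacent in neighbours.items():
        if len(adjacent) != 1:
            continue
        path, previous, current = [leaf], leaf, next(iter(adjacent))
        # Follow the chain while the current node has exactly two distinct neighbours.
        while len(neighbours[current]) == 2:
            path.append(current)
            previous, current = current, next(
                n for n in neighbours[current] if n != previous
            )
        if len(neighbours[current]) > 2:  # reached a branching root
            found.setdefault(current, []).append(path)
    return found


# Hypothetical toy graph: leaf - mid - root, with root on a triangle.
toy_edges = [("leaf", "mid"), ("mid", "root"), ("root", "a"), ("root", "b"), ("a", "b")]
paths = tendrils(toy_edges)
```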
> can I consider the hosting of these files here on GitHub a good permanent URL for FAIR hosting purposes
I think so. You could consider versioning the GH URLs with a commit hash for extra safety. You could also note the Zenodo deposit of this repo at https://doi.org/10.5281/zenodo.268568, although that doesn't contain all of the latest formats.
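To make the commit-hash suggestion concrete, here is a small sketch that builds a raw-file URL pinned to a single commit. The helper `pinned_raw_url` and every value passed to it (owner, repository, path, and the all-zero commit) are placeholders, not real locations in this repository.

```python
from urllib.parse import quote


def pinned_raw_url(owner, repo, commit, path):
    """Build a raw.githubusercontent.com URL pinned to one commit."""
    is_hex = all(c in "0123456789abcdef" for c in commit.lower())
    if len(commit) != 40 or not is_hex:
        raise ValueError("expected a full 40-character commit SHA")
    return (
        "https://raw.githubusercontent.com/"
        f"{quote(owner)}/{quote(repo)}/{commit}/{quote(path)}"
    )


# Placeholder values only; substitute the real repository, commit and file path.
url = pinned_raw_url("example-owner", "example-repo", "0" * 40, "data/edges.tsv")
```

Pinning to a full commit SHA rather than a branch name means the URL keeps returning the same bytes even if the default branch is later updated.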
> one of its features is that I try to provide users graphs from the literature in a single line of Python
Cool. By the way, the JSON format is slightly preferred to the TSV format for better metadata. But either is fine!
> Here follows the automatically generated report for HetioNet
Nice! Thanks for sharing this. Have you considered also including node names/labels instead of just identifiers? That would make the report a little easier to interpret.
I cannot use the names columns as node names as it contains duplicate values.
Yeah, that's annoying, but the correct solution is probably to use node identifiers to generate the statistics and then also include the node labels in the auto-generated text. I'm assuming most of the graphs you load have names in addition to identifiers.
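A minimal sketch of that idea: keep identifiers as the unique keys and only look up the name when rendering text. The function `describe_node` and the identifiers and names below are invented for illustration.

```python
def describe_node(node_id, names):
    """Render a node as 'name (identifier)', falling back to the identifier."""
    name = names.get(node_id)
    return f"{name} ({node_id})" if name else node_id


# Invented data: two identifiers deliberately share the same name.
names = {"Example::1": "shared name", "Example::2": "shared name"}
labels = [describe_node(node_id, names) for node_id in ["Example::1", "Example::2", "Example::3"]]
```

Because the identifier is always printed, duplicate names stay distinguishable in the generated text.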
Hello,
While I see that a TSV version is available, it appears to be much older than the most up-to-date release. Could it possibly be updated?
Thank you, Luca