hetio / hetionet

Hetionet: an integrative network of disease
https://neo4j.het.io

Include BEL export of Hetionet #27

Closed cthoyt closed 4 years ago

cthoyt commented 4 years ago

In addition to the four formats with which you're distributing, would you consider supporting Biological Expression Language (BEL)?

If you're not familiar, BEL is another schema for storing causal, correlative, ontological, and associative relationships between biological entities. It's a good middle ground between the wild west of RDF/Ontologies/OBO, where you can define almost anything you want, and having a strict/controlled vocabulary for expression. It has its own domain-specific language, so it's a bit less accessible than JSON, XML, or TSV, but there is a growing number of applications that directly consume BEL.

I'm already writing a converter at https://github.com/pybel/pybel/pull/406, so to make the conversion you'd only have to run the following few lines of code.

from pybel import to_bel_script_gz
from pybel.io.hetionet import get_hetionet

output_path = ...
graph = get_hetionet()
to_bel_script_gz(graph, output_path)

If you might be interested in this, let me know. I'd be happy to take feedback. I will also post an issue or two about problems I've run into while going through the JSON export.

For example, if you think keeping the conversion script somewhere within your organization rather than in PyBEL is a better idea, then I'd also be happy to make it as a PR there.

dhimmel commented 4 years ago

Hey @cthoyt, this is definitely of interest! We could add BEL as a format in this repo. The fact that you'd find it useful probably means others will as well.

I'm not familiar with BEL. So is the idea that the output will be a single file (script) where each row is an edge in the BEL assertion format?

Could you provide a couple snippets of the script to help me understand how it represents nodes and edges?

Does BEL have the ability to represent all the edge types in Hetionet, or would you be converting the edge types to a smaller vocabulary (increases, directlyIncreases, decreases, directlyDecreases)? If the latter, then it seems that the resulting BEL network would not actually be the same as Hetionet, and we should make that clear in the docs.

cthoyt commented 4 years ago

> I'm not familiar with BEL. So is the idea that the output will be a single file (script) where each row is an edge in the BEL assertion format?

Correct. BEL script is completely its own format. On each line, there's either a SET statement to keep track of annotations like the license, source, etc. akin to a finite state machine, an UNSET statement to remove an annotation, or a BEL statement that actually describes some biology.
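
To make the state-machine description concrete, here is a minimal sketch of how a reader of a BEL script might track annotations line by line. It is not PyBEL's parser: the function name `iter_bel_statements` and the `Source` annotation are made up for illustration, and real BEL scripts have more line types and richer value syntax than this handles.

```python
def iter_bel_statements(lines):
    """Yield (statement, active_annotations) pairs from BEL-script-like lines.

    Simplified: SET adds or overwrites an annotation, UNSET removes one,
    DEFINE lines and comments are skipped, anything else is a statement.
    """
    annotations = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("DEFINE "):
            continue
        if line.startswith("SET "):
            key, _, value = line[len("SET "):].partition("=")
            annotations[key.strip()] = value.strip().strip('"')
        elif line.startswith("UNSET "):
            annotations.pop(line[len("UNSET "):].strip(), None)
        else:
            # Copy the dict so later SET/UNSET lines do not change earlier results.
            yield line, dict(annotations)


sample = [
    'SET Source = "Hetionet"',  # hypothetical annotation, for illustration only
    "p(ncbigene:356 ! FASLG) regulates p(ncbigene:1445 ! CSK)",
    "UNSET Source",
    'a(drugbank:DB00635 ! Prednisone) decreases path(doid:"DOID:6364" ! migraine)',
]
results = list(iter_bel_statements(sample))
```

In this sketch the first statement carries the `Source` annotation and the second carries none, because the UNSET line cleared it in between.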

> Could you provide a couple snippets of the script to help me understand how it represents nodes and edges?

I've made the export already, which is quite big, so I've created a sample that shows one example of each metaedge in BEL. It's probably obvious what each relationship means, even though the Hetionet metaedge types aren't explicitly labeled. See the attached files (a table with just BEL statements and the example BEL script) and let me know if that's not the case.

> Does BEL have the ability to represent all the edge types in Hetionet, or would you be converting the edge types to a smaller vocabulary (increases, directlyIncreases, decreases, directlyDecreases)? If the latter, then it seems that the resulting BEL network would not actually be the same as Hetionet, and we should make that clear in the docs.

Yes, BEL can represent all edge types. In many cases, BEL can capture even finer granularity, for example for Gene-covaries-Gene, since BEL types genes, RNAs, and proteins separately. The same is true for edges that describe a change in the abundance of something versus a change in its activity.

However, BEL does not have a way to handle GO molecular functions nor cellular components (nor should it). This means all of the gene-GO term relationships for those have been removed. This is because each entity in BEL is either a physical thing or a process (like pathways/biological processes), so molecular functions are definitely out. I'm not quite so sure about cellular components, since it's pretty heterogeneous inside GO. There are protein complexes, cell parts, and all sorts of other things there, only a small subset of which can really be represented in BEL. Let me know what you think about this too, since the exclusion of these parts would indeed make this different from the full Hetionet.
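
As a rough sketch of what that exclusion amounts to, the filter below drops any edge that touches a molecular function or cellular component node. It is not the code in the converter: it assumes each edge is a dict whose `source_id` and `target_id` are `[kind, identifier]` pairs and that the node kinds are named "Molecular Function" and "Cellular Component", and the three toy edges are invented for illustration.

```python
EXCLUDED_KINDS = {"Molecular Function", "Cellular Component"}  # assumed kind names


def drop_unrepresentable_edges(edges, excluded_kinds=EXCLUDED_KINDS):
    """Split edges into (kept, dropped) based on the kinds of their endpoints."""
    kept, dropped = [], []
    for edge in edges:
        source_kind = edge["source_id"][0]
        target_kind = edge["target_id"][0]
        if source_kind in excluded_kinds or target_kind in excluded_kinds:
            dropped.append(edge)
        else:
            kept.append(edge)
    return kept, dropped


# Toy edges, invented for illustration (not taken from the Hetionet export).
toy_edges = [
    {"source_id": ["Gene", 9353], "target_id": ["Biological Process", "GO:0051384"], "kind": "participates"},
    {"source_id": ["Gene", 9353], "target_id": ["Molecular Function", "GO:0003674"], "kind": "participates"},
    {"source_id": ["Gene", 9353], "target_id": ["Cellular Component", "GO:0005575"], "kind": "participates"},
]
kept, dropped = drop_unrepresentable_edges(toy_edges)
```

Returning the dropped edges alongside the kept ones makes it easy to report how much of the network is lost in the conversion.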

hetionet_example.zip

dhimmel commented 4 years ago

Thanks @cthoyt for generating the .bel files. Helps me understand what is going on. So this is the file we'd point users to download for the BEL format.

For quick reference, I'm copying the BEL assertions from the TSV:

r(ncbigene:55327 ! LIN7C) positiveCorrelation pop(uberon:"UBERON:0002107" ! liver)
r(ncbigene:147184 ! TMEM99) correlation pop(uberon:"UBERON:0001831" ! "parotid gland")
r(ncbigene:153572 ! IRX2) negativeCorrelation pop(uberon:"UBERON:0001296" ! myometrium)
r(ncbigene:51705 ! EMCN) positiveCorrelation path(doid:"DOID:10652" ! "Alzheimer's disease")
r(ncbigene:162282 ! ANKFN1) correlation r(ncbigene:6098 ! ROS1)
pop(uberon:"UBERON:0002369" ! "adrenal gland") positiveCorrelation r(ncbigene:414189 ! AGAP6)
pop(uberon:"UBERON:0000474" ! "female reproductive system") correlation r(ncbigene:119180 ! LYZL2)
pop(uberon:"UBERON:0000996" ! vagina) negativeCorrelation r(ncbigene:25801 ! GCA)
pop(uberon:"UBERON:0001460" ! arm) association path(doid:"DOID:332" ! "amyotrophic lateral sclerosis")
p(ncbigene:9513 ! FXR2) partOf complex(p(ncbigene:9513 ! FXR2), p(ncbigene:10445 ! MCRS1))
p(ncbigene:7416 ! VDAC1) directlyIncreases complex(p(ncbigene:8344 ! HIST1H2BE), p(ncbigene:7416 ! VDAC1))
p(ncbigene:9353 ! SLIT2) partOf bp(go:"GO:0051384" ! "response to glucocorticoid")
p(ncbigene:356 ! FASLG) regulates p(ncbigene:1445 ! CSK)
p(ncbigene:348654 ! GEN1) association path(doid:"DOID:0050425" ! "restless legs syndrome")
r(ncbigene:2983 ! GUCY1B3) negativeCorrelation path(doid:"DOID:14330" ! "Parkinson's disease")
a(drugbank:DB00553 ! Methoxsalen) increases act(p(ncbigene:5829 ! PXN))
a(drugbank:DB00273 ! Topiramate) increases path(umls:C1142412 ! "Vasodilation procedure")
a(drugbank:DB01074 ! Perhexiline) decreases act(p(ncbigene:51116 ! MRPS2))
a(drugbank:DB00694 ! Daunorubicin) directlyDecreases act(p(ncbigene:4363 ! ABCC1))
a(drugbank:DB01058 ! Praziquantel) partOf complex(a(drugbank:DB01058 ! Praziquantel), p(ncbigene:64816 ! CYP3A43))
a(drugbank:DB00122 ! Choline) directlyIncreases complex(a(drugbank:DB00122 ! Choline), p(ncbigene:57153 ! SLC44A2))
a(drugbank:DB00936 ! "Salicylic acid") association a(drugbank:DB00627 ! Niacin)
a(drugbank:DB00956 ! Hydrocodone) isA a(drugcentral:N0000000174 ! "Opioid Agonists")
a(drugbank:DB00497 ! Oxycodone) directlyIncreases act(p(ncbigene:4988 ! OPRM1))
a(drugbank:DB00635 ! Prednisone) decreases path(doid:"DOID:6364" ! migraine)
path(doid:"DOID:12236" ! "primary biliary cirrhosis") association p(ncbigene:2805 ! GOT1)
path(doid:"DOID:219" ! "colon cancer") positiveCorrelation r(ncbigene:29080 ! CCDC59)
path(mesh:D003693 ! Delirium) association path(doid:"DOID:0050741" ! "alcohol dependence")
path(doid:"DOID:8778" ! "Crohn's disease") negativeCorrelation r(ncbigene:10628 ! TXNIP)
path(doid:"DOID:12930" ! "dilated cardiomyopathy") association pop(uberon:"UBERON:0002353" ! "atrioventricular bundle")
a(drugbank:DB00674 ! Galantamine) regulates p(ncbigene:1143 ! CHRNB4)
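
For anyone who wants to pull these statements apart, here is a minimal sketch that splits one line into subject, relation, and object. It is not a BEL parser: it only looks for spaces outside parentheses and double quotes, the name `split_statement` is made up, and `RELATIONS` lists only the relations that appear in the sample above.

```python
RELATIONS = {
    "positiveCorrelation", "negativeCorrelation", "correlation", "association",
    "partOf", "isA", "regulates",
    "increases", "decreases", "directlyIncreases", "directlyDecreases",
}


def split_statement(statement):
    """Return (subject, relation, object) for a simple three-part BEL statement."""
    tokens, current = [], []
    depth, in_quotes = 0, False
    for ch in statement.strip():
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == " " and depth == 0:
                # A top-level space ends the current token.
                if current:
                    tokens.append("".join(current))
                    current = []
                continue
        current.append(ch)
    if current:
        tokens.append("".join(current))
    if len(tokens) != 3 or tokens[1] not in RELATIONS:
        raise ValueError(f"not a simple subject-relation-object statement: {statement!r}")
    return tuple(tokens)


subject, relation, obj = split_statement(
    "p(ncbigene:9513 ! FXR2) partOf complex(p(ncbigene:9513 ! FXR2), p(ncbigene:10445 ! MCRS1))"
)
```

Tracking parenthesis depth is what keeps the nested `complex(...)` term together as a single object instead of splitting it at its internal spaces.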

It looks like you've converted most edge types to a corresponding edge type in BEL, but the actual edge type label has been changed. For example, "Prednisone-palliates-migraine" has become "Prednisone-decreases-migraine". So would a hypothetical edge like "Prednisone-treats-migraine" also become "Prednisone-decreases-migraine" in BEL?
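
To illustrate the concern, the sketch below checks a label mapping for Hetionet edge types that collapse onto the same BEL relation. Only the palliates entry reflects what is shown above; the treats entry is the hypothetical from the question, the third entry is a placeholder, and the function name is made up.

```python
from collections import defaultdict


def find_collapsed_relations(mapping):
    """Return BEL relations that more than one Hetionet edge type maps onto."""
    by_relation = defaultdict(list)
    for hetionet_kind, bel_relation in mapping.items():
        by_relation[bel_relation].append(hetionet_kind)
    return {rel: sorted(kinds) for rel, kinds in by_relation.items() if len(kinds) > 1}


example_mapping = {
    "palliates": "decreases",            # seen in the sample: Prednisone-palliates-migraine
    "treats": "decreases",               # hypothetical, per the question above
    "hypothetical_kind": "association",  # placeholder for a one-to-one case
}
collapsed = find_collapsed_relations(example_mapping)
```

Any relation reported here marks a spot where the original Hetionet edge type could not be recovered from the BEL relation alone.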

> BEL does not have a way to handle GO molecular functions nor cellular components (nor should it).

It seems to me that BEL has a slightly different conceptual model than Hetionet. This is reflected in the different labels for edge types and in the inclusion of only entities that are "a physical thing or a process". For certain applications, BEL's conceptualization may be superior, but we should make sure users are aware of the difference.

I am thinking we should update the download section of the Hetionet README to include a BEL download link in the list. The raw BEL files would be hosted in the repository that generated them, to retain provenance and allow updates to your processing code. We can add a note that the BEL network is not an identical network in terms of edges and edge types/labels, but may be useful for users who want to use Hetionet in the context of BEL applications. @cthoyt does that make sense?

cthoyt commented 4 years ago

Sounds good! I made a new repo at https://github.com/pybel/hetionet-bel with the files and a readme explaining what's going on in BEL. Please take a look at it and let me know what you think. I added you as a collaborator so feel free to make changes / send a PR.

dhimmel commented 4 years ago

@cthoyt check out https://github.com/hetio/hetionet/commit/fa6182b8c0f18192f62313bb3534ed1bd9eb996a. I meant to make a pull request so you could review it but accidentally clicked the direct commit to master button! But if you make any comments on the commit, I am happy to address them.

Thanks a lot. I think this will be really helpful for bringing additional users to Hetionet.

I put BEL under a "Derivative Networks" section, which I thought made the most sense since the nodes and edges are not identical. Also gives us room to list other projects that build off of Hetionet.