hetio / hetionet

Hetionet: an integrative network of disease
https://neo4j.het.io

How to resolve pathway identifiers? #31

Open cthoyt opened 4 years ago

cthoyt commented 4 years ago

After a closer look, I'm having issues with the data source used for pathways.

Pathway::PC7_2008   BMAL1:CLOCK,NPAS2 activates circadian gene expression   Pathway

This should correspond to the Reactome pathway https://reactome.org/content/detail/R-HSA-1368108, but it's not clear what the PC7_2008 identifier is. My guess is that "PC7" stands for "Pathway Commons 7".

Later, WikiPathways identifiers (plus revisions) are used for other pathways:

Pathway::WP516_r71358   Hypertrophy Model   Pathway

It's nice to have the exact revision _r71358, but this isn't what's necessary to resolve this pathway and merge with other resources.

I'm also not sure what the actionable item is for this. I don't think you would update the source data, would you? Are there plans for a Hetionet v2.0 that will include some of the other new updates?

dhimmel commented 4 years ago

For PC7_2008, here's what I found from https://neo4j.het.io/:

MATCH (n:Pathway)
WHERE n.identifier = 'PC7_2008'
RETURN n

<id>: 37740
identifier: PC7_2008
license: CC BY 4.0
name: BMAL1:CLOCK,NPAS2 activates circadian gene expression
source: Reactome via Pathway Commons

Pathway resources were combined in this notebook.

The Pathway Commons raw data we used is Pathway Commons.7.All.GSEA.hgnc.gmt, which includes the line:

9606: BMAL1:CLOCK,NPAS2 activates circadian gene expression datasource: reactome; organism: 9606; id type: hgnc symbol  NAMPT   PPARA   CCRN4L  HELZ2   RORA    NR3C1   CHD9    NPAS2   CRY2    NR1D1   SMARCD3 SERPINE1    PER2    PER1    ARNTL2  BHLHE40 TGS1    BHLHE41 CRY1    TBL1XR1 AVP RXRA    CREBBP  ARNTL   F7  PPARGC1A    NCOA1   HDAC3   EP300   NCOA2   DBP NCOA6   TBL1X   CARM1   NCOR1   CLOCK   MED1

So it looks like this file lacked actual pathway identifiers, so I assigned identifiers as incrementing integers prefixed with PC7_. Definitely not a good system! I'm not sure whether Pathway Commons now provides the source database's pathway IDs in their data exports.
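To illustrate the scheme described above, here is a minimal Python sketch, not the original notebook code. It assumes the GMT file is tab-separated with the pathway name in the first column, a "key: value; ..." description in the second, and gene symbols after that. The function names and the exact numbering rule (counting from 1 in file order) are hypothetical.

```python
def parse_description(description):
    """Turn 'datasource: reactome; organism: 9606' into a dict."""
    meta = {}
    for part in description.split(";"):
        key, sep, value = part.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def assign_pc_ids(gmt_lines, prefix="PC7"):
    """Assign incrementing identifiers to GMT rows that have no source ID.

    Assumption: each row is name <TAB> description <TAB> gene symbols.
    The counter starts at 1, which may differ from the original numbering.
    """
    records = []
    counter = 0
    for line in gmt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        counter += 1
        name, description, genes = fields[0], fields[1], fields[2:]
        # Drop a leading taxon prefix such as "9606: " if present.
        taxon, sep, rest = name.partition(": ")
        if sep and taxon.isdigit():
            name = rest
        records.append({
            "identifier": f"{prefix}_{counter}",
            "name": name,
            "datasource": parse_description(description).get("datasource"),
            "genes": [g for g in genes if g],
        })
    return records
```

The weakness is visible here: the identifier depends only on row order, so it cannot be resolved against Reactome and would change if the file were reordered.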

> It's nice to have the exact revision _r71358, but this isn't what's necessary to resolve this pathway and merge with other resources.

Agree this would be best as a separate revision property rather than as part of the node identifier.
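As a rough sketch of that separation, the existing identifiers could be split with a regular expression. This assumes every WikiPathways node identifier has the form WP<digits>_r<digits>, as in the WP516_r71358 example. The function name is hypothetical.

```python
import re

# Assumed shape of Hetionet WikiPathways identifiers: WP<digits>_r<digits>
WP_PATTERN = re.compile(r"^(WP\d+)_r(\d+)$")


def split_wikipathways_id(identifier):
    """Return (stable_id, revision); revision is None if there is no suffix."""
    match = WP_PATTERN.match(identifier)
    if match is None:
        return identifier, None
    return match.group(1), int(match.group(2))


print(split_wikipathways_id("WP516_r71358"))
```

Identifiers that do not match, such as PC7_2008, are returned unchanged with a revision of None.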

> I'm also not sure what the actionable item is for this. I don't think you would update the source data, would you? Are there plans for a Hetionet v2.0 that will include some of the other new updates?

I'm not currently working on Hetionet v2.0. If someone wants to take the lead, I'd be happy to advise and support. There's lots of low-hanging fruit, like updating resources and adding more properties (such as CURIEs and URLs where missing).

I think one actionable item from your comment is that it would be nice to have a mapping from each Hetionet v1.0 node to a CURIE for that node. Nodes would get an extra curie property, so the change would be backwards compatible. In the case of the Pathway Commons nodes, this might actually be a bit annoying to generate.
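A minimal sketch of what such a mapping could look like for Pathway nodes follows. It assumes the CURIE prefix wikipathways and a hypothetical helper name. It deliberately returns None for the PC7_ nodes, since their integer suffixes carry no source identifier and would need a separate lookup (for example by pathway name).

```python
import re
from typing import Optional


def pathway_curie(identifier: str) -> Optional[str]:
    """Map a Hetionet Pathway identifier to a CURIE, if one can be derived.

    Assumption: WikiPathways nodes look like WP516_r71358 and the CURIE
    prefix is 'wikipathways'. The revision suffix is dropped.
    """
    match = re.fullmatch(r"(WP\d+)(?:_r\d+)?", identifier)
    if match:
        return f"wikipathways:{match.group(1)}"
    # Pathway Commons nodes (PC7_<n>) cannot be resolved from the ID alone.
    return None


for node_id in ["WP516_r71358", "PC7_2008"]:
    print(node_id, pathway_curie(node_id))
```

Storing the result in a new curie property leaves the existing identifier property untouched.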