tomlue commented 2 years ago

Tasks

- [ ] add property-records to graph (or explain why not)
- [x] migrate to factor graph (or explain why not)

A. Should we add property-record nodes that assign a property and value to other nodes and reference their sources? This would:

- allow many collaborators on the graph
- allow fine-grained access control on uploaded properties
- allow storage of conflicting values
- allow storage of multiple values for the same property from different sources
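
A minimal sketch of the property-record idea, assuming a plain in-memory representation rather than the actual graph schema. The class name `PropertyRecord`, its fields, the source and owner names, and the affinity values are all hypothetical and only illustrate how conflicting values from different sources can coexist:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyRecord:
    """One claim: 'node X has property P with value V, according to source S'."""
    node_id: str   # the node the claim is about
    prop: str      # property name
    value: object  # claimed value
    source: str    # where the claim comes from
    owner: str     # collaborator who uploaded it (a possible hook for access control)


def values_for(records, node_id, prop):
    """Return every (value, source) pair recorded for one property of one node."""
    return [(r.value, r.source) for r in records
            if r.node_id == node_id and r.prop == prop]


# Two made-up, conflicting affinity claims about the same node.
records = [
    PropertyRecord("factor_node_100", "affinity", 1.0, "source_A", "alice"),
    PropertyRecord("factor_node_100", "affinity", 2.5, "source_B", "bob"),
]
claims = values_for(records, "factor_node_100", "affinity")
```

Because each claim is its own record, neither value overwrites the other, and the `owner` field could be used to decide who may edit or see a record.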

B. Should we use 'factor relationships'? Our relationships don't handle multiple inputs/outputs, can't be the target of other relationships, and can't be associated with property records. The current approach won't work for anything but trivial relationships. For example, the following statements cannot be captured well:

a. Gefitinib promotes the reaction "EGFR protein binds to EGFR protein"
b. [TGFA protein binds to and affects the activity of EGFR protein] which decreases susceptibility to nickel sulfate
c. X binds to EGFR with affinity Z

In a and b, we want to say that something transforms the relationship between two other things. In c, we want to attach a quantitative value to a relationship, but there may be disagreement about what the affinity is.

Factor relationships solve this, for example:

factor-node:
   label: binds-to
   identifier: 100
   ligand: TGFA
   target: EGFR

variable-node:
   label: protein
   name: TGFA

variable-node:
   label: protein
   name: EGFR

relationship:
   input: TGFA
   output: factor_node_100

relationship:
   input: factor_node_100
   output: EGFR

...

This approach creates a bipartite factor graph where:

- Every node is either a variable or a factor.
- Variables can relate to factors but not to other variables.
- Factors can relate to variables but not to other factors.

Factor graphs should be able to capture very complex relationships, and support advanced modeling methods.
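
The bipartite rule can be checked mechanically. Here is a rough sketch, assuming a toy in-memory graph; the `FactorGraph` class and its methods are hypothetical and not part of any existing code in this repo. It reuses the TGFA / EGFR / `factor_node_100` example from above:

```python
VARIABLE, FACTOR = "variable", "factor"


class FactorGraph:
    """Toy graph that only accepts edges between a variable and a factor."""

    def __init__(self):
        self.kind = {}   # node id -> VARIABLE or FACTOR
        self.edges = []  # (source id, target id) pairs

    def add_node(self, node_id, kind):
        if kind not in (VARIABLE, FACTOR):
            raise ValueError(f"unknown node kind: {kind}")
        self.kind[node_id] = kind

    def add_edge(self, source, target):
        # Bipartite rule: an edge must join one variable and one factor.
        if self.kind[source] == self.kind[target]:
            raise ValueError(
                f"{source} and {target} are both {self.kind[source]} nodes"
            )
        self.edges.append((source, target))


g = FactorGraph()
g.add_node("TGFA", VARIABLE)
g.add_node("EGFR", VARIABLE)
g.add_node("factor_node_100", FACTOR)
g.add_edge("TGFA", "factor_node_100")
g.add_edge("factor_node_100", "EGFR")
```

A direct `g.add_edge("TGFA", "EGFR")` would raise `ValueError`, since both ends are variables.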

Maddocent commented 2 years ago

The way I see this, and thanks for these helpful examples @tomlue, is that there is indeed no easy way to capture this in direct node-edge-node relationships like we have been trying so far. I see no reason not to adopt this. But some more feedback from others, @Huan-Yang @amdehaan, would be great!

Huan-Yang commented 2 years ago

Thanks for the doc and the example @tomlue. Just curious about two things: (i) how to incorporate Z in c; (ii) for "B. Should we use 'factor relationships'?", does the proposed approach also work for a situation with five species (a, b, c, d, e), where the reaction is a + b -> c + d and e promotes the reaction?

MarieCo commented 2 years ago

I like the idea of a factor graph, thanks for the examples @tomlue. A few thoughts/questions:

1. As I understand it, the node types would then be "variable" or "factor", with any further attribute passed as a property of the node (e.g. variable node {type: chemical, ChemicalName: "Ethanol", ChemicalID:"..."})? But then these properties would depend on the "first" property: type. Do you know if there is a way to constrain that, for example so that you can't (by mistake) give a "ChemicalID" property to a gene? Is that what you mean by "allow fine grain control on uploaded properties"?
2. Similarly, is there a way to restrict adding a node so that, for example, you can only add a factor node if you also document where you found the information?
3. How do we want to deal with synonyms? For example, for chemicals, do we want to have a property “synonyms” which lists them? Or have all synonyms be separate variable nodes? In that case I am not sure how to link them together (multiple factor-nodes “is_synonym”?).
4. Related to @Huan-Yang's question: you mention that our relationships do not handle multiple inputs/outputs, but it's not clear to me how you would deal with that in a factor graph. Do you then use all possible combinations? For example, for a+b -> c+d you would have 4 relationships: a->c, a->d, b->c, b->d? You could have reactant 1 (a), reactant 2 (b), product 1 (c), product 2 (d) in the factor node, I suppose, but that still does not solve the multiple-inputs issue for the relationship. Or am I missing something?

I realise that not everything is directly related to this issue and some are more general questions, so please let me know if I should move some to a different (new?) issue.

JJSirius commented 2 years ago

I hesitated to recommend a factor graph because of the increase in nodes and relationships, but now I think the queries will be cleaner and performance will improve. https://neo4j.com/developer/modeling-designs/

tomlue commented 2 years ago

Hey neat, a hyperedge in those docs seems like the same thing as a factor.

tomlue commented 2 years ago

@MarieCo I responded to your queries (1 through 4) below and created issues for the proposed solutions.

How to prevent bad property assignments? (queries 1 & 2) We can restrict the creation of new labels and properties with Neo4j fine-grained access control and constraints.

- [ ] Define labels, label-properties, relationships, and relationship domain-range in migration scripts.
- [ ] Restrict all creation of new labels and properties to migration scripts.
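
To illustrate the kind of rule a migration script would pin down, here is a small application-side sketch. It is not the Neo4j access-control or constraint mechanism itself; the `ALLOWED_PROPERTIES` mapping, the `validate_node` helper, and the gene property names are hypothetical:

```python
# Hypothetical schema: which properties each label may carry.
ALLOWED_PROPERTIES = {
    "chemical": {"ChemicalName", "ChemicalID"},
    "gene": {"GeneSymbol", "GeneID"},
}


def validate_node(label, properties):
    """Raise ValueError if the label is unknown or carries a disallowed property."""
    if label not in ALLOWED_PROPERTIES:
        raise ValueError(f"unknown label: {label}")
    unexpected = set(properties) - ALLOWED_PROPERTIES[label]
    if unexpected:
        raise ValueError(f"{label} may not have properties: {sorted(unexpected)}")


# Passes: both properties are allowed on a chemical.
validate_node("chemical", {"ChemicalName": "Ethanol", "ChemicalID": "..."})
```

Calling `validate_node("gene", {"ChemicalID": "..."})` would raise `ValueError`, which is the mistake raised in query 1.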

How do we want to deal with synonyms? (query 3) A possible solution, where * indicates any node, is:

Variable Node `Synonym`
Function Node `Synonym_assignment`
Relation Edge `has_synonym : Synonym_assignment -> Synonym` 
Relation Edge `has_target : Synonym_assignment -> *`

This adds one layer to

Variable Node `Synonym`
Relation Edge `has_synonym : * -> Synonym`

The first design, with the extra `Synonym_assignment` node, generalizes more and would allow additional relations like confidence or source.

There's a lot to think about here, and I'm not ready to assign issues on it. A paper on the RDF OWL property owl:sameAs shows how hard this can be: "When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web". Identity links are really hard to get right, and synonyms are a related concept.

How do factors handle multiple relationships, i.e. a+b -> c+d? (query 4) F(A,B,C,D) is a four-argument factor node (it may need a different name, see 2) where

a -A-> F
b -B-> F
c -C-> F
d -D-> F

F above doesn't really have an input and output. The edges are better defined as undirected. The relationship labels tell you what role each related node plays in the F instance.

You can also think of F as a table associating instances of A, B, C and D.

| A | B | C | D |
|---|---|---|---|
| a1 | b32 | c9 | d5 |
| a2 | b12 | c3 | d8 |
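
The two views (role-labeled edges into F, and F as a table) can be connected with a short sketch. It assumes edges are stored as `(node, role, factor instance)` triples; the helper names and the instance ids `F1` and `F2` are hypothetical:

```python
ROLES = ("A", "B", "C", "D")


def add_factor_instance(edges, factor_id, members):
    """Attach one node per role to a factor instance. members maps role -> node id."""
    if set(members) != set(ROLES):
        raise ValueError(f"expected exactly the roles {ROLES}")
    for role in ROLES:
        # The edge label carries the role; the edge itself has no real direction.
        edges.append((members[role], role, factor_id))


def as_table(edges):
    """Collapse role-labeled edges into one row per factor instance."""
    rows = {}
    for node_id, role, factor_id in edges:
        rows.setdefault(factor_id, {})[role] = node_id
    return [tuple(row[r] for r in ROLES) for _, row in sorted(rows.items())]


edges = []
add_factor_instance(edges, "F1", {"A": "a1", "B": "b32", "C": "c9", "D": "d5"})
add_factor_instance(edges, "F2", {"A": "a2", "B": "b12", "C": "c3", "D": "d8"})
table = as_table(edges)
```

Each factor instance contributes four edges and one table row, so a+b -> c+d needs a single F node rather than the four pairwise relationships a->c, a->d, b->c, b->d.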

Some extra thoughts:

1. I am not certain that a factor graph is the best solution, but it does enforce lightweight relationships.
2. The above isn't really a factor graph. "Function graph" might be a better name, with factors called function-nodes or fnodes.
3. @JJSirius's reference to the Neo4j hyperedge indicates that transforming edges into nodes is a common thing.
4. https://medium.com/neo4j/graph-data-modeling-all-about-relationships-5060e46820ce @Maddocent referenced relationship reification, which helps to motivate lightweight edges as well.
5. https://www.w3.org/TR/rdf-schema/#ch_properties The RDF Schema documentation is dense, but worth reading carefully. RDF statements link a subject, predicate and object. The subject and object are resources (instances of classes) and the predicate is a property. I didn't previously realize that RDF properties are meant to be quite lightweight.
6. https://docs.ropensci.org/rdflib/articles/rdf_intro.html This helped me see RDF as a way of collapsing tabular data to a special long form. It also pretty much sold me on the idea of RDF maybe actually being useful (maybe).

ontox-hu / aspis4j

property-record nodes & relationship factors #28
