Open tomlue opened 2 years ago
The way I see this, and thanks for these helpful examples @tomlue, is that indeed there is no easy way to capture this in direct node-edge-node relationships like we have been trying so far. I see no reason not to adopt this. But some more feedback from others, @Huan-Yang @amdehaan, would be great!
Thanks for the doc and the example @tomlue. Just curious about two things: (i) how to incorporate Z in c; (ii) for "B. Should we use 'factor relationships'?", does the proposed approach also work for a situation with five species (a, b, c, d, e), where the reaction is a + b -> c + d and e promotes the reaction?
I like the idea of a factor graph, thanks for the examples @tomlue. A few thoughts/questions:
As I understand it, the node types would then be "variable" or "factor" and then any further attribute passed as a property of the node (e.g.: variable node {type: chemical, ChemicalName: "Ethanol", ChemicalID:"..."})? But then these properties would depend on the "first" property: type. Do you know if there is a way to constrain that, for example that you can't give (by error) a property "ChemicalID" to a gene? Is that what you mean by "allow fine grain control on uploaded properties"?
Similarly, is there a way to restrict the adding of a node so that you for example only add a factor node if you also document where you found the information?
How do we want to deal with synonyms? For example for chemicals, do we want to have a property “synonyms” which lists them? Or have all synonyms be separate variable nodes, but then I am not sure how to link them together (multiple factor-nodes “is_synonym”?).
Related to @Huan-Yang's question: you mention that our relationships do not handle multiple inputs/outputs, but it's not clear to me how you would deal with that with a factor graph. Do you then use all possible combinations? For example, for a+b -> c+d you would have 4 relationships: a->c, a->d, b->c, b->d? You could have reactant 1 (a), reactant 2 (b), product 1 (c), product 2 (d) in the factor node I suppose, but that still does not solve the multiple inputs issue for the relationship, or am I missing something?
I realise that not everything is directly related to this issue and some are more general questions, so please let me know if I should move some to a different (new?) issue.
I hesitated to recommend a factor graph due to the increase of nodes and relationships, but now I think the queries will be cleaner and improve the performance. https://neo4j.com/developer/modeling-designs/
Hey neat, a hyperedge in those docs seems like the same thing as a factor.
@MarieCo I responded to your queries (1 through 4) below and created issues for the proposed solutions.
How to prevent bad property assignments? (queries 1 & 2)
We can restrict the creation of new labels and properties with Neo4j fine-grained access control and constraints.
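To make queries 1 and 2 concrete, here is a minimal sketch of the rule we would want enforced, written as an application-side check in plain Python. It is not Neo4j's constraint mechanism. The schema contents (`GeneSymbol`, `GeneID`, the `source` property) and the function name `validate_node` are assumptions for illustration only.

```python
# Hypothetical schema: which properties each variable type may carry.
ALLOWED_PROPERTIES = {
    "chemical": {"ChemicalName", "ChemicalID"},
    "gene": {"GeneSymbol", "GeneID"},
}


def validate_node(kind, node_type, properties):
    """Return a list of problems; an empty list means the node is acceptable."""
    if kind == "factor":
        # Query 2: a factor node must document where the information came from.
        if not properties.get("source"):
            return ["factor node needs a 'source' property"]
        return []
    if kind != "variable":
        return [f"unknown node kind: {kind}"]
    allowed = ALLOWED_PROPERTIES.get(node_type)
    if allowed is None:
        return [f"unknown variable type: {node_type}"]
    # Query 1: reject properties that do not belong to this variable type.
    return [
        f"property '{name}' is not allowed on type '{node_type}'"
        for name in sorted(properties)
        if name not in allowed
    ]
```

With this check, a gene node carrying `ChemicalID` is rejected, and so is a factor node uploaded without a source.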
How do we want to deal with synonyms? (query 3) A possible solution, where * indicates any node, is:
Variable Node `Synonym`
Function Node `Synonym_assignment`
Relation Edge `has_synonym : Synonym_assignment -> Synonym`
Relation Edge `has_target : Synonym_assignment -> *`
This adds one layer to
Variable Node `Synonym`
Relation Edge `has_synonym : * -> Synonym`
The first option, with the extra `Synonym_assignment` node, generalizes more and would allow additional relations like confidence or source.
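A rough sketch of the `Synonym_assignment` layout in plain Python, only to show how a lookup would traverse it. The node ids, the `text` and `source` properties, and the helper `synonyms_of` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    label: str
    start: str
    end: str


# Hypothetical nodes: one chemical, one synonym, one assignment linking them.
nodes = {
    "chem:ethanol": {"kind": "variable", "type": "chemical"},
    "syn:1": {"kind": "variable", "type": "Synonym", "text": "ethyl alcohol"},
    "assign:1": {
        "kind": "function",
        "type": "Synonym_assignment",
        "source": "hypothetical-source",  # extra information lives on the assignment
    },
}

edges = [
    Edge("has_synonym", "assign:1", "syn:1"),
    Edge("has_target", "assign:1", "chem:ethanol"),
]


def synonyms_of(target_id):
    """Collect synonym texts reachable through assignments that target the node."""
    assignments = {
        e.start for e in edges if e.label == "has_target" and e.end == target_id
    }
    return sorted(
        nodes[e.end]["text"]
        for e in edges
        if e.label == "has_synonym" and e.start in assignments
    )
```

Because the assignment is a node of its own, a source or confidence can be attached to it without touching either the synonym or the target.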
There's a lot to think about here, and I'm not ready to assign issues on it. A paper on the RDF/OWL property owl:sameAs shows how hard this can be: "When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web". Identity links are really hard to get right, and synonyms are a related concept.
How do factors handle multiple relationships, e.g. a+b -> c+d? (query 4)
`F(A,B,C,D)` is a four-argument factor node (it may need a different name, see 2) where
F above doesn't really have an input and output. The edges are better defined as undirected. The relationship labels tell you what role each related node plays in the F instance.
You can also think of F as a table associating instances of A, B, C and D.

| A | B | C | D |
|---|---|---|---|
| a1 | b32 | c9 | d5 |
| a2 | b12 | c3 | d8 |
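As a sketch of how this could cover the five-species case raised earlier (a + b -> c + d, with e promoting the reaction), the factor simply takes a fifth argument and every edge carries a role label. The class `FactorGraph` and the role names `reactant`, `product` and `promoter` are assumptions, not an agreed schema.

```python
from collections import defaultdict


class FactorGraph:
    """Bipartite graph: variable nodes on one side, factor nodes on the other."""

    def __init__(self):
        self.variables = set()
        self.factors = {}  # factor id -> {role label: [variable ids]}

    def add_factor(self, factor_id, **roles):
        """Attach variables to one factor; each keyword argument is a role label."""
        members = defaultdict(list)
        for role, variable_ids in roles.items():
            for variable_id in variable_ids:
                self.variables.add(variable_id)
                members[role].append(variable_id)
        self.factors[factor_id] = dict(members)

    def edges(self):
        """Yield (factor, role, variable) triples, one per undirected edge."""
        for factor_id, members in self.factors.items():
            for role, variable_ids in members.items():
                for variable_id in variable_ids:
                    yield (factor_id, role, variable_id)


graph = FactorGraph()
graph.add_factor(
    "reaction_1",
    reactant=["a", "b"],
    product=["c", "d"],
    promoter=["e"],
)
```

This gives one factor node with five role-labeled edges, instead of the four pairwise relationships a->c, a->d, b->c, b->d, and the promoter attaches to the reaction as a whole.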
Some extra thoughts:
RDF statements link a subject, predicate and object. The subject and object are classes. The predicate is a property. I didn't realize previously that RDF properties are meant to be quite lightweight.
Tasks
A. Add property-record nodes that assign a property+value to other nodes and reference sources? It would:
B. Should we use 'factor relationships'? Our current relationships don't handle multiple inputs/outputs, can't be the target of other relationships, and can't be associated with property records. They only work for trivial relationships. For example, the following statements cannot be captured well:
a. Gefitinib promotes the reaction "EGFR protein binds to EGFR protein"
b. [TGFA protein binds to and affects the activity of EGFR protein] which decreases susceptibility to nickel sulfate
c. X binds to EGFR with affinity Z
In a,b we want to say that something transforms the relationship between two other things. In c, we want to add a quantitative value to a relationship, but there may be disagreements about what the affinity is.
Factor relationships solve this, for example:
This approach creates a bipartite factor graph where:
Factor graphs should be able to capture very complex relationships, and support advanced modeling methods.
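For statement c (X binds to EGFR with affinity Z), one possible reading is that every reported measurement becomes its own factor instance with its own value and source, so disagreeing affinities can sit side by side. The sketch below assumes that reading. The affinity numbers and source names are placeholders, not real data, and `reported_affinities` is a hypothetical helper.

```python
# Each entry is one factor instance linking a ligand and a target.
# Affinity values and sources are placeholders for illustration only.
binding_factors = [
    {"id": "bind:1", "ligand": "X", "target": "EGFR",
     "affinity": 1.0, "source": "hypothetical-source-1"},
    {"id": "bind:2", "ligand": "X", "target": "EGFR",
     "affinity": 2.0, "source": "hypothetical-source-2"},
]


def reported_affinities(ligand, target):
    """Return every (affinity, source) pair claimed for a ligand/target pair."""
    return [
        (factor["affinity"], factor["source"])
        for factor in binding_factors
        if factor["ligand"] == ligand and factor["target"] == target
    ]
```

A query for X and EGFR then returns both claims with their sources, rather than forcing a single affinity onto one edge.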