paleobot / pbot-dev

Codebase and initial design documents for pbot client
MIT License

Implement OTU synonymies #58

Closed NoisyFlowers closed 1 year ago

NoisyFlowers commented 2 years ago

We need to be able to handle synonymies on OTUs. See the attached diagram for an example.

OTU-Synonomies

In this example, the "Dorudon" nodes are OTU Description nodes. The synonymies are represented by intermediate nodes called Opinion (actual name TBD). Using an intermediate node allows us to attach Reference nodes to the opinion via CITED_BY relationships, as well as nodes of a new type called Comment. A Comment node will be used to capture a comment and an up/down vote for the Opinion. We will use the Reference dates and Comment up/down vote tallies in queries involving synonymy lineage.
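To make the shape of this proposal concrete, here is a minimal in-memory sketch in plain Python (not Neo4j). The class and field names (`otus`, `cited_by`, `comments`, `vote`) and the reference titles and years are assumptions made up for illustration; only the node types Opinion, Reference, and Comment come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Reference:
    title: str  # hypothetical field
    year: int   # hypothetical field standing in for the Reference date


@dataclass
class Comment:
    text: str
    vote: int  # assumption: +1 for an upvote, -1 for a downvote


@dataclass
class Opinion:
    """Intermediate node representing one synonymy between two OTU Descriptions."""
    otus: Tuple[str, str]
    cited_by: List[Reference] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)

    def vote_tally(self) -> int:
        # Net tally of the up/down votes carried by the attached Comments.
        return sum(c.vote for c in self.comments)

    def latest_reference_year(self) -> Optional[int]:
        # Most recent Reference date, or None when nothing is cited.
        return max((r.year for r in self.cited_by), default=None)


# Placeholder data, not taken from the diagram.
opinion = Opinion(
    otus=("Dorudon atrox", "Dorudon old"),
    cited_by=[Reference("placeholder reference A", 1990),
              Reference("placeholder reference B", 2005)],
    comments=[Comment("agree", 1), Comment("disagree", -1), Comment("agree", 1)],
)
print(opinion.vote_tally(), opinion.latest_reference_year())
```

Keeping the votes on Comment rather than on Opinion mirrors the description: the tally is derived, not stored.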

aazaff commented 2 years ago

Using the above example, this was our intended order of operations for a query:

OBJECTIVE: Find synonyms of Dorudon atrox

  1. Query for all OPINION nodes directly attached to Dorudon atrox OTU.
  2. RETURNS = OPINION 1 and OPINION 3
  3. FIND "BEST" OPINION based on two rules
    1. Which has most votes (least downvotes?)
    2. If equal votes, opinion with most recent reference
  4. OPINION 1 would win because OPINION 3 had one downvote COMMENT
  5. Query would then follow through all child relations of OPINION 1
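Step 3 can be sketched as a simple ranking. This is plain Python rather than Cypher, and it assumes "most votes" means upvotes minus downvotes, which the list above leaves open. The dictionary keys and the reference years are hypothetical.

```python
def best_opinion(opinions):
    """Pick the opinion with the highest net votes; break ties by newest reference.

    Each opinion is a dict with hypothetical keys:
    'name', 'upvotes', 'downvotes', 'reference_year'.
    """
    return max(
        opinions,
        key=lambda o: (o["upvotes"] - o["downvotes"], o["reference_year"]),
    )


# OPINION 3 carries one downvote COMMENT, as in the example above.
# The years are placeholders.
opinions = [
    {"name": "OPINION 1", "upvotes": 0, "downvotes": 0, "reference_year": 2000},
    {"name": "OPINION 3", "upvotes": 0, "downvotes": 1, "reference_year": 2010},
]
print(best_opinion(opinions)["name"])
```

With these inputs OPINION 1 is selected even though OPINION 3 has the newer reference, because the reference date is only consulted when the vote scores are equal.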

aazaff commented 2 years ago

COUNTER CASE: Why the above workflow will NOT work.

There may be a VALID one-to-many relationship with opinions in this direction. For example, in the above picture, Dorudon apple, Dorudon wrong, and Dorudon old are all "valid" because opinion 1 and opinion 2 do not "contradict" each other, but the workflow proposed above always chooses the "best" opinion (i.e., only follows 1 opinion).

IMAGE 2022-05-09 12:25:37

aazaff commented 2 years ago

I think the correct way to deal with this problem is to rethink the cardinality of a query. We never ask what the children of an OTU are, but rather what is the best name for that particular name. In this case the relationship will always be linear - i.e., one-to-one.

For example, in the above two pictured examples, we can much more easily determine that for ANY OTU we start with, the best final name for that OTU is Dorudon atrox. It is only if we go backwards that problems will occur.
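A minimal sketch of this "forward only" idea, in plain Python: each name has at most one forward link to the name that supersedes it, so resolving the best name is a linear walk. The `replaced_by` mapping and its exact links are hypothetical, since the diagrams are not reproduced here, and the sketch assumes any competing opinions have already been reduced to a single forward link per name.

```python
def best_name(name, replaced_by):
    """Follow forward links until a name with no successor is reached."""
    seen = {name}
    while name in replaced_by:
        name = replaced_by[name]
        if name in seen:
            # A loop means there is no terminus to report.
            raise ValueError("circular synonymy reached at " + name)
        seen.add(name)
    return name


# Hypothetical links: older name -> name that supersedes it.
replaced_by = {
    "Dorudon apple": "Dorudon atrox",
    "Dorudon wrong": "Dorudon old",
    "Dorudon old": "Dorudon atrox",
}

for start in ("Dorudon apple", "Dorudon wrong", "Dorudon old", "Dorudon atrox"):
    print(start, "->", best_name(start, replaced_by))
```

Every starting name resolves to Dorudon atrox, which is the one-to-one behavior described above. Going the other way (from Dorudon atrox to its children) fans out and has no single answer.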

NoisyFlowers commented 2 years ago

Questions:

  1. How does our ranking system figure into the "find best" query? Do we just stop the search when we hit an Opinion that is over a threshold of down votes?
  2. Are we sure we don't want to support queries of the first sort as well? Maybe we would just traverse the whole Opinion subgraph and present all lower synonyms as the result, whatever their rank or age.

Different topic:

I'm trying to come up with meaningful names for the new node types and relationships.

I don't like Opinion. It feels too generic for what we are trying to represent. I also don't like the direction of our relationships, because I'm having trouble coming up with suitable names for them.

Without the intermediate node, we might have a relationship between OTU Description nodes called REPLACED_BY, pointing in the direction of the new Description. Adding an intermediate node means we are replacing that relationship with a first class entity that represents that relationship. Each original entity now has a relationship to this new entity. We are not obligated to keep the overall direction of the original relationship for these new relationships (and Neo4j can traverse both directions anyhow). 

I'm considering this instead: image

Whatever the directions of our relationships, Neo4j doesn't natively support repeating pattern matching (meaning a pattern containing multiple different relationship and node types), only repeating relationship matching. But APOC does, albeit with a somewhat cryptic syntax (https://neo4j.com/labs/apoc/4.1/graph-querying/expand-paths-config/). Finding the best (newest) synonym for a given OTU would then be something like:

MATCH
    (startOTU:Description {pbotID:"old"})
CALL apoc.path.expandConfig(startOTU, {
    sequence: "Description,REPLACED_BY>,Synonym,<REPLACES"
}) YIELD path
RETURN last(nodes(path)) as best

Or something like that. I haven't tried it yet.

aazaff commented 2 years ago

I am content with the Replaces and Replaced_by relations. I am loose on Synonym. Is "opinion" really so bad? Synonymy is a type of opinion, so wouldn't it be better to be more general rather than more specific?

NoisyFlowers commented 2 years ago

  1. I'm not sure we gain anything with generalized node types. The more specific the node type, the easier (in terms of writing the query and in terms of Neo4j efficiency) it is to query for exactly what you want. (As an aside, I've been wondering lately about the wisdom of discerning Description types by a type flag. We might have gained some functionality and performance by using separate types. But that's a separate conversation.)

  2. If we call the nodes Opinion, the REPLACED_BY/REPLACES relationship names don't really work. I can't think of good names to hang off of Opinion.

  3. I don't think Opinion captures the function of that node well. Actually, it sounds more like what we intend Comment to handle.

NoisyFlowers commented 2 years ago

This whole thing is kind of nightmarish, as long as we allow the circular Synonym relationship described at the top.

To get a better grip on how to design synonyms, I've been trying to think about how a client will interact with them.

We've mentioned a query in which the client says "find me the best synonym of this old thing". This would move forward through REPLACED_BY->Synonym<-REPLACES-OTU paths until it hit a terminus. The problem here is that circular diagram you had up front. We still need our age and ranking system to define the terminus.

We've mentioned a query that says "find me all the lesser synonyms of x". This one would probably be a full traversal of all REPLACES->Synonym<-REPLACED_BY-OTU paths starting from x, collecting every OTU along the way. Except that circular paths will play hell with this as well.
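For the "all lesser synonyms" traversal, a visited set is one way to keep circular paths from looping forever. The sketch below is a plain-Python model, not a Cypher or APOC query. The `replaces` mapping is hypothetical, and the intermediate Synonym nodes are collapsed into direct links for brevity.

```python
from collections import deque


def lesser_synonyms(start, replaces):
    """Breadth-first walk over everything `start` replaces, directly or indirectly."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        current = queue.popleft()
        for older in replaces.get(current, []):
            if older not in seen:  # skip nodes already visited, including the start
                seen.add(older)
                found.append(older)
                queue.append(older)
    return found


# Hypothetical links: OTU -> OTUs it replaces. The last entry closes a circle.
replaces = {
    "best": ["other 1"],
    "other 1": ["other 2"],
    "other 2": ["best"],
}
print(lesser_synonyms("best", replaces))
```

The walk terminates on the circular graph, but it does so for any starting node, so it cannot say which node is "best". That still needs the age and ranking rules.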

What about mutations? Since Synonym is a first class entity, I think it should have explicit mutations and corresponding client forms. In other words, the user must create the new OTU first, then create synonyms to older OTUs. 

I'm assuming there should be only one path from a given OTU to the best OTU. What I mean is the Synonym from other 2 to best in the diagram below would not be allowed:

Screenshot 2022-05-12 135004

Note that this is different than the circular path in the first diagram in this issue, which tries to make an older OTU the best. This rule merely says you can't create a direct Synonym path from an old OTU to a better one if there is already a path from the old one to the better one.

For Synonym creation, the user first selects the OTU that REPLACES the other. Then the selection list they are presented for the REPLACED_BY would only contain OTUs that are not already REPLACED_BY, directly or indirectly, the first OTU selected. For a newly created OTU, this would be all other OTUs. Note that this would still allow a user to select other 2 as the first OTU and best as the second, since that replacement goes the other direction: other 2 REPLACES best. We want to allow that, as described up front. But at that point we've got the circular thing going again.
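The selection-list rule can be sketched as a reachability check. This is plain Python under assumed names: `replaced_by` maps each OTU to the OTUs that directly replace it, with Synonym nodes collapsed, and "new" is a made-up OTU with no synonyms yet.

```python
def is_replaced_by(older, newer, replaced_by):
    """True if a chain of REPLACED_BY links leads from `older` to `newer`."""
    seen = set()
    stack = [older]
    while stack:
        current = stack.pop()
        for nxt in replaced_by.get(current, []):
            if nxt == newer:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def replaceable_candidates(replacer, all_otus, replaced_by):
    """OTUs that may be offered as the REPLACED_BY side for `replacer`."""
    return [
        otu for otu in all_otus
        if otu != replacer and not is_replaced_by(otu, replacer, replaced_by)
    ]


# Hypothetical graph: other 2 -> other 1 -> best.
replaced_by = {"other 2": ["other 1"], "other 1": ["best"]}
all_otus = ["best", "other 1", "other 2", "new"]

print(replaceable_candidates("best", all_otus, replaced_by))
print(replaceable_candidates("new", all_otus, replaced_by))
print(replaceable_candidates("other 2", all_otus, replaced_by))
```

The third call still offers best as a candidate for other 2, which is the circular case described above: the filter blocks duplicate forward paths but not reversed ones.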

In the interest of simplifying management, I think maybe updates would not be allowed. You have to delete and create anew.

What if someone deletes the Synonym between other 01 and other 02? Then the lineage from other 02 to best would be broken. Do we allow that? Or can you only delete Syns that are terminal? But this is a problem if the path is circular. (While we aren't allowing the Synonym from other 02 to best, we've just said we would allow a Synonym from best to other 02.)

aazaff commented 2 years ago

One simplifying aspect that I did not consider is that in our actual use case, as opposed to the one from the pbot I was originally using as a starting point, we can reasonably ignore polarity.

So there is no “best” name. Just good synonyms and bad synonyms. Does that simplify things?

NoisyFlowers commented 2 years ago

I don't see how.

NoisyFlowers commented 2 years ago

In the following graph, the right-most Desc is 001, the left-most is 003. There is a circular synonym from 003 to 001 (circled in red).

Screenshot 2022-05-13 160156

The following query asks for the "best" synonym for 001. It returns 001.

MATCH
    (d1:Description {title:"Desc-SynTest-2022-05-13-001"})
CALL apoc.path.expandConfig(d1, {
    sequence: "Description,REPLACED_BY>,Synonym,<REPLACES"
}) YIELD path
WITH path
RETURN last(collect(last(nodes(path)))) as best

Default behavior for apoc.path.expandConfig appears to be to stop when the starting node is re-encountered. That's good, I guess. At least it doesn't get stuck following the relationships round and round forever.

The problem is that this same behavior happens with whatever node we start on.

MATCH
    (d1:Description {title:"Desc-SynTest-2022-05-13-002"})
CALL apoc.path.expandConfig(d1, {
    sequence: "Description,REPLACED_BY>,Synonym,<REPLACES"
}) YIELD path
WITH path
RETURN last(collect(last(nodes(path)))) as best

Returns 002.

aazaff commented 2 years ago

So that’s not bad if, as I said in my previous comment, no node is truly “correct”. As long as we complete the circuit and get all the synonyms, then it’s good.

aazaff commented 2 years ago

From Zoom chat on May 16

  1. Split OTU and Specimen Descriptions nodes into two
  2. Add new relationship from specimens to OTU descriptions called Holotype_of; one-to-one rule
    1. (cont.) from specimen perspective one-to-many; from OTU it's one-to-one
  3. A specimen can have multiple DESCRIBED_BYs
  4. Add intermediate SYNONYM node to SAME_AS relationships joining OTUs, no polarity
  5. Add intermediate NODE (undecided on name) to EXAMPLE_OF relationships between Specimen and OTU nodes
    1. (cont.) candidate names are Identification, Example,
  6. Obtain all valid synonyms; valid means everything you reach by following all of the Same_as relationships off of a node until you've completed the circle OR a SYNONYM breaks the chain because it's downvoted
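Item 6 amounts to collecting everything reachable over SAME_AS links while refusing to cross a downvoted SYNONYM. Below is a minimal plain-Python sketch of that rule. The `downvoted` flag is an assumption, since the notes do not say how many downvotes break the chain, and OTU "004" is a made-up extra node.

```python
from collections import deque


def valid_synonyms(start, synonyms):
    """Return every OTU reachable from `start` through non-downvoted synonyms."""
    # Build an undirected adjacency map, since SAME_AS has no polarity.
    neighbors = {}
    for syn in synonyms:
        if syn["downvoted"]:
            continue  # a downvoted SYNONYM breaks the chain
        a, b = syn["otus"]
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbors.get(current, ()):
            if other not in seen:  # completing the circle stops here
                seen.add(other)
                queue.append(other)
    seen.discard(start)
    return seen


# Hypothetical data: 001, 002, 003 form a circle; 003-004 is downvoted.
synonyms = [
    {"otus": ("001", "002"), "downvoted": False},
    {"otus": ("002", "003"), "downvoted": False},
    {"otus": ("003", "001"), "downvoted": False},
    {"otus": ("003", "004"), "downvoted": True},
]
print(sorted(valid_synonyms("001", synonyms)))
```

Because there is no polarity, the result is the same set of names whichever member of the circle the search starts from, minus the starting node itself.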

NoisyFlowers commented 2 years ago

Something about simply splitting OTU and Description, keeping both as the hub of a Description-complex, feels off to me. What do you think of keeping Description nodes as the hub of all complexes and hanging a new OTU node type off of those when appropriate?

OTU-Desc

NoisyFlowers commented 2 years ago

Here's a potential can of worms: Should mergedDescriptions in OTU queries include Descriptions from Synonyms?

aazaff commented 2 years ago

No! They absolutely should not. That would have to be a separate class of query.

NoisyFlowers commented 2 years ago

Whew, good.

NoisyFlowers commented 2 years ago

Recording this here for posterity, in case it ever comes up again. Here's how I think such a query would look in Cypher (not thoroughly tested):

        CALL {
            MATCH
                (otu:OTU {pbotID:"26c180d1-6179-4f29-8b8a-e3740573da22"})<-[:EXAMPLE_OF|:HOLOTYPE_OF]-(specimen:Specimen)-[:DESCRIBED_BY]->(d:Description)
            RETURN
                d
            UNION
            MATCH
                (otu:OTU {pbotID:"26c180d1-6179-4f29-8b8a-e3740573da22"})-[:SAME_AS]->(:Synonym)<-[:SAME_AS]-(:OTU)<-[:EXAMPLE_OF|:HOLOTYPE_OF]-(specimen:Specimen)-[:DESCRIBED_BY]->(d:Description)
            RETURN
                d
        }
        // DISTINCT gets rid of possible EXAMPLE_OF/HOLOTYPE_OF duplicates.
        WITH
            DISTINCT d
        MATCH
            (d)-[:DEFINED_BY]->(ci:CharacterInstance),
            (ci)-[:INSTANCE_OF]->(c:Character),
            (ci)-[hs:HAS_STATE]->(s:State),
            (d)-[:APPLICATION_OF]->(schema:Schema)
        // Tuck the order and value relationship properties in a temp object with the state for use later.
        // For order, we want to save the average value for this state. By aggregating on order, we also
        // limit s to distinct states, so we don't need to specify DISTINCT.
        WITH
            DISTINCT c, schema, s{.*, value: hs.value, order: avg(toInteger(hs.order))}
        RETURN
            {
                schema: schema.title,
                characterName: c.name,
                characterID: c.pbotID,
                stateName: s.name,
                stateID: s.pbotID,
                stateOrder: s.order,
                stateValue: s.value
            } AS md