ranking-agent / aragorn

A Translator ARA combining asynchronous database querying, answer coalescence, and answer ranking.
MIT License

Nodes referenced in edges not present in knowledge graph #248

Closed MarkDWilliams closed 4 months ago

MarkDWilliams commented 5 months ago

I'm noticing an issue where Aragorn is returning edges that reference nodes as either subject or object, but there is no node with that identifier in knowledge_graph>nodes. I've included links to an example from the ARS, but I also sent the same query directly to Aragorn, without involving the ARS at all, and got the same issue. So it does not seem to be due to the blocklist or node normalization in the ARS, which is what I initially suspected.

Link to the parent: https://ars.ci.transltr.io/ars/api/messages/dc6058da-0dcc-47a3-a549-5619d84d987c?trace=y

Aragorn's specific return: https://ars.ci.transltr.io/ars/api/messages/db45177f-86c6-4588-8f25-d86eca3cb1a8

These are some subjects that don't appear to have a corresponding node: PathWhiz.Reaction:1926 NCBIGene:7442 MESH:D013762
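For anyone wanting to reproduce the check, here is a minimal sketch of how the dangling references can be listed. It assumes the usual TRAPI layout, where `knowledge_graph.nodes` and `knowledge_graph.edges` are dictionaries keyed by identifier; the function name `find_dangling_edges` and the tiny example message are made up for illustration and are not part of Aragorn.

```python
def find_dangling_edges(message):
    """Return (edge_id, role, node_id) for every edge endpoint missing from the KG nodes."""
    kg = message["knowledge_graph"]
    nodes = kg["nodes"]
    dangling = []
    for edge_id, edge in kg["edges"].items():
        for role in ("subject", "object"):
            if edge[role] not in nodes:
                dangling.append((edge_id, role, edge[role]))
    return dangling


# Hypothetical, heavily trimmed message: only the fields the check needs.
example = {
    "knowledge_graph": {
        "nodes": {"CHEBI:6801": {}, "MONDO:0002959": {}},
        "edges": {
            "e0": {"subject": "CHEBI:6801", "object": "MONDO:0002959"},
            "f24861a70be5": {"subject": "NCBIGene:55244", "object": "CHEBI:6801"},
        },
    }
}

for edge_id, role, node_id in find_dangling_edges(example):
    print(f"edge {edge_id}: {role} {node_id} is not in knowledge_graph.nodes")
```

Running the same function on the JSON pulled from the ARS links should list the subjects named above, if the response follows the assumed layout.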

MarkDWilliams commented 5 months ago

aragornMissingSubjects.json

cbizon commented 4 months ago

Confirmed that the nodes are missing. The edges do not look like orphans, though: there is an edge binding to an inferred edge, and its support graph contains an edge whose subject is not in the KG.

cbizon commented 4 months ago

I thought it could have something to do with normalization, but the NCBIGene and MESH IDs both normalize.

cbizon commented 4 months ago

OK, I think that this is a Strider problem. Attached is a JSON file of a result coming back from multistrider.

82a1487bd5ef_19.json

The subquery here is (on:MONDO:0002959)-[edge_0]-(?e:Chemical)-[edge_1]-(?i:BiologicalEntity)-[edge_2]-(sn:Chemical)

What happens with this kind of query is that sometimes e and sn get bound to the same node. It fits the query, but the real query should have an AND e != sn clause. And it makes gibberish answers the way that they get laid out in the UI. So ARAGORN removes that kind of answer, then it does a round of de-orphaning. Note that it does this after merging all the strider sub-answers.
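To make the order of operations concrete, here is a rough sketch of the two steps (drop results where e and sn are the same node, then de-orphan the KG). This is not ARAGORN's actual code: the function names, the `make_result` helper, and the trimmed message are all hypothetical, and support graphs are ignored. It does reproduce the symptom, though: an edge survives because a remaining result binds it, while one of its endpoints is only bound in a result that was removed.

```python
def drop_same_node_results(results, qnode_a="e", qnode_b="sn"):
    """Keep only results where the two query nodes are bound to different KG nodes."""
    kept = []
    for result in results:
        ids_a = {b["id"] for b in result["node_bindings"].get(qnode_a, [])}
        ids_b = {b["id"] for b in result["node_bindings"].get(qnode_b, [])}
        if not (ids_a & ids_b):
            kept.append(result)
    return kept


def remove_orphans(message):
    """Drop KG nodes and edges that no remaining result binds (support graphs ignored)."""
    used_nodes, used_edges = set(), set()
    for result in message["results"]:
        for bindings in result["node_bindings"].values():
            used_nodes.update(b["id"] for b in bindings)
        for analysis in result.get("analyses", []):
            for bindings in analysis["edge_bindings"].values():
                used_edges.update(b["id"] for b in bindings)
    kg = message["knowledge_graph"]
    kg["nodes"] = {k: v for k, v in kg["nodes"].items() if k in used_nodes}
    kg["edges"] = {k: v for k, v in kg["edges"].items() if k in used_edges}


def make_result(e, i, sn):
    """Hypothetical helper: a result that binds only edge_1 to f24861a70be5."""
    return {
        "node_bindings": {
            "on": [{"id": "MONDO:0002959"}],
            "e": [{"id": e}],
            "i": [{"id": i}],
            "sn": [{"id": sn}],
        },
        "analyses": [{"edge_bindings": {"edge_1": [{"id": "f24861a70be5"}]}}],
    }


node_ids = ("MONDO:0002959", "CHEBI:6801", "CHEBI:6888", "NCBIGene:55244", "NCBIGene:5243")
message = {
    "knowledge_graph": {
        "nodes": {node_id: {} for node_id in node_ids},
        "edges": {"f24861a70be5": {"subject": "NCBIGene:55244", "object": "CHEBI:6801"}},
    },
    "results": [
        make_result("CHEBI:6801", "NCBIGene:55244", "CHEBI:6801"),  # e == sn, removed
        make_result("CHEBI:6888", "NCBIGene:5243", "CHEBI:6801"),   # e != sn, kept
    ],
}

message["results"] = drop_same_node_results(message["results"])
remove_orphans(message)
print(sorted(message["knowledge_graph"]["nodes"]))
print(message["knowledge_graph"]["edges"])
```

In this toy message, the surviving result still binds f24861a70be5, so the edge stays, but NCBIGene:55244 is no longer bound by any result and is dropped from the nodes.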

When this runs, this particular subset ends up with an edge that has a subject of NCBIGene:55244, even though that node has been removed. The edge that is still sticking around is f24861a70be5. There are 3 results that contain this edge. Two of them have e == sn and look like: (MONDO:0002959)-e0-(CHEBI:6801)-f24861a70be5-(NCBIGene:55244)-f24861a70be5-(CHEBI:6801)

So one tangential point: I'm not sure why this result is repeated.

The third one, however, looks like this: (MONDO:0002959)-e0-(CHEBI:6888)-f24861a70be5-(NCBIGene:5243)-e2-(CHEBI:6801)

The edge f24... looks like:

{
  "subject": "NCBIGene:55244",
  "object": "CHEBI:6801",
  ...
}

So I think that its inclusion in this result is in error, because neither the subject nor the object lines up with the bound nodes. Furthermore, there are no subclass relationships in NCBIGene, and the two CHEBI IDs are also not subclasses of one another.
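The "lines up" check can be written down roughly as follows. This is a sketch under assumptions: the query graph edge `edge_1` is taken to connect `e` and `i` (its direction is not shown in this thread, so endpoints are compared as an unordered pair), and the function name `misaligned_bindings` is hypothetical. A mismatch is not automatically an error, since subclass expansion can legitimately bind a more specific node than the one queried.

```python
def misaligned_bindings(result, qgraph_edges, kg_edges):
    """List bound KG edges whose endpoints are not bound to the query edge's two query nodes."""
    problems = []
    for analysis in result.get("analyses", []):
        for qedge_id, bindings in analysis["edge_bindings"].items():
            qedge = qgraph_edges[qedge_id]
            allowed = set()
            for qnode_id in (qedge["subject"], qedge["object"]):
                allowed.update(b["id"] for b in result["node_bindings"].get(qnode_id, []))
            for binding in bindings:
                kg_edge = kg_edges[binding["id"]]
                unmatched = [kg_edge[role] for role in ("subject", "object")
                             if kg_edge[role] not in allowed]
                if unmatched:
                    problems.append((qedge_id, binding["id"], unmatched))
    return problems


# Assumed query edge layout; only edge_1 of the third result is modelled here.
qgraph_edges = {"edge_1": {"subject": "e", "object": "i"}}
kg_edges = {"f24861a70be5": {"subject": "NCBIGene:55244", "object": "CHEBI:6801"}}
third_result = {
    "node_bindings": {
        "on": [{"id": "MONDO:0002959"}],
        "e": [{"id": "CHEBI:6888"}],
        "i": [{"id": "NCBIGene:5243"}],
        "sn": [{"id": "CHEBI:6801"}],
    },
    "analyses": [{"edge_bindings": {"edge_1": [{"id": "f24861a70be5"}]}}],
}

print(misaligned_bindings(third_result, qgraph_edges, kg_edges))
```

For the third result this flags both endpoints of f24861a70be5, which is the "neither the subject nor the object lines up" situation described above.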

There are these errors in that message as well, which seem relevant but I don't understand:

{
  "timestamp": "2024-07-03T19:11:31.756346",
  "level": "ERROR",
  "code": null,
  "message": "[be4d41c3.7ee443c1.4f57aaa1] Setting NCBIGene:5243 query id on NCBIGene:55244"
}
{
  "timestamp": "2024-07-03T19:11:31.756205",
  "level": "ERROR",
  "code": null,
  "message": "[be4d41c3.7ee443c1.4f57aaa1] Got back {\"node_bindings\": {\"i\": [{\"id\": \"NCBIGene:55244\", \"query_id\": null, \"attributes\": []}], \"sn\": [{\"id\": \"CHEBI:6801\", \"query_id\": null, \"attributes\": []}]}, \"analyses\": [{\"resource_id\": \"infores:aragorn\", \"edge_bindings\": {\"edge_1\": [{\"id\": \"f24861a70be5\", \"attributes\": []}]}, \"score\": null, \"support_graphs\": null, \"scoring_method\": null, \"attributes\": null}]}"
}
{
  "timestamp": "2024-07-03T19:11:31.755937",
  "level": "ERROR",
  "code": null,
  "message": "[be4d41c3.7ee443c1.4f57aaa1] infores:automat-robokop gave back NCBIGene:55244 and doesn't match what was sent."
}
{
  "timestamp": "2024-07-03T19:11:31.124980",
  "level": "ERROR",
  "code": null,
  "message": "[be4d41c3.7ee443c1.4f57aaa1] Setting NCBIGene:5243 query id on NCBIGene:55244"
}
{
  "timestamp": "2024-07-03T19:11:31.124834",
  "level": "ERROR",
  "code": null,
  "message": "[be4d41c3.7ee443c1.4f57aaa1] Got back {\"node_bindings\": {\"i\": [{\"id\": \"NCBIGene:55244\", \"query_id\": null, \"attributes\": []}], \"sn\": [{\"id\": \"CHEBI:6801\", \"query_id\": null, \"attributes\": []}]}, \"analyses\": [{\"resource_id\": \"infores:aragorn\", \"edge_bindings\": {\"edge_1\": [{\"id\": \"f24861a70be5\", \"attributes\": []}]}, \"score\": null, \"support_graphs\": null, \"scoring_method\": null, \"attributes\": null}]}"
}
{
  "timestamp": "2024-07-03T19:11:31.124544",
  "level": "ERROR",
  "code": null,
  "message": "[be4d41c3.7ee443c1.4f57aaa1] infores:automat-ctd gave back NCBIGene:55244 and doesn't match what was sent."
}

So this might be a Plater problem of some kind? But even if it is, Strider is not responding correctly to whatever is happening.

cbizon commented 4 months ago

This seems like it might be related: https://github.com/NCATSTranslator/Feedback/issues/853. One of the validation errors indicates this problem.

maximusunc commented 4 months ago

original_message.json after_orphan_filtering_message.json

So the bug may still be in Strider, with the results getting messed up somehow, but the two linked files above show what my PR was hoping to solve. There's a result that contains an edge that references the missing node, and the filtering is currently removing that node. I'm going to keep looking into Strider to see what's going on there.
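As an illustration only (this is not the code in the PR), one way an orphan filter can avoid leaving dangling references is to also keep the endpoints of every edge it keeps. The function name and the trimmed KG below are hypothetical, and the sketch assumes the sets of bound node and edge identifiers have already been collected from the remaining results.

```python
def filter_orphans_keep_endpoints(kg, bound_node_ids, bound_edge_ids):
    """Keep bound edges, bound nodes, and any node that a kept edge still references."""
    kept_edges = {eid: edge for eid, edge in kg["edges"].items() if eid in bound_edge_ids}
    keep_nodes = set(bound_node_ids)
    for edge in kept_edges.values():
        keep_nodes.update((edge["subject"], edge["object"]))
    kept_nodes = {nid: node for nid, node in kg["nodes"].items() if nid in keep_nodes}
    return {"nodes": kept_nodes, "edges": kept_edges}


# Hypothetical trimmed KG; UNUSED:1 and unused_edge stand in for true orphans.
kg = {
    "nodes": {
        "MONDO:0002959": {}, "CHEBI:6801": {}, "CHEBI:6888": {},
        "NCBIGene:5243": {}, "NCBIGene:55244": {}, "UNUSED:1": {},
    },
    "edges": {
        "f24861a70be5": {"subject": "NCBIGene:55244", "object": "CHEBI:6801"},
        "unused_edge": {"subject": "UNUSED:1", "object": "CHEBI:6801"},
    },
}
bound_nodes = {"MONDO:0002959", "CHEBI:6888", "NCBIGene:5243", "CHEBI:6801"}
bound_edges = {"f24861a70be5"}

filtered = filter_orphans_keep_endpoints(kg, bound_nodes, bound_edges)
print(sorted(filtered["nodes"]))
print(sorted(filtered["edges"]))
```

This only hides the symptom: the result still binds an edge whose endpoints do not match its node bindings, so the upstream cause would need a separate fix.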

maximusunc commented 4 months ago

The node in question is: PathWhiz.Reaction:1903

cbizon commented 4 months ago

Right, I think your PR handles the case where the subject or object of the edge doesn't match the node binding. What has given me pause is in trying to understand why they don't match up and whether that's right or wrong. So in the message you posted, there's this result:

{
        "node_bindings": {
          "i": [
            {
              "id": "GO:0015101",
              "attributes": []
            }
          ],
          "sn": [
            {
              "id": "CHEBI:3387",
              "attributes": []
            }
          ],
          "e": [
            {
              "id": "CHEBI:6801",
              "attributes": []
            }
          ],
          "on": [
            {
              "id": "MONDO:0002959",
              "attributes": []
            }
          ]
        },
        "analyses": [
          {
            "resource_id": "infores:aragorn",
            "edge_bindings": {
              "edge_2": [
                {
                  "id": "f233a8c0c466",
                  "attributes": []
                }
              ],
              "edge_1": [
                {
                  "id": "e3b9b67046fd",
                  "attributes": []
                },
                {
                  "id": "d0d3bb709001",
                  "attributes": []
                },
                {
                  "id": "d2351a4a583c",
                  "attributes": []
                }
              ],
              "edge_0": [
                {
                  "id": "0b4109ed29f5",
                  "attributes": []
                }
              ]
            }
          }
        ]
      },

All of those edge_1 kg_edges have PathWhiz reactions as their subject, which should be node i. But node i is a GO term. So I suppose what is happening is that there is subclass reasoning happening implicitly here? Hard to evaluate though. For one thing, I can't find any PathWhiz IDs with those numbers (and the PathWhiz server seems dead). But I don't see how they can be subclasses of a GO term.

Still, I suspect that's what's happening, or at least how this should be interpreted. But in the example that I posted above, where the bum node was a gene, this explanation would require that we have "gene subclass of other gene", which I don't think anybody should be returning.

If these are really subclass of issues, I think there are maybe some better ways to represent things?

maximusunc commented 4 months ago

I think we're on the same page. I think Strider is doing something wrong and these results are either getting mangled or an explicit subclass is becoming ambiguous. The PR would just be a patch to get rid of the error, but the real fix is in Strider.

maximusunc commented 4 months ago

The fix is currently deployed in DEV. Here are a couple of links:

ARS-DEV: https://ars-dev.transltr.io/ars/api/messages/f49e7760-5a04-422d-aee6-b020edd81223

ARAX UI that shows no missing edge errors: https://arax.ncats.io/?r=f49e7760-5a04-422d-aee6-b020edd81223

maximusunc commented 4 months ago

Fix has been deployed to CI and will be included in the Fugu release.