monarch-initiative / biolink-api

API for linked biological knowledge
https://api.monarchinitiative.org/api/
BSD 3-Clause "New" or "Revised" License
63 stars 25 forks source link

`/sim/compare` `GET` results different from `POST` results #389

Closed vincerubinetti closed 2 years ago

vincerubinetti commented 2 years ago

@falquaddoomi and/or @putmantime Can you investigate this a little bit?

Not sure if this is intentional or not, but if it is intentional, it's confusing and not documented enough.

Example GET:

curl -X GET "https://api.monarchinitiative.org/api/sim/compare?is_feature_set=false&metric=phenodigm&ref_id=HP%3A0000322&ref_id=HP%3A0001166&ref_id=HP%3A0001238&query_id=HP%3A0000322&query_id=HP%3A0001166&query_id=HP%3A0001238" -H "accept: application/json
Response ```json { "query": { "ids": [ { "id": "HP:0001166", "label": "Arachnodactyly" }, { "id": "HP:0001238", "label": "Slender finger" }, { "id": "HP:0000322", "label": "Short philtrum" } ], "negated_ids": [], "unresolved_ids": [], "target_ids": [ [ { "id": "HP:0001166", "label": "Arachnodactyly" }, { "id": "HP:0001238", "label": "Slender finger" }, { "id": "HP:0000322", "label": "Short philtrum" } ] ], "reference": { "type": "unknown", "taxon": { "id": null, "label": null }, "id": "HP:0000322 + HP:0001166 + HP:0001238", "label": "HP:0000322 + HP:0001166 + HP:0001238" } }, "matches": [ { "rank": "NaN", "score": 100, "significance": "NaN", "pairwise_match": [ { "reference": { "IC": 7.764605157463794, "id": "HP:0000322", "label": "Short philtrum" }, "match": { "IC": 7.764605157463794, "id": "HP:0000322", "label": "Short philtrum" }, "lcs": { "IC": 7.764605157463794, "id": "MP:0030193", "label": "short philtrum" } }, { "reference": { "IC": 7.217375742667661, "id": "HP:0001166", "label": "Arachnodactyly" }, "match": { "IC": 7.217375742667661, "id": "HP:0001166", "label": "Arachnodactyly" }, "lcs": { "IC": 7.217375742667661, "id": "MP:0006296", "label": "arachnodactyly" } } ], "type": null, "taxon": { "id": null, "label": null }, "id": "HP:0000322 + HP:0001166 + HP:0001238", "label": "HP:0000322 + HP:0001166 + HP:0001238" } ], "metadata": { "max_max_ic": 16.48703 } } ```

Equivalent example POST:

curl -X POST "https://api.monarchinitiative.org/api/sim/compare" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"reference_ids\": [ \"HP:0000322\", \"HP:0001166\", \"HP:0001238\" ], \"query_ids\": [ [\"HP:0000322\"], [\"HP:0001166\"], [\"HP:0001238\"] ]}"
Response ```json { "query": { "ids": [ { "id": "HP:0001166", "label": "Arachnodactyly" }, { "id": "HP:0001238", "label": "Slender finger" }, { "id": "HP:0000322", "label": "Short philtrum" } ], "negated_ids": [], "unresolved_ids": [], "target_ids": [ [ { "id": "HP:0000322", "label": "Short philtrum" } ], [ { "id": "HP:0001166", "label": "Arachnodactyly" } ], [ { "id": "HP:0001238", "label": "Slender finger" } ] ], "reference": { "type": "unknown", "taxon": { "id": null, "label": null }, "id": "HP:0000322 + HP:0001166 + HP:0001238", "label": "HP:0000322 + HP:0001166 + HP:0001238" } }, "matches": [ { "rank": "NaN", "score": 84, "significance": "NaN", "pairwise_match": [ { "reference": { "IC": 7.764605157463794, "id": "HP:0000322", "label": "Short philtrum" }, "match": { "IC": 7.764605157463794, "id": "HP:0000322", "label": "Short philtrum" }, "lcs": { "IC": 7.764605157463794, "id": "MP:0030193", "label": "short philtrum" } } ], "type": "phenotype", "taxon": { "id": null, "label": null }, "id": "HP:0000322", "label": "Short philtrum" }, { "rank": "NaN", "score": 78, "significance": "NaN", "pairwise_match": [ { "reference": { "IC": 7.217375742667661, "id": "HP:0001166", "label": "Arachnodactyly" }, "match": { "IC": 7.217375742667661, "id": "HP:0001166", "label": "Arachnodactyly" }, "lcs": { "IC": 7.217375742667661, "id": "MP:0006296", "label": "arachnodactyly" } } ], "type": "phenotype", "taxon": { "id": null, "label": null }, "id": "HP:0001166", "label": "Arachnodactyly" }, { "rank": "NaN", "score": 77, "significance": "NaN", "pairwise_match": [ { "reference": { "IC": 7.217375742667661, "id": "HP:0001166", "label": "Arachnodactyly" }, "match": { "IC": 7.151580619870594, "id": "HP:0001238", "label": "Slender finger" }, "lcs": { "IC": 7.151580619870594, "id": "HP:0001238", "label": "Slender finger" } } ], "type": "phenotype", "taxon": { "id": null, "label": null }, "id": "HP:0001238", "label": "Slender finger" } ], "metadata": { "max_max_ic": 16.48703 } } ```

Perhaps a related issue, trying to use the GET endpoint seems to only return 1 match no matter what I enter (but maybe I just haven't tried enough examples).

vincerubinetti commented 2 years ago

~I just noticed in the GET response that one of the phenotype sets is empty: "target_ids": [[]].~ This is no longer the case, I don't know where/when/why I saw this.

Note that I just used the Swagger UI thing to fill in these parameters...

image
falquaddoomi commented 2 years ago

I'm not seeing the response for that first command curl -X GET "https://api.monarchinitiative.org/api/sim/compare?is_feature_set=false&metric=phenodigm&ref_id=HP%3A0000322&ref_id=HP%3A0001166&ref_id=HP%3A0001238&query_id=HP%3A0000322&query_id=HP%3A0001166&query_id=HP%3A0001238" -H "accept: application/json" that you mentioned; instead I see this:

{
  "query": {
    "ids": [
      {
        "id": "HP:0001166",
        "label": "Arachnodactyly"
      },
      {
        "id": "HP:0001238",
        "label": "Slender finger"
      },
      {
        "id": "HP:0000322",
        "label": "Short philtrum"
      }
    ],
    "negated_ids": [],
    "unresolved_ids": [],
    "target_ids": [
      [
        {
          "id": "HP:0001166",
          "label": "Arachnodactyly"
        },
        {
          "id": "HP:0001238",
          "label": "Slender finger"
        },
        {
          "id": "HP:0000322",
          "label": "Short philtrum"
        }
      ]
    ],
    "reference": {
      "type": "unknown",
      "taxon": {
        "id": null,
        "label": null
      },
      "id": "HP:0000322 + HP:0001166 + HP:0001238",
      "label": "HP:0000322 + HP:0001166 + HP:0001238"
    }
  },
  "matches": [
    {
      "rank": "NaN",
      "score": 100,
      "significance": "NaN",
      "pairwise_match": [
        {
          "reference": {
            "IC": 7.764605157463794,
            "id": "HP:0000322",
            "label": "Short philtrum"
          },
          "match": {
            "IC": 7.764605157463794,
            "id": "HP:0000322",
            "label": "Short philtrum"
          },
          "lcs": {
            "IC": 7.764605157463794,
            "id": "MP:0030193",
            "label": "short philtrum"
          }
        },
        {
          "reference": {
            "IC": 7.217375742667661,
            "id": "HP:0001166",
            "label": "Arachnodactyly"
          },
          "match": {
            "IC": 7.217375742667661,
            "id": "HP:0001166",
            "label": "Arachnodactyly"
          },
          "lcs": {
            "IC": 7.217375742667661,
            "id": "MP:0006296",
            "label": "arachnodactyly"
          }
        }
      ],
      "type": null,
      "taxon": {
        "id": null,
        "label": null
      },
      "id": "HP:0000322 + HP:0001166 + HP:0001238",
      "label": "HP:0000322 + HP:0001166 + HP:0001238"
    }
  ],
  "metadata": {
    "max_max_ic": 16.48703
  }
}

If I reduce the query_ids to just one (e.g., HP:0000322), the output is identical:

curl \
    -X GET \
    "https://api.monarchinitiative.org/api/sim/compare?is_feature_set=false&metric=phenodigm&ref_id=HP%3A0000322&query_id=HP%3A0000322" \
    -H "accept: application/json" > HP_0000322__GET.json

curl \
    -X POST \
    "https://api.monarchinitiative.org/api/sim/compare" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{ \"reference_ids\": [ \"HP:0000322\" ], \"query_ids\": [ [\"HP:0000322\"] ]}"

diff HP_0000322__GET.json HP_0000322__POST.json

# no output, so nothing appears to be different

Perhaps a related issue, trying to use the GET endpoint seems to only return 1 match no matter what I enter (but maybe I just haven't tried enough examples).

Seems intentional to me; the GET endpoint's description is "Compare a reference profile vs one profiles" whereas the POST one is "Compare a reference profile vs one or more profiles". Is there a problem with using the POST endpoint if you need to do multiple comparisons?

vincerubinetti commented 2 years ago

I'm not seeing the response for that first command

I accidentally had pasted the wrong response json in there. I've updated it and it's what you posted above.

If I reduce the query_ids to just one (e.g., HP:0000322), the output is identical:

I believe they should be identical even if you don't just query one id.

GET endpoint's description is "Compare a reference profile vs one profiles" whereas the POST one is "Compare a reference profile vs one or more profiles".

It seems like you're thinking that a "profile" is one phenotype. Just to be clear, a "profile" in this case is a set of phenotypes.

That's why the POST endpoint takes an array of arrays for the query phenotypes:

image

If I do that same 3 x 3 comparison (HP:0000322, HP:0001166, HP:0001238) in Monarch UI 2.0, I get these similarity scores:

image

Those numbers don't exist in the GET response, only the POST.

Is there a problem with using the POST endpoint

None of this really affects my concerns on the frontend, I'm just bringing it up as a potential bug of concern.

falquaddoomi commented 2 years ago

D'oh, you're right, I totally misunderstood that "profiles" part, sorry.

In summary, I think what's happening is that when you issue a GET, you're getting a single profile that consists of "HP:0000322 + HP:0001166 + HP:0001238", as the id and label fields under query.reference indicate. When you issue a POST, the three phenotype IDs are being treated as three independent lists, each containing a single phenotype. I agree that this is definitely poorly documented, but I don't know if it's a bug per se...

Here's the GET vs POST code for /sim/compare: https://github.com/biolink/biolink-api/blob/master/biolink/api/sim/endpoints/semanticsim.py#L98. Aside from the default values being set in the POST parser (which you set anyway), I think the key difference is that SimCompare.sim_engine.compare() is called with query_profiles=[input_args['query_id']], i.e. the list you pass into the GET parser for query_id is always wrapped in a list.

Going back to your example, the GET value for query_profiles looks like ["HP:0000322", "HP:0001166", "HP:0001238"] and the POST looks like [ ["HP:0000322"], ["HP:0001166"], ["HP:0001238"] ]. I suppose the equivalent POST to your GET would be {..., "query_ids": [ ["HP:0000322", "HP:0001166", "HP:0001238"] ] }.

As far as making the behavior the same between the methods goes, we'd have to adopt a format for indicating lists of lists, as all you can indicate currently by repeating querystring arguments is a single list.

vincerubinetti commented 2 years ago

When you issue a POST, the three phenotype IDs are being treated as three independent lists, each containing a single phenotype

Yep that's the ticket, that's what was causing my issue.

Not sure why I put them separately like that. Might've been tired and confused by the schema.

We can consider this closed, though I would love if there was a way to add a comment to the code that shows up in the swagger docs that explains that schema. Would've saved me a lot of time writing this issue.

As far as making the behavior the same between the methods goes

I don't think we need to do that. GET being single and POST being multiple is fine, and it's actually documented. I was just confused as to why a 1 profile to 1 profile comparison was different between the two.

vincerubinetti commented 2 years ago

Not sure why I put them separately like that. Might've been tired and confused by the schema.

Now I remember why. Monarch UI 2.0 puts them like that:

image

It seems to me like that's an error... the query_ids should be in a single array like you formatted them. The 2.0 phenotype profile search "wizard" seems to suggest that we're comparing one set of phenotypes to another set of phenotypes, not several sets of single phenotypes....

I guess we can change this issue into a docs issue, plus a question for the TISLab of whether 2.0 was mistaken here.

vincerubinetti commented 2 years ago

@jmcmurry confirmed that it was a mistake, or at least an odd choice, for the UI 2.0 to break up the second set of phenotypes individually.

We could turn this into a docs issue, because it's still not 100% certain what this means:

image

I thought you just had to wrap query_ids in another array kinda for no reason, but in hindsight you can understand it from the context of the endpoint.

So, it could still be made more clear with docs, but it's also no more vague than any of the other endpoints, so closing.