microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/
Other
4 stars 3 forks source link

Implement API endpoint that returns name of collection containing document having a given `id` #532

Closed eecavanna closed 2 weeks ago

eecavanna commented 1 month ago

Description

In this branch, I implemented a new API endpoint. Its behavior is described in its docstring, shown here:

@router.get("/nmdcschema/ids/{doc_id}/collection-name")
def get_collection_name_by_doc_id(
    doc_id: str,
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    r"""
    Returns the name of the collection, if any, containing the document having the specified `id`.

    This endpoint uses the NMDC Schema to determine the schema class of which an instance could have
    the specified value as its `id`; and then uses the NMDC Schema to determine the names of the
    `Database` slots (i.e. Mongo collection names) that could contain instances of that schema class.

    This endpoint then searches those Mongo collections for a document having that `id`.
    If it finds one, it responds with the name of the collection containing the document.
    If it does not find one, it response with an `HTTP 404 Not Found` response.
    """

Fixes https://github.com/microbiomedata/nmdc-runtime/issues/531

Type of change

Please delete options that are not relevant.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

Configuration Details: none

Checklist:

eecavanna commented 1 month ago

@aclum, is this the type of HTTP response you had in mind for the endpoint that we discussed; the endpoint that, for a given id, determines the collection—if any—in which a document having that id would reside, and determines the schema class—if any—of which an instance would have that as its id?

    {
        "id": "nmdc:sty-10-123456abcdef",
        "collection_name": "study_set",
        "class_name": "Study",
    }

Note: The value of the "id" property in the HTTP response would be a verbatim copy of the one in the HTTP request. It's only included in the response for the convenience of the client.

CC: @sujaypatil96

aclum commented 1 month ago

Yes. I think this would also be useful for the data portal, I have a ticket that has been in for a while to be able to search by an nmdc identifier. https://github.com/microbiomedata/nmdc-server/issues/964 cc @naglepuff @marySalvi @jeffbaumes

dwinston commented 3 weeks ago

What is expected for an id that looks like e.g. nmdc:clsite-99-123abc? From /nmdcschema/typecodes, clsite is the typecode for a CollectingBiosamplesFromSite object, but this can be in either the collecting_biosamples_from_site_set (https://github.com/microbiomedata/nmdc-schema/blob/v10.5.3/nmdc_schema/nmdc_materialized_patterns.schema.json#L3531) or planned_process_set (https://github.com/microbiomedata/nmdc-schema/blob/v10.5.3/nmdc_schema/nmdc_materialized_patterns.schema.json#L3653) slots, according to the schema. So in this case, both collection names should be returned, no?

I have updated the PR to reflect this and return a list of collections.

aclum commented 3 weeks ago

The issue of being valid for multiple collections is fixed in berkeley schema. The original feature of this endpoint was to return which collection the ID is in so if it is ambiguous for now it should check which collection the document actually resides in.

eecavanna commented 2 weeks ago

The original feature of this endpoint was to return which collection the ID is in so if it is ambiguous for now it should check which collection the document actually resides in.

This endpoint only uses the schema, not the database. The class_name it returns is the name of the schema class, if any, for which the specified value would be a valid id (on an instance of that class). The collection_names list it returns is a list of the names of all the "collections" (technically, the names of slots of the Database schema class), if any, that could contain a document having that value as its id.

In other words, it currently doesn't do this:

if it is ambiguous for now it should check which collection the document actually resides in.

Instead, it returns a list of all collections that — according to the schema — a document having that id could reside in.

CC: @aclum

eecavanna commented 2 weeks ago

Sounds to me like one of the requirements for this endpoint is to also indicate which collection, if any, a document having the specified id resides in.

eecavanna commented 2 weeks ago

Sounds to me like one of the requirements for this endpoint is to also indicate which collection, if any, a document having the specified id resides in.

Update: I have added this feature to the endpoint.

Here's the response shape:

{
    "id": "string",                             // `id` from the URL (the `hypothetical_doc_id` path parameter)
    "compatible_class_name": "string",          // name of the class of which an instance _could_ have that `id`
    "compatible_collection_names": ["string"],  // names of all collections that _could_ contain a document having that `id`
    "containing_collection_name": "string"      // name of the collection in which a document having that `id` _does_ exist, if any
}

Example response:

{
    "id": "nmdc:sty-1-foobar",
    "compatible_class_name": "Study",
    "compatible_collection_names": ["study_set"],
    "containing_collection_name": "study_set"
}
eecavanna commented 2 weeks ago

I am ready for this PR branch to be reviewed/merged.

eecavanna commented 2 weeks ago

Hi @aclum, are you OK with how this endpoint behaves? If so (and someone approves it via GitHub's Review mechanism), I'll merge it in. I accidentally invalidated @sujaypatil96's approval by making an additional commit.

@sujaypatil96 added screenshots of some example request/response pairs above—thanks, @sujaypatil96!

PeopleMakeCulture commented 2 weeks ago

@eecavanna You're good to merge. If/when there's time, I'd like to consolidate and replace these utils with your logic. (non-blocking):

eecavanna commented 2 weeks ago

Based on @aclum's feedback during today's infrastructure meeting, I will update the endpoint as follows:

eecavanna commented 2 weeks ago

Sorry about all the thrash here! Getting the automated tests running locally was not happening smoothly for me (I think one of the tests hung last night), so I've been relying on the GHA workflow to run them for me. A downside of that is that the test failures generate notifications—at least to me.

eecavanna commented 2 weeks ago

Hi @aclum, I updated the API response based upon our conversation earlier today.

The new behavior is:

  1. Scenario 1: User specifies the id of nmdc:sty-1-foobar to the API. A document having that id exists in the study_set collection. The API responds with:

    {
      "id": "nmdc:sty-1-foobar",
      "collection_name": "study_set",
    }
  2. Scenario 2: User specifies the id of nmdc:sty-1-nonex to the API. A document having that id does not exist in any collections (that the schema says it can exist in). The API responds with a HTTP 404 Not Found response.

I'm ready for this PR branch to be reviewed/merged in.