eecavanna commented 1 month ago

Description

In this branch, I implemented a new API endpoint. Its behavior is described in its docstring, shown here:

@router.get("/nmdcschema/ids/{doc_id}/collection-name")
def get_collection_name_by_doc_id(
    doc_id: str,
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    r"""
    Returns the name of the collection, if any, containing the document having the specified `id`.

    This endpoint uses the NMDC Schema to determine the schema class of which an instance could have
    the specified value as its `id`; and then uses the NMDC Schema to determine the names of the
    `Database` slots (i.e. Mongo collection names) that could contain instances of that schema class.

    This endpoint then searches those Mongo collections for a document having that `id`.
    If it finds one, it responds with the name of the collection containing the document.
    If it does not find one, it response with an `HTTP 404 Not Found` response.
    """

Fixes https://github.com/microbiomedata/nmdc-runtime/issues/531

Type of change

Please delete options that are not relevant.

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

[x] I implemented an automated test targeting the new functionality and that test ran/passed on GitHub Actions:

Configuration Details: none

Checklist:

[x] My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
[x] I have performed a self-review of my code
[x] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
[x] I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
[ ] New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

I delegated this test running to GitHub Actions, which passes

eecavanna commented 1 month ago

@aclum, is this the type of HTTP response you had in mind for the endpoint that we discussed; the endpoint that, for a given id, determines the collection—if any—in which a document having that id would reside, and determines the schema class—if any—of which an instance would have that as its id?

    {
        "id": "nmdc:sty-10-123456abcdef",
        "collection_name": "study_set",
        "class_name": "Study",
    }

Note: The value of the "id" property in the HTTP response would be a verbatim copy of the one in the HTTP request. It's only included in the response for the convenience of the client.

CC: @sujaypatil96

aclum commented 1 month ago

Yes. I think this would also be useful for the data portal, I have a ticket that has been in for a while to be able to search by an nmdc identifier. https://github.com/microbiomedata/nmdc-server/issues/964 cc @naglepuff @marySalvi @jeffbaumes

dwinston commented 3 weeks ago

What is expected for an id that looks like e.g. nmdc:clsite-99-123abc? From /nmdcschema/typecodes, clsite is the typecode for a CollectingBiosamplesFromSite object, but this can be in either the collecting_biosamples_from_site_set (https://github.com/microbiomedata/nmdc-schema/blob/v10.5.3/nmdc_schema/nmdc_materialized_patterns.schema.json#L3531) or planned_process_set (https://github.com/microbiomedata/nmdc-schema/blob/v10.5.3/nmdc_schema/nmdc_materialized_patterns.schema.json#L3653) slots, according to the schema. So in this case, both collection names should be returned, no?

I have updated the PR to reflect this and return a list of collections.

aclum commented 3 weeks ago

The issue of being valid for multiple collections is fixed in berkeley schema. The original feature of this endpoint was to return which collection the ID is in so if it is ambiguous for now it should check which collection the document actually resides in.

eecavanna commented 2 weeks ago

The original feature of this endpoint was to return which collection the ID is in so if it is ambiguous for now it should check which collection the document actually resides in.

This endpoint only uses the schema, not the database. The class_name it returns is the name of the schema class, if any, for which the specified value would be a valid id (on an instance of that class). The collection_names list it returns is a list of the names of all the "collections" (technically, the names of slots of the Database schema class), if any, that could contain a document having that value as its id.

In other words, it currently doesn't do this:

if it is ambiguous for now it should check which collection the document actually resides in.

Instead, it returns a list of all collections that — according to the schema — a document having that id could reside in.

CC: @aclum

eecavanna commented 2 weeks ago

Sounds to me like one of the requirements for this endpoint is to also indicate which collection, if any, a document having the specified id resides in.

eecavanna commented 2 weeks ago

Sounds to me like one of the requirements for this endpoint is to also indicate which collection, if any, a document having the specified id resides in.

Update: I have added this feature to the endpoint.

Here's the response shape:

{
    "id": "string",                             // `id` from the URL (the `hypothetical_doc_id` path parameter)
    "compatible_class_name": "string",          // name of the class of which an instance _could_ have that `id`
    "compatible_collection_names": ["string"],  // names of all collections that _could_ contain a document having that `id`
    "containing_collection_name": "string"      // name of the collection in which a document having that `id` _does_ exist, if any
}

Example response:

{
    "id": "nmdc:sty-1-foobar",
    "compatible_class_name": "Study",
    "compatible_collection_names": ["study_set"],
    "containing_collection_name": "study_set"
}

eecavanna commented 2 weeks ago

I am ready for this PR branch to be reviewed/merged.

eecavanna commented 2 weeks ago

Hi @aclum, are you OK with how this endpoint behaves? If so (and someone approves it via GitHub's Review mechanism), I'll merge it in. I accidentally invalidated @sujaypatil96's approval by making an additional commit.

@sujaypatil96 added screenshots of some example request/response pairs above—thanks, @sujaypatil96!

PeopleMakeCulture commented 2 weeks ago

@eecavanna You're good to merge. If/when there's time, I'd like to consolidate and replace these utils with your logic. (non-blocking):

nmdc_runtime.api.core.metadata.map_id_to_collection
nmdc_runtime.util.collection_name_to_class_names
nmdc_runtime.api.db.mongo.nmdc_schema_collection_names

eecavanna commented 2 weeks ago

Based on @aclum's feedback during today's infrastructure meeting, I will update the endpoint as follows:

[x] 1. If a document having the specified id exists in any of the collections the schema says it can exist in, return the name of that collection. If it happens to exist in multiple collections, only return the name of one collection that it exists in.
[x] 2. If no document having the specified id exists in any of the collections the schema says it can exist in, return a "null-ish" response (TBD).

eecavanna commented 2 weeks ago

Sorry about all the thrash here! Getting the automated tests running locally was not happening smoothly for me (I think one of the tests hung last night), so I've been relying on the GHA workflow to run them for me. A downside of that is that the test failures generate notifications—at least to me.

eecavanna commented 2 weeks ago

Hi @aclum, I updated the API response based upon our conversation earlier today.

The new behavior is:

Scenario 1: User specifies the id of nmdc:sty-1-foobar to the API. A document having that id exists in the study_set collection. The API responds with:
```
{
  "id": "nmdc:sty-1-foobar",
  "collection_name": "study_set",
}
```
Scenario 2: User specifies the id of nmdc:sty-1-nonex to the API. A document having that id does not exist in any collections (that the schema says it can exist in). The API responds with a HTTP 404 Not Found response.

I'm ready for this PR branch to be reviewed/merged in.

microbiomedata / nmdc-runtime

Implement API endpoint that returns name of collection containing document having a given `id` #532

Description

Type of change

How Has This Been Tested?

Checklist: