microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/
Other
4 stars 3 forks source link

/v1/workflows/activities endpoint adds empty keys with null values to mongo #478

Closed aclum closed 4 months ago

aclum commented 4 months ago

I noticed some workflow records had null values, these were keys that the workflow automation code was not using. I tested and was able to confirm with runtime dev that if I submit a record with no null or empty values, then check the document with /nmdcschema there are keys with null values or empty arrays. Based on this we believe the /v1/workflows/activites API endpoint is adding these keys and null values and the desired behavior is to leave these records as they are.

Test record for /v1/workflows/activites tested with runtime dev {"read_qc_analysis_activity_set":[{ "id": "nmdc:wfrqc-11-t0tvnp52.2", "name": "Read QC Activity for nmdc:wfrqc-11-t0tvnp52.1", "started_at_time": "2024-01-11T20:48:30.718133+00:00", "ended_at_time": "2024-01-11T21:11:44.884260+00:00", "was_informed_by": "nmdc:omprc-11-9mvz7z22", "execution_resource": "NERSC-Perlmutter", "git_url": "https://github.com/microbiomedata/ReadsQC", "has_input": [ "nmdc:dobj-11-gpthnj64" ], "has_output": [ "nmdc:dobj-11-w5dak635", "nmdc:dobj-11-g6d71n77", "nmdc:dobj-11-bds7qq03" ], "type": "nmdc:ReadQcAnalysisActivity", "part_of": [ "nmdc:omprc-11-9mvz7z22" ], "version": "v1.0.8" }]}

then calling https://api-dev.microbiomedata.org/nmdcschema/read_qc_analysis_activity_set?filter=%7B%22id%22%3A%22nmdc%3Awfrqc-11-t0tvnp52.2%22%7D&max_page_size=20 I get a response body that contains null values (key 'used' for example)

{ "resources": [ { "id": "nmdc:wfrqc-11-t0tvnp52.2", "name": "Read QC Activity for nmdc:wfrqc-11-t0tvnp52.1", "started_at_time": "2024-01-11T20:48:30.718133+00:00", "ended_at_time": "2024-01-11T21:11:44.884260+00:00", "was_informed_by": "nmdc:omprc-11-9mvz7z22", "used": null, "execution_resource": "NERSC-Perlmutter", "git_url": "https://github.com/microbiomedata/ReadsQC", "has_input": [ "nmdc:dobj-11-gpthnj64" ], "type": "nmdc:ReadQcAnalysisActivity", "has_output": [ "nmdc:dobj-11-w5dak635", "nmdc:dobj-11-g6d71n77", "nmdc:dobj-11-bds7qq03" ], "part_of": [ "nmdc:omprc-11-9mvz7z22" ], "version": "v1.0.8", "qc_status": null, "qc_comment": null, "has_failure_categorization": [], "input_read_count": null, "input_base_count": null, "output_read_count": null, "output_base_count": null, "input_read_bases": null, "output_read_bases": null } ] }

This is high priority to fix b/c this is creating invalid records, the response body json does not validate with the json:submit endpoint (see #462) cc @pkalita-lbl @shreddd @ssarrafan @Michal-Babins @mbthornton-lbl @hubin-keio

PeopleMakeCulture commented 4 months ago

Thanks @aclum. I'll get it on our project board

PeopleMakeCulture commented 4 months ago

@aclum I tried to reproduce your submission and am getting a 500 internal server error. Other endpoints on api-dev are not returning 500s. Looking into why I'm not able to reproduce.

image image

PeopleMakeCulture commented 4 months ago

Here is where the post request is defined: https://github.com/microbiomedata/nmdc-runtime/blob/682a53c2d6e453da0e5d5a210ba87afb29be2bf7/nmdc_runtime/api/v1/workflows/activities.py#L26

PeopleMakeCulture commented 4 months ago

Here is where the activity records are getting created and inserted into mongodb: https://github.com/microbiomedata/nmdc-runtime/blob/682a53c2d6e453da0e5d5a210ba87afb29be2bf7/nmdc_runtime/api/v1/workflows/activities.py#L26

I believe this is the code that is inserting records with null values into mongo.

def insert_activities(activities: Database, mdb: MongoDatabase) -> bool:
    """Description."""
    activity_fields = fields(activities)
    for field in activity_fields:
        if activities[field.name]:
            collection = mdb.get_collection(field.name)
            collection.insert_many(
                [
                    json.loads(json.dumps(asdict(activity), default=str))
                    for activity in activities[field.name]
                ]
            )
    return True
PeopleMakeCulture commented 4 months ago

@aclum @shreddd Our plan is to deprecate this v1 endpoint and replace it with a refactored endpoint that uses the logic underlying json:submit.

In the meantime, please use dagster to run a submission job

fysa @dwinston

aclum commented 4 months ago

What is the depreciation plan/timeline? Workflow automation uses this endpoint routinely they'll need time to make updates and using the dagster UI is not desirable.

PeopleMakeCulture commented 4 months ago

@aclum We plan to push a fix this afternoon or latest by tomorrow

aclum commented 4 months ago

Great, thanks so much.

PeopleMakeCulture commented 4 months ago

@aclum we have a minimally-tested working solution on this branch: https://github.com/microbiomedata/nmdc-runtime/tree/478-refactor-v1-workflows-activities-endpoint

There's still some clean-up and more robust testing that I'd like to do in the next few hours, but if it's urgent we can push this into main and deploy to dev

Michal-Babins commented 4 months ago

Feel free to test it and clean it up! We can wait until tomorrow.

PeopleMakeCulture commented 4 months ago

@dwinston I would like to move the relevant code from /nmdc_runtime/api/v1/workflows/activities.py into nmdc_runtime/api/models/workflow.py and delete the entire v1 sub-directory.

I went through the files in v1 and did not find anything that is currently in use. Any objections or concerns for moving the new endpoint from /v1/workflows/activities to /workflows/activities and deprecating (ie deleting) v1?

dwinston commented 4 months ago

@PeopleMakeCulture please ensure POST /v1/workflows/activities works without error -- maybe post a deprecation notice in its docstring. We need to get OKs from @Michal-Babins et al. to ensure transition of their client code, which we don't want to break yet.

dwinston commented 4 months ago

you can ensure this however you feel is prudent, e.g. not having a v1 folder in the repo but still having it in a URL path for the endpoint.

Michal-Babins commented 4 months ago

So in the future, all posts to the DB should go through json:submit?

aclum commented 4 months ago

@Michal-Babins https://github.com/Michal-Babins At some point we'd move to /workflows/activities, although you raise a good point, if these endpoints are redundant perhaps longer term one can be deprecated entirely. For now continue to use /v1/workflows/activities. We can test the fix tomorrow.

On Tue, Feb 13, 2024 at 2:03 PM Michal Babinski @.***> wrote:

So in the future, all posts to the DB should go through json:submit?

— Reply to this email directly, view it on GitHub https://github.com/microbiomedata/nmdc-runtime/issues/478#issuecomment-1942715820, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6RD3YE4MI334ZIBH7MWLDYTPPJNAVCNFSM6AAAAABDFSCT52VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBSG4YTKOBSGA . You are receiving this because you were mentioned.Message ID: @.***>

PeopleMakeCulture commented 4 months ago

Here is the current plan:

Step 1:

Update v1/workflows/activities to use the json_validate utility underlying json:submit. This should not affect downstream users at all. Downstream applications will still call the API endpoint exactly as before, and the API will return the same response as before. This fix is only intended to modify the documents inserted into Mongo such that new documents do not contain null values.

This step will be completed by #479 and merged into api-dev later today.

Step 2:

Create a new endpoint workflows/activities. Currently, this endpoint mimics the behavior of v1/workflows/activities.

We are considering an update to the return request such that successful POST requests return the json values inserted. This would mimic the behavior of endpoints such as POST: /objects or POST: /metadata/changesheets:submit. Feedback will be solicited at weekly check-in.

479 will create a new endpoint that duplicates v1/workflows/activities. A future PR can modify the response object for the new endpoint only if that is deemed useful.

Step 3:

Once all downstream jobs are full migrated to the new workflows/activities endpoint, the v1 endpoint will be removed from the runtime API.

For the time being we will maintain both endpoints. Ideally I would like to push the new endpoint in this release cycle and deprecate v1 endpoints in the next release cycle.

@aclum @Michal-Babins @dwinston does that sound okay?

dwinston commented 4 months ago

The plan sounds good to me. I think we could consolidate to json:submit, assuming we refactor it to make it synchronous (which we can do), but there may be preference among folks to retain a dedicated endpoint with a convenient name.

Michal-Babins commented 4 months ago

Yeah, either is fine with me. I am also happy to meet and go over the use and purpose of this endpoint, and seeing if it's easier to consolidate to json:submit. For us it would be a quick refactor for what endpoint we need to use, so I would go for whatever would be easier to maintain.

PeopleMakeCulture commented 4 months ago

@dwinston @Michal-Babins which endpoints would we consolidate into json:submit? Just workflows/activities or others as well?

dwinston commented 4 months ago

I think that's the only one

PeopleMakeCulture commented 4 months ago

@aclum @Michal-Babins This fix has been pushed to prod. Runtime API is now on version 1.3.1 with the updated /workflows/activities and v1/workflows/activities endpoints.

Release notes in Slack here: https://nmdc-group.slack.com/archives/C063UHW5AJE/p1708011488437899

aclum commented 4 months ago

confirmed fixed on runtime dev API. This should also be in prod as of the 2024.2 release on Monday.