microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

`berkeley`: Dagster does not insert data into Mongo despite `json:submit` returning success response #595

Closed aclum closed 1 month ago

aclum commented 1 month ago

Describe the bug When I run a test `json:submit` job, it submits successfully (i.e., I get this response body):

{
  "type": "success",
  "detail": {
    "run_id": "nmdc:sys0mb9dfb31"
  }
}

However, the submission never gets entered into Mongo. When I check Dagit, I find an error about the client ID. I believe this needs to be fixed in the Runtime; normally, if my client ID were incorrect, I wouldn't get a success response body.

To Reproduce Steps to reproduce the behavior:

  1. Submit the test request body below to the `json:submit` endpoint via the Swagger UI:
 {
  "workflow_execution_set": [
    {
      "id": "nmdc:wfmgas-99-B7Vogx.2",
      "type": "nmdc:MetagenomeAssembly",
      "was_informed_by": "nmdc:omprc-12-x123fa",
      "name": "Metagenome assembly for nmdc:omprc-12-x123fa",
      "started_at_time": "2020-03-24T00:00:00+00:00",
      "ended_at_time": "2020-03-25T00:00:00+00:00",
      "execution_resource": "LANL-B-div",
      "git_url": "https://github.com/microbiomedata/metaAssembly/releases/tag/1.0.0",
      "has_input": [
        "nmdc:dobj-11-547rwq84"
      ],
      "has_output": [
        "nmdc:dobj-11-547rwq75",
        "nmdc:dobj-11-547rwq76",
        "nmdc:dobj-11-547rwq77",
        "nmdc:dobj-11-547rwq78",
        "nmdc:dobj-11-547rwq79"
      ],
      "scaffolds": 429340,
      "contigs": 429340,
      "scaf_bp": 192123121,
      "contig_bp": 192123121,
      "gap_pct": 0,
      "scaf_n50": 132307,
      "scaf_l50": 433,
      "ctg_n50": 132307,
      "ctg_l50": 433,
      "scaf_n90": 357156,
      "scaf_l90": 288,
      "ctg_n90": 357156,
      "ctg_l90": 288,
      "scaf_logsum": 303893,
      "scaf_powsum": 32467,
      "ctg_logsum": 303893,
      "ctg_powsum": 32467,
      "asm_score": 3.29,
      "scaf_max": 17245,
      "ctg_max": 17245,
      "scaf_n_gt50k": 0,
      "scaf_l_gt50k": 0,
      "scaf_pct_gt50k": 0,
      "gc_avg": 0.55402,
      "gc_std": 0.09822,
      "num_input_reads": 87803950,
      "num_aligned_reads": 63046103
    }
  ]
}
  2. Check Dagit to confirm the Dagster `apply_metadata_in` job was successful. The error came from this run: https://dagit-berkeley.microbiomedata.org/runs/5dff9e63-c445-4150-add0-0852d0807e9d

Expected behavior If successful, the query below would return 1 record.

curl -X 'GET' \
  'https://api-berkeley.microbiomedata.org/nmdcschema/workflow_execution_set?filter=%7B%22id%22%3A%22nmdc%3Awfmgas-99-B7Vogx.2%22%7D&max_page_size=20' \
  -H 'accept: application/json'

Acceptance Criteria: A `json:submit` submission made via https://api-berkeley.microbiomedata.org/docs can, a few minutes later, be found via the https://api-berkeley.microbiomedata.org/nmdcschema endpoint.
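That acceptance criterion could be checked with a small polling script like the sketch below (standard library only). Note this is a hypothetical helper, not part of the Runtime; the `resources` response key is an assumption drawn from the shape of other Runtime list endpoints.

```python
# Poll the public /nmdcschema/workflow_execution_set endpoint until the
# submitted record appears, or a timeout elapses. Hypothetical sketch.
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "https://api-berkeley.microbiomedata.org"


def build_filter_url(record_id: str, max_page_size: int = 20) -> str:
    """Build the same filtered GET URL as the curl command above."""
    filter_param = urllib.parse.quote(
        json.dumps({"id": record_id}, separators=(",", ":"))
    )
    return (
        f"{BASE_URL}/nmdcschema/workflow_execution_set"
        f"?filter={filter_param}&max_page_size={max_page_size}"
    )


def wait_for_record(record_id: str, timeout_s: float = 300, interval_s: float = 15):
    """Return the record once it shows up in Mongo, or None on timeout."""
    url = build_filter_url(record_id)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        if body.get("resources"):  # assumed key; adjust to the real schema
            return body["resources"][0]
        time.sleep(interval_s)
    return None
```

For example, `wait_for_record("nmdc:wfmgas-99-B7Vogx.2")` returning a non-`None` value would satisfy the criterion above.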

eecavanna commented 1 month ago

Until a few minutes ago, Dagster in the Berkeley environment was using an incorrect username to access Mongo: it was using the username `root` together with the password of a non-root account (using that password was intentional; using `root` was not).

Now that that has been resolved, I expect all errors stemming from Dagster's inability to access the Mongo database to also be resolved.

However, I don't know that this particular issue stems from an inability to access the Mongo database.

Edit: I confirmed the token-related issue persists.

eecavanna commented 1 month ago

Here's a copy/paste of the stack trace shown on Dagit:

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 468, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 141, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 129, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 451, in get_json_in
    rv = client.get_object_bytes(object_id)
  File "/opt/dagster/lib/nmdc_runtime/site/resources.py", line 225, in get_object_bytes
    obj = DrsObject(**self.get_object_info(object_id).json())
  File "/opt/dagster/lib/nmdc_runtime/site/resources.py", line 219, in get_object_info
    return self.request("GET", f"/objects/{object_id}")
  File "/opt/dagster/lib/nmdc_runtime/site/resources.py", line 60, in request
    self.ensure_token()
  File "/opt/dagster/lib/nmdc_runtime/site/resources.py", line 43, in ensure_token
    self.get_token()
  File "/opt/dagster/lib/nmdc_runtime/site/resources.py", line 52, in get_token
    raise Exception(f"Getting token failed: {self.token_response}")

Here's the offending code:

    def get_token(self):
        rv = requests.post(self.base_url + "/token", data=self.get_token_request_body())
        self.token_response = rv.json()
        if "access_token" not in self.token_response:
            raise Exception(f"Getting token failed: {self.token_response}")
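As an illustration of how that check could fail faster and more descriptively, here is a hypothetical helper (`check_token_response` is made up for this sketch, not part of nmdc-runtime) that surfaces the server's `detail` message instead of the whole response body:

```python
# Hypothetical fail-fast variant of the token check quoted above: return the
# access token when present, otherwise raise with the server's `detail`
# message (e.g. "Incorrect client_id or client_secret") for easier triage.
def check_token_response(token_response: dict) -> str:
    """Extract the access token or raise a descriptive error."""
    if "access_token" in token_response:
        return token_response["access_token"]
    # Fall back to the full response if no `detail` field is present.
    detail = token_response.get("detail", token_response)
    raise RuntimeError(f"Getting token failed: {detail}")
```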

eecavanna commented 1 month ago

I confirmed that the Dagster container can access the Runtime container via the URL in the ConfigMap Dagster is using (i.e. the URL, http://runtime-api:8000).

root@dagster-daemon-684f644dd5-rt9k9:/opt/dagster/dagster_home# curl http://runtime-api:8000/version ; echo "" ;
{"nmdc-runtime":"1.7.1.dev43+g19dd7dc","fastapi":"0.111.0","nmdc-schema":"11.0.0rc16"}
eecavanna commented 1 month ago

I'm wrapping things up before I go OOO for the next few days (back next Wednesday). I'll transfer ownership to @dwinston and @PeopleMakeCulture and then check back in on this issue when I'm back in the office.

eecavanna commented 1 month ago

@brynnz22 reported the same symptom in https://github.com/microbiomedata/issues/issues/750#issuecomment-2258978548. The specific error message she cited was:

Exception: Getting token failed: {'detail': 'Incorrect client_id or client_secret'}

dwinston commented 1 month ago

@eecavanna On inspecting the Berkeley Mongo database, I noticed that the `sites` collection appears to have been copied over. I therefore suspect that the `API_SITE_ID` site (maintained as "nmdc-runtime" in the Berkeley environment's Rancher ConfigMap) was found to already be present on startup, and so was not recreated with the new `API_SITE_CLIENT_SECRET` value supplied (by you?) as a Rancher secret: https://github.com/microbiomedata/nmdc-runtime/blob/berkeley/nmdc_runtime/api/main.py#L323
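That hypothesis can be sketched as a create-if-absent pattern. The code below is a simplified, assumed model of the linked startup logic (a plain dict stands in for the Mongo `sites` collection; the function and field names are illustrative only):

```python
# Simplified model of why a copied-over `sites` collection keeps its old
# client secret: the site document is only created, with the env-supplied
# secret, when no document with that id already exists.
def ensure_api_site(sites: dict, site_id: str, client_secret: str) -> dict:
    """Create the site document only if it is absent; return the live doc."""
    existing = sites.get(site_id)
    if existing is not None:
        # Copied-over document wins: the new API_SITE_CLIENT_SECRET is ignored.
        return existing
    doc = {"id": site_id, "client_secret": client_secret}
    sites[site_id] = doc
    return doc
```

Under this model, a database restored from a copy retains the old secret, so the env var and the stored secret silently diverge until one of them is reset.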

My quick fix was to reset `API_SITE_CLIENT_SECRET` to match the copied-over runtime-api site document, and to redeploy the dagster-daemon and dagster-dagit workloads in Rancher to pick up the new value of the env var. I then re-executed a recent `apply_metadata_in` job via https://dagit-berkeley.microbiomedata.org and it succeeded.

eecavanna commented 1 month ago

Thanks, @dwinston!

Yes, the sites collection (as well as all other collections) was copied over (although the ones described by the schema were "migrated" between the initial "copy" step and the final "paste" step).

The fix makes sense to me.

I will check whether I have notes about those environment variables from when I set up this environment. I'm surprised this issue arose with this Runtime instance, while not arising with other Runtime instances I've set up.

aclum commented 1 month ago

Confirmed fixed.