microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/
Other
7 stars 3 forks source link

`berkeley`: Function `ensure_schema_collections_and_alldocs` is incompatible with Berkeley schema #658

Closed aclum closed 2 months ago

aclum commented 2 months ago

This came up in a PR where Patrick was working on updating ETL code. there are errors about associated_studies being missing from biosample_set, upon further inspection it is trying to compare prod mongo, I inferred this based on it pulling in collection names which don't exist in berkeley like omics_processing_set, with the berkeley nmdc-schema release candidate release

We need an alldocs in mongo berkeley and for this test to point to that.

cc @dwinston @eecavanna @pkalita-lbl

aclum commented 2 months ago

hoping Donny or Eric can look at this in @PeopleMakeCulture absence. cc @shreddd

eecavanna commented 2 months ago

Here's an excerpt from the logs of the GitHub Actions workflow run that @aclum linked to above (thanks!):

Show/hide excerpt ```py wait-for-it.sh: fastapi:8000 is available after 0 seconds /usr/local/lib/python3.10/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) ============================= test session starts ============================== platform linux -- Python 3.10.14, pytest-8.3.2, pluggy-1.5.0 rootdir: /code plugins: anyio-4.4.0, asyncio-0.24.0, requests-mock-1.12.1, cov-5.0.0, mock-3.14.0 asyncio: mode=strict, default_loop_scope=None collected 109 items demo/metadata_migration/notebooks/test_helpers.py . [ 0%] tests/e2e/test_minter_api.py xxxx [ 4%] tests/integration/test_minter_repository.py ....X... [ 11%] tests/test_api/test_db_mongo.py ..... [ 16%] tests/test_api/test_endpoints.py ssE ==================================== ERRORS ==================================== _______________ ERROR at setup of test_metadata_validate_json_0 ________________ @pytest.fixture def api_site_client(): mdb = get_mongo_db() > rs = ensure_test_resources(mdb) tests/test_api/test_endpoints.py:204: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ tests/test_api/test_endpoints.py:121: in ensure_test_resources ensure_schema_collections_and_alldocs() tests/test_api/test_endpoints.py:67: in ensure_schema_collections_and_alldocs materialize_alldocs( /usr/local/lib/python3.10/site-packages/dagster/_core/definitions/op_definition.py:457: in __call__ return direct_invocation_result(self, *args, **kwargs) /usr/local/lib/python3.10/site-packages/dagster/_core/definitions/op_invocation.py:217: in direct_invocation_result result = invoke_compute_fn( /usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py:129: in invoke_compute_fn return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass) nmdc_runtime/site/ops.py:1025: in materialize_alldocs raise e nmdc_runtime/site/ops.py:1018: in materialize_alldocs nmdcdb = NMDCDatabase( :21: in __init__ ??? /usr/local/lib/python3.10/site-packages/nmdc_schema/nmdc.py:440: in __post_init__ self._normalize_inlined_as_list(slot_name="biosample_set", slot_type=Biosample, key_name="id", keyed=True) /usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:105: in _normalize_inlined_as_list self._normalize_inlined(slot_name, slot_type, key_name, keyed, True) /usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:190: in _normalize_inlined cooked_obj = slot_type(**as_dict(list_entry)) :589: in __init__ ??? /usr/local/lib/python3.10/site-packages/nmdc_schema/nmdc.py:2120: in __post_init__ self.MissingRequiredField("associated_studies") _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = Biosample(id='nmdc:bsm-13-amrnys72', type='nmdc:Biosample', name='Sand microcosm microbial communities from a hyporhei...ample_type=None, technical_reps=None, analysis_type=[], sample_link=[], bulk_elect_conductivity=None, infiltrations=[]) field_name = 'associated_studies' def MissingRequiredField(self, field_name: str) -> None: """ Generic loader error handler """ > raise ValueError(f"{field_name} must be supplied") E ValueError: associated_studies must be supplied /usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:281: ValueError ---------------------------- Captured stdout setup ----------------------------- local mongodump already exists at /code/tests/nmdcdb ```

For reference, that PR's base branch is the berkeley branch.

eecavanna commented 2 months ago

The error occurred during the "setup" phase of the following test:

test_metadata_validate_json_0 https://github.com/microbiomedata/nmdc-runtime/blob/2c300cc7727416847aeae08e5e4802d32f275962/tests/test_api/test_endpoints.py#L208-L220

Specifically, it occurred while the following fixture was being built:

@pytest.fixture
def api_site_client():
    mdb = get_mongo_db()
    rs = ensure_test_resources(mdb)  # <-- the error occurred here
    return RuntimeApiSiteClient(base_url=os.getenv("API_HOST"), **rs["site_client"])

ensure_test_resources is defined here. Among other things, it calls a function named ensure_schema_collections_and_alldocs, which is defined here.

eecavanna commented 2 months ago

Here's an excerpt from the ensure_schema_collections_and_alldocs function:

    dump_dir = ensure_local_mongodump_exists()
    mongorestore_from_dir(mdb, dump_dir, skip_collections=["functional_annotation_agg"])

It calls a function named ensure_local_mongodump_exists (side note: I prefer the inclusion of this _exists suffix, which says what about it someone is ensuring).

That function (and the constant it accesses) is defined here:

https://github.com/microbiomedata/nmdc-runtime/blob/2c300cc7727416847aeae08e5e4802d32f275962/tests/test_api/test_endpoints.py#L31-L51

Looks like that function downloads a pre-made database dump from the NERSC filesystem.

eecavanna commented 2 months ago

The value of SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME is the filename of a dump of the production database, specifically. I think this is consistent with what @aclum reported in the issue description.

eecavanna commented 2 months ago

In other words, I think this couples the test suite to the specific schema that happened to be in effect when that specific dump was generated.

I think this will be insufficient both for (a) the legacy-to-Berkeley roll out we are working on now; and (b) changes to certain aspects of the schema in the future.

eecavanna commented 2 months ago

The "coupling" (as I referred to it in the previous message) was introduced in this commit. I will assign to @dwinston to take a pass at resolving it.

eecavanna commented 2 months ago

During today's infrastructure meeting, we discussed the possibility of disabling this problematic code (which may involve disabling some tests) in order to unblock some current work; and developers can work on implementing a long term fix in parallel. I'll look into this with that path forward in mind today.

eecavanna commented 2 months ago

I created a PR in which I disabled the problematic code. The PR is here: https://github.com/microbiomedata/nmdc-runtime/pull/664