microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

`berkeley`: Dagster job `ensure_alldocs` fails with `AssertionError` #690

Closed by eecavanna 1 month ago

eecavanna commented 1 month ago

Today, I visited the Dagit instance in the Berkeley environment (`nmdc-berkeley` namespace on Spin) and tried running the `ensure_alldocs` job.

While the `materialize_alldocs` op was running, an error occurred. Here's a screenshot of the error message, followed by a copy/paste of the same error message:

*(screenshot of the error message)*

Copy/pasted error message:

```py
dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "materialize_alldocs":

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 247, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 500, in core_dagster_event_sequence_for_step
    for user_event in _step_output_error_checked_user_event_sequence(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 184, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 88, in _process_asset_results_to_events
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 198, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn, compute_context):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 167, in _yield_compute_results
    for event in iterate_with_context(
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 476, in iterate_with_context
    with context_fn():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 84, in op_execution_error_boundary
    raise error_cls(

The above exception was caused by the following exception:
AssertionError: configuration_set collection has class name of ['ChromatographyConfiguration', 'MassSpectrometryConfiguration'] and len 2

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 478, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 141, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 129, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 1027, in materialize_alldocs
    len(collection_name_to_class_names[name]) == 1
```
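For context, a minimal sketch of the shape of the check that fails at the bottom of that traceback (the function name and mapping below are illustrative, not the actual Runtime code): the op assumes each Mongo collection holds documents of exactly one schema class, and asserts on that.

```python
# Hypothetical reconstruction of the failing check in materialize_alldocs.
# Mapping of Mongo collection names to the schema classes whose instances
# they may contain. Historically each collection held exactly one class.
collection_name_to_class_names = {
    "study_set": ["Study"],
    # After the Berkeley schema refactor, one collection can hold documents
    # of several classes, which breaks the 1:1 assumption:
    "configuration_set": [
        "ChromatographyConfiguration",
        "MassSpectrometryConfiguration",
    ],
}

def check_one_class_per_collection(mapping):
    """Raise AssertionError for any collection mapped to more than one class."""
    for name, class_names in mapping.items():
        assert len(class_names) == 1, (
            f"{name} collection has class name of {class_names} "
            f"and len {len(class_names)}"
        )

try:
    check_one_class_per_collection(collection_name_to_class_names)
except AssertionError as exc:
    print(exc)  # same shape as the AssertionError in the traceback above
```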

I think this was the first time that job had been run in the Berkeley environment. That is based upon what I see here, on the "Runs" page of Dagit:

*(screenshot of the "Runs" page)*

Task

The task here is to make it so the `ensure_alldocs` job runs successfully and an `alldocs` collection exists in the Berkeley Mongo database.

eecavanna commented 1 month ago

This issue is causing the following downstream issue: https://github.com/microbiomedata/nmdc-runtime/issues/689

aclum commented 1 month ago

There is no longer a 1:1 mapping between collection names and class names. Since `type` is now universally enforced on documents, we reduced the total number of collections. For example, the first level of children of PlannedProcess all have a corresponding collection (see https://microbiomedata.github.io/berkeley-schema-fy24/PlannedProcess/), so MaterialProcessing has a corresponding `material_processing_set`, etc.
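The naming convention described above can be sketched as follows (a hedged illustration of the convention, not code from the schema tooling): a first-level subclass gets a collection named by snake_casing the class name and appending `_set`.

```python
import re

def class_to_collection_name(class_name: str) -> str:
    """Illustrative helper: CamelCase class name -> snake_case collection name.

    E.g. a first-level subclass of PlannedProcess such as MaterialProcessing
    maps to the "material_processing_set" collection.
    """
    # Insert "_" before each interior uppercase letter, then lowercase.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return f"{snake}_set"

print(class_to_collection_name("MaterialProcessing"))  # → material_processing_set
```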

This is high priority because it backs production endpoints and is needed for the NCBI export code.

eecavanna commented 1 month ago

I think @PeopleMakeCulture, @sujaypatil96, and I have a solid plan for fixing this. It will make the generation of `alldocs` take several minutes longer, because it involves processing documents one by one instead of treating every document in a collection as though it has the same class hierarchy. I plan to prototype the fix later today.
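The per-document approach described above might look roughly like this (all names are illustrative and the class tree is a tiny made-up subset; real code would use pymongo and the NMDC schema): derive each document's class ancestry from its own `type` field rather than from the collection it lives in.

```python
# Hypothetical subclass map standing in for the NMDC schema's class tree.
PARENT_OF = {
    "ChromatographyConfiguration": "Configuration",
    "MassSpectrometryConfiguration": "Configuration",
}

def ancestry(type_name: str, parent_of: dict) -> list:
    """Walk from a document's class up to its root superclass."""
    chain = [type_name]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain

def materialize_alldocs(collections: dict, parent_of: dict) -> list:
    """Flatten collections into one alldocs-style list, tagging each document
    with the class ancestry derived from its own "type" field."""
    alldocs = []
    for docs in collections.values():
        for doc in docs:  # per-document: slower, but correct for mixed collections
            cls = doc["type"].removeprefix("nmdc:")  # strip the CURIE prefix
            alldocs.append({**doc, "type": ancestry(cls, parent_of)})
    return alldocs

# Plain dicts stand in for Mongo collections here.
collections = {
    "configuration_set": [
        {"id": "nmdc:cfg-1", "type": "nmdc:ChromatographyConfiguration"},
        {"id": "nmdc:cfg-2", "type": "nmdc:MassSpectrometryConfiguration"},
    ],
}
for d in materialize_alldocs(collections, PARENT_OF):
    print(d["id"], d["type"])
```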

eecavanna commented 1 month ago

A fix is ready for review in https://github.com/microbiomedata/nmdc-runtime/pull/694.