microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Error `KeyError: '$ref'` occurred on Dagster after using `/metadata/json:submit` endpoint #378

Closed. eecavanna closed this issue 8 months ago.

eecavanna commented 8 months ago

@brynnz22 submitted a JSON file to the /metadata/json:submit endpoint.

The corresponding run on Dagit failed:

[screenshot of the failed run in Dagit]

Dagit shows the following error (here is a full stack trace):

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "perform_mongo_updates":

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 275, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 476, in core_dagster_event_sequence_for_step
    for user_event in _step_output_error_checked_user_event_sequence(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 159, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 94, in _process_asset_results_to_events
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 204, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 173, in _yield_compute_results
    for event in iterate_with_context(
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 459, in iterate_with_context
    with context_fn():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 84, in op_execution_error_boundary
    raise error_cls(

The above exception was caused by the following exception:
KeyError: '$ref'

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 461, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 131, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 125, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 540, in perform_mongo_updates
    docs, _ = specialize_activity_set_docs(docs)
  File "/opt/dagster/lib/nmdc_runtime/util.py", line 289, in specialize_activity_set_docs
    type_collections = get_type_collections()
  File "/opt/dagster/lib/nmdc_runtime/util.py", line 32, in get_type_collections
    return {
  File "/opt/dagster/lib/nmdc_runtime/util.py", line 33, in <dictcomp>
    f'nmdc:{spec["items"]["$ref"].split("/")[-1]}': collection_name

Here's what the offending line of code contains:

https://github.com/microbiomedata/nmdc-runtime/blob/1eacc43922104be514d626ac4831114f4378d2e1/nmdc_runtime/util.py#L30-L36

nmdc_jsonschema is defined further down the file:

https://github.com/microbiomedata/nmdc-runtime/blob/1eacc43922104be514d626ac4831114f4378d2e1/nmdc_runtime/util.py#L68
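
For reference, the failure can be reproduced without Dagster. This is only a minimal sketch: `properties` is a tiny stand-in for `nmdc_jsonschema["properties"]` (not the real schema), `build_type_collections` is a hypothetical name, and the comprehension is reconstructed from the trace above rather than copied from `util.py`.

```python
# Stand-in for nmdc_jsonschema["properties"]; NOT the real schema.
properties = {
    "study_set": {"items": {"$ref": "#/$defs/Study"}, "type": "array"},
    "pooling_set": {
        "items": {"anyOf": [{"$ref": "#/$defs/Pooling"}]},
        "type": "array",
    },
}


def build_type_collections(props):
    # Same key expression as the line shown in the trace:
    # it assumes every `items` dict has a top-level "$ref".
    return {
        f'nmdc:{spec["items"]["$ref"].split("/")[-1]}': collection_name
        for collection_name, spec in props.items()
    }


try:
    build_type_collections(properties)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('$ref')
```

With only `study_set` in the stand-in, the same function returns `{"nmdc:Study": "study_set"}`, so the `anyOf` entry is what triggers the error.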

CC: @dwinston

eecavanna commented 8 months ago

I think I found the next level of the issue. In an (unrelated) Python notebook, I ran:

from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict  # nmdc_schema v9.1.0

nmdc_jsonschema = get_nmdc_jsonschema_dict()
for collection_name, spec in nmdc_jsonschema["properties"].items():
    print(collection_name, type(spec), spec["items"])


Not all of the items dictionaries have a $ref key at the top level.

Here's the full output of that notebook cell:

activity_set <class 'dict'> {'$ref': '#/$defs/WorkflowExecutionActivity'}
biosample_set <class 'dict'> {'$ref': '#/$defs/Biosample'}
collecting_biosamples_from_site_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/CollectingBiosamplesFromSite'}]}
data_object_set <class 'dict'> {'$ref': '#/$defs/DataObject'}
dissolving_activity_set <class 'dict'> {'$ref': '#/$defs/DissolvingActivity'}
extraction_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/Extraction'}]}
field_research_site_set <class 'dict'> {'$ref': '#/$defs/FieldResearchSite'}
functional_annotation_agg <class 'dict'> {'$ref': '#/$defs/FunctionalAnnotationAggMember'}
functional_annotation_set <class 'dict'> {'$ref': '#/$defs/FunctionalAnnotation'}
genome_feature_set <class 'dict'> {'$ref': '#/$defs/GenomeFeature'}
library_preparation_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/LibraryPreparation'}]}
mags_activity_set <class 'dict'> {'$ref': '#/$defs/MagsAnalysisActivity'}
material_sample_set <class 'dict'> {'$ref': '#/$defs/MaterialSample'}
material_sampling_activity_set <class 'dict'> {'$ref': '#/$defs/MaterialSamplingActivity'}
metabolomics_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/MetabolomicsAnalysisActivity'}
metagenome_annotation_activity_set <class 'dict'> {'$ref': '#/$defs/MetagenomeAnnotationActivity'}
metagenome_assembly_set <class 'dict'> {'$ref': '#/$defs/MetagenomeAssembly'}
metagenome_sequencing_activity_set <class 'dict'> {'$ref': '#/$defs/MetagenomeSequencingActivity'}
metaproteomics_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/MetaproteomicsAnalysisActivity'}
metatranscriptome_activity_set <class 'dict'> {'$ref': '#/$defs/MetatranscriptomeActivity'}
nom_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/NomAnalysisActivity'}
omics_processing_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/OmicsProcessing'}]}
planned_process_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/CollectingBiosamplesFromSite'}, {'$ref': '#/$defs/BiosampleProcessing'}, {'$ref': '#/$defs/SubSamplingProcess'}, {'$ref': '#/$defs/OmicsProcessing'}, {'$ref': '#/$defs/Pooling'}, {'$ref': '#/$defs/Extraction'}, {'$ref': '#/$defs/LibraryPreparation'}]}
pooling_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/Pooling'}]}
processed_sample_set <class 'dict'> {'$ref': '#/$defs/ProcessedSample'}
reaction_activity_set <class 'dict'> {'$ref': '#/$defs/ReactionActivity'}
read_based_taxonomy_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/ReadBasedTaxonomyAnalysisActivity'}
read_qc_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/ReadQcAnalysisActivity'}
study_set <class 'dict'> {'$ref': '#/$defs/Study'}

dwinston commented 8 months ago

right. from nmdc-schema==9.1.0's nmdc_materialized_patterns.schema.json,

for collection_name, spec in nmdc_jsonschema["properties"].items():
    if collection_name.endswith("_set"):
        if not spec["items"].get("$ref"):
            print(collection_name)

Output:

collecting_biosamples_from_site_set
extraction_set
library_preparation_set
omics_processing_set
planned_process_set
pooling_set

and by inspection, this is because of the new-to-me use of anyOf in the items of array-typed properties, i.e. I had only coded for the $ref case (e.g. processed_sample_set below), not the anyOf case (e.g. pooling_set):

"pooling_set": {
    "items": {
        "anyOf": [
            {
                "$ref": "#/$defs/Pooling"
            }
        ]
    },
    "type": "array"
},
"processed_sample_set": {
    "description": "This property links a database object to the set of processed samples within it.",
    "items": {
        "$ref": "#/$defs/ProcessedSample"
    },
    "type": "array"
},

dwinston commented 8 months ago

looks like there are 3 uses of $ref in nmdc_runtime.util, and one in nmdc_runtime.api.endpoints.find, that need to be reworked (ideally routed through a new common helper function) to deal with the anyOf case.
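
A common helper along those lines might look roughly like the sketch below. The name `get_class_names_from_items` and the `prefix` parameter are hypothetical (not existing `nmdc_runtime` API); the only assumption about the schema is that `items` is either a single `{"$ref": ...}` or an `{"anyOf": [...]}` list of such references, as in the two examples above.

```python
from typing import Any, Dict, List


def get_class_names_from_items(items: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return the class names referenced by an array property's `items` spec.

    Handles both shapes seen in the schema:
      {"$ref": "#/$defs/ProcessedSample"}
      {"anyOf": [{"$ref": "#/$defs/Pooling"}, ...]}
    """
    if "$ref" in items:
        refs = [items["$ref"]]
    elif isinstance(items.get("anyOf"), list):
        # Skip any option that is not a plain reference.
        refs = [option["$ref"] for option in items["anyOf"] if "$ref" in option]
    else:
        refs = []
    # "#/$defs/Pooling" -> "Pooling", optionally prefixed (e.g. "nmdc:Pooling").
    return [prefix + ref.split("/")[-1] for ref in refs]


single = get_class_names_from_items({"$ref": "#/$defs/ProcessedSample"}, prefix="nmdc:")
multi = get_class_names_from_items({"anyOf": [{"$ref": "#/$defs/Pooling"}]}, prefix="nmdc:")
print(single)  # ['nmdc:ProcessedSample']
print(multi)   # ['nmdc:Pooling']
```

Each call site that currently reads `spec["items"]["$ref"]` directly could then iterate over the returned list instead.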

eecavanna commented 8 months ago

I drafted this "fixed" function:

from functools import lru_cache


@lru_cache
def get_type_collections():
    """Returns a dictionary mapping class names to Mongo collection names"""
    mappings = {}

    def get_class_name_from_ref(s: str):
        # Keep the "nmdc:" prefix that the original dict comprehension put on each key.
        return f'nmdc:{s.split("/")[-1]}'

    for collection_name, spec in nmdc_jsonschema["properties"].items():
        if collection_name.endswith("_set"):
            items = spec["items"]
            if "$ref" in items:
                # Single-class case, e.g. processed_sample_set.
                ref = items["$ref"]
                class_name = get_class_name_from_ref(ref)
                mappings[class_name] = collection_name
            elif "anyOf" in items and isinstance(items["anyOf"], list):
                # anyOf case, e.g. pooling_set or planned_process_set.
                for item in items["anyOf"]:
                    ref = item["$ref"]
                    class_name = get_class_name_from_ref(ref)
                    mappings[class_name] = collection_name

    return mappings
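
One thing that may be worth checking in the draft above: in the notebook output, `planned_process_set` lists classes that also have their own collection (e.g. `Pooling`, `Extraction`), so `mappings[class_name] = collection_name` keeps only whichever collection is visited last. A rough sketch of how to list such shared classes, using a hand-written stand-in for three of the schema entries (not the real `nmdc_jsonschema`; `collections_by_class` and `shared` are illustrative names):

```python
from collections import defaultdict

# Stand-in for three entries of nmdc_jsonschema["properties"], abbreviated.
properties = {
    "extraction_set": {"items": {"anyOf": [{"$ref": "#/$defs/Extraction"}]}},
    "planned_process_set": {
        "items": {
            "anyOf": [
                {"$ref": "#/$defs/Extraction"},
                {"$ref": "#/$defs/Pooling"},
            ]
        }
    },
    "pooling_set": {"items": {"anyOf": [{"$ref": "#/$defs/Pooling"}]}},
}

# Collect every collection that references each class.
collections_by_class = defaultdict(list)
for collection_name, spec in properties.items():
    items = spec["items"]
    options = [items] if "$ref" in items else items.get("anyOf", [])
    for option in options:
        class_name = option["$ref"].split("/")[-1]
        collections_by_class[class_name].append(collection_name)

# Classes referenced by more than one collection.
shared = {name: cols for name, cols in collections_by_class.items() if len(cols) > 1}
print(shared)
# {'Extraction': ['extraction_set', 'planned_process_set'], 'Pooling': ['planned_process_set', 'pooling_set']}
```

Whether the last-one-wins behavior matters depends on how callers use the mapping, which this thread does not show.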

dwinston commented 8 months ago

@eecavanna do you want to take a stab at this with a PR?

eecavanna commented 8 months ago

Sure, I'll open it in a couple mins.

eecavanna commented 8 months ago

Heads up: Looks like there is more code in that file that will be affected by the presence of anyOf.

https://github.com/microbiomedata/nmdc-runtime/blob/1eacc43922104be514d626ac4831114f4378d2e1/nmdc_runtime/util.py#L315-L321

Oops! You already pointed this out, in general, here.

eecavanna commented 8 months ago

Here's the JSON content that was submitted to the /metadata/json:submit endpoint when the initial error occurred:

_Note: There are a couple `"description"` values that GitHub is not color-coding. I think it's only because (hypothetically) the lines are too long for GitHub's syntax highlighter._

```json
{
  "study_set": [
    {
      "id": "nmdc:sty-11-2zhqs261",
      "description": "Climate change, extreme weather, land-use change, and other perturbations are significantly reshaping interactions among the vegetation, soil, fluvial, and subsurface compartments of watersheds throughout the world. Watersheds are recognized as Earth's key functional unit for managing water resources, but their hydrological interactions also mediate biogeochemical processes that support all terrestrial life. These complex interactions, which occur within a heterogeneous landscape can lead to a cascade of effects on downstream water availability, nutrient and metal loading, and carbon cycling. Despite significant implications for energy production, agriculture, water quality, and other societal benefits important to U.S. Department of Energy (DOE) energy and environmental missions, uncertainty associated with predicting watershed function and dynamics remains high. To address this uncertainty, the Watershed Function Scientific Focus Area (SFA) is developing a predictive understanding of how mountainous watersheds retain and release water, nutrients, carbon, and metals. In particular, the SFA is developing understanding and tools to measure and predict how droughts, early snowmelt, and other perturbations impact downstream water availability and biogeochemical cycling at episodic to decadal timescales.",
      "funding_sources": [
        "The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contract No. DE-AC02-05CH11231"
      ],
      "title": "Watershed Function Scientific Focus Area",
      "websites": [
        "https://watershed.lbl.gov/"
      ],
      "study_category": "research_study"
    },
    {
      "id": "nmdc:sty-11-xcbexm97",
      "description": "The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) is a research consortium that aims to understand coupled hydrologic, biogeochemical, and microbial function within river corridors, with an emphasis on increasing accessibility of resources and knowledge throughout the research life cycle. WHONDRS seeks to galvanize a global community around understanding these coupled systems from local to global scales and ultimately to provide the scientific basis for improved management of dynamic river corridors throughout the world.",
      "funding_sources": [
        "The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contract DE-AC05-76RL01830. WHONDRS is part of PNNL's River Corridor Hydrobiogeochemistry SFA."
      ],
      "title": "WHONDRS",
      "websites": [
        "https://www.pnnl.gov/projects/WHONDRS"
      ],
      "study_category": "consortium"
    },
    {
      "id": "nmdc:sty-11-x4aawf73",
      "description": "The Pacific Northwest National Laboratory (PNNL) River Corridor Hydrobiogeochemistry Scientific Focus Area (SFA) works to transform understanding of spatial and temporal dynamics in river corridor hydrobiogeochemical functions from molecular reaction to watershed and basin scales. The knowledge we gain is used to improve mechanistic representation of river corridor processes, and their response to disturbances, in multiscale models of integrated hydrobiogeochemical function.",
      "funding_sources": [
        "The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contract DE-AC05-76RL01830"
      ],
      "title": "River Corridors Scientific Focus Area",
      "websites": [
        "https://www.pnnl.gov/projects/river-corridor"
      ],
      "study_category": "research_study"
    },
    {
      "id": "nmdc:sty-11-f1he1955",
      "description": "The goal of the Plant-Microbe Interfaces SFA is to gain a deeper understanding of the diversity and functioning of mutually beneficial interactions between plants and microbes in the rhizosphere. The plant-microbe interface is the boundary across which a plant senses, interacts with, and may alter its associated biotic and abiotic environments. Understanding the exchange of energy, information, and materials across the plant-microbe interface at diverse spatial and temporal scales is our ultimate objective. Our ongoing efforts focus on characterizing and interpreting such interfaces using systems comprising plants and microbes, in particular the poplar tree (Populus) and its microbial community in the context of favorable plant microbe interactions. We seek to define the relationships among these organisms in natural settings, dissect the molecular signals and gene-level responses of the organisms using natural and model systems, and interpret this information using advanced computational tools.",
      "funding_sources": [
        "The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contract DE-AC05-00OR22725."
      ],
      "title": "Plant-Microbe Interfaces Scientific Focus Area",
      "websites": [
        "https://pmiweb.ornl.gov/"
      ],
      "study_category": "research_study"
    },
    {
      "id": "nmdc:sty-11-cytnjc39",
      "description": "The Terrestrial Ecosystem Science SFA supports research to understand and predict the interaction of Earth's terrestrial ecosystems and climate, and to assess vulnerability of terrestrial ecosystems to projected environmental change. The research focuses on how terrestrial ecosystems affect atmospheric CO2 and other greenhouse gases (e.g., CH4) and how the responsible ecosystem processes interact with climate and with anthropogenic forcing factors. ",
      "funding_sources": [
        "The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contract DE-AC05-00OR22725."
      ],
      "title": "Terrestrial Ecosystem Scientific Focus Area",
      "websites": [
        "https://tes-sfa.ornl.gov/"
      ],
      "study_category": "research_study"
    },
    {
      "id": "nmdc:sty-11-msexsy29",
      "description": "The LLNL Soil Microbiome Scientific Focus Area (SFA)—Microbes Persist: Systems Biology of the Soil Microbiome—seeks to understand how microbial ecophysiology, population dynamics, and microbe–mineral–organic matter interactions regulate the persistence of microbial residues in soil under changing moisture regimes. Members of the soil microbiome (bacteria, archaea, fungi, microfauna, and viruses) play key roles in soil carbon turnover and the stabilization of persistent organic matter via their metabolic activities, cellular biochemistry, and extracellular products. Soils store more carbon than the atmosphere and biosphere combined, yet the mechanisms that regulate soil carbon remain elusive. Microbial residues are a primary ingredient in soil organic matter (SOM), a pool that is critical to agriculture, healthy ecosystems, and Earth's climate.",
      "funding_sources": [
        "The U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under contractDE-AC52- 07NA27344."
      ],
      "title": "Microbes Persist Scientific Focus Area",
      "websites": [
        "https://sc-programs.llnl.gov/biological-and-environmental-research-at-llnl/soil-microbiome"
      ],
      "study_category": "research_study"
    },
    {
      "id": "nmdc:sty-11-nxrz9m96",
      "description": "The National Science Foundation's National Ecological Observatory Network (NEON) is a continental-scale observation facility operated by Battelle and designed to collect long-term open access ecological data to better understand how U.S. ecosystems are changing. NEON monitors ecosystems across the United States. Freshwater ecosystems include streams, rivers, and lakes while terrestrial ecosystems span from deserts to tropical forests.",
      "funding_sources": [
        "The National Ecological Observatory Network is a major facility fully funded by the National Science Foundation, NSF#1724433."
      ],
      "title": "National Ecological Observatory Network (NEON)",
      "websites": [
        "https://www.neonscience.org/"
      ],
      "study_category": "consortium"
    }
  ]
}
```

eecavanna commented 8 months ago

I confirmed Dagster runs apply_metadata_in OK with that JSON content, as of the following (unmerged) commits, on my local machine:

eecavanna commented 8 months ago

The fix for this issue was deployed to production as part of nmdc-runtime version v1.0.10.