Closed eecavanna closed 8 months ago
I think I found the next level of the issue. In an (unrelated) Python notebook, I ran:
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict  # nmdc_schema v9.1.0
nmdc_jsonschema = get_nmdc_jsonschema_dict()
for collection_name, spec in nmdc_jsonschema["properties"].items():
    print(collection_name, type(spec), spec["items"])
Not all of the items dictionaries have a $ref key at the top level.
Here's the full output of that notebook cell:
activity_set <class 'dict'> {'$ref': '#/$defs/WorkflowExecutionActivity'}
biosample_set <class 'dict'> {'$ref': '#/$defs/Biosample'}
collecting_biosamples_from_site_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/CollectingBiosamplesFromSite'}]}
data_object_set <class 'dict'> {'$ref': '#/$defs/DataObject'}
dissolving_activity_set <class 'dict'> {'$ref': '#/$defs/DissolvingActivity'}
extraction_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/Extraction'}]}
field_research_site_set <class 'dict'> {'$ref': '#/$defs/FieldResearchSite'}
functional_annotation_agg <class 'dict'> {'$ref': '#/$defs/FunctionalAnnotationAggMember'}
functional_annotation_set <class 'dict'> {'$ref': '#/$defs/FunctionalAnnotation'}
genome_feature_set <class 'dict'> {'$ref': '#/$defs/GenomeFeature'}
library_preparation_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/LibraryPreparation'}]}
mags_activity_set <class 'dict'> {'$ref': '#/$defs/MagsAnalysisActivity'}
material_sample_set <class 'dict'> {'$ref': '#/$defs/MaterialSample'}
material_sampling_activity_set <class 'dict'> {'$ref': '#/$defs/MaterialSamplingActivity'}
metabolomics_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/MetabolomicsAnalysisActivity'}
metagenome_annotation_activity_set <class 'dict'> {'$ref': '#/$defs/MetagenomeAnnotationActivity'}
metagenome_assembly_set <class 'dict'> {'$ref': '#/$defs/MetagenomeAssembly'}
metagenome_sequencing_activity_set <class 'dict'> {'$ref': '#/$defs/MetagenomeSequencingActivity'}
metaproteomics_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/MetaproteomicsAnalysisActivity'}
metatranscriptome_activity_set <class 'dict'> {'$ref': '#/$defs/MetatranscriptomeActivity'}
nom_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/NomAnalysisActivity'}
omics_processing_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/OmicsProcessing'}]}
planned_process_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/CollectingBiosamplesFromSite'}, {'$ref': '#/$defs/BiosampleProcessing'}, {'$ref': '#/$defs/SubSamplingProcess'}, {'$ref': '#/$defs/OmicsProcessing'}, {'$ref': '#/$defs/Pooling'}, {'$ref': '#/$defs/Extraction'}, {'$ref': '#/$defs/LibraryPreparation'}]}
pooling_set <class 'dict'> {'anyOf': [{'$ref': '#/$defs/Pooling'}]}
processed_sample_set <class 'dict'> {'$ref': '#/$defs/ProcessedSample'}
reaction_activity_set <class 'dict'> {'$ref': '#/$defs/ReactionActivity'}
read_based_taxonomy_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/ReadBasedTaxonomyAnalysisActivity'}
read_qc_analysis_activity_set <class 'dict'> {'$ref': '#/$defs/ReadQcAnalysisActivity'}
study_set <class 'dict'> {'$ref': '#/$defs/Study'}
Right. From nmdc-schema==9.1.0's nmdc_materialized_patterns.schema.json:
for collection_name, spec in nmdc_jsonschema["properties"].items():
    if collection_name.endswith("_set"):
        if not spec["items"].get("$ref"):
            print(collection_name)
collecting_biosamples_from_site_set
extraction_set
library_preparation_set
omics_processing_set
planned_process_set
pooling_set
And by inspection, this is because of the new-to-me use of anyOf for referencing the items of array-typed properties, i.e. I had only coded for the case of e.g. processed_sample_set below, rather than e.g. pooling_set:
"pooling_set": {
    "items": {
        "anyOf": [
            {
                "$ref": "#/$defs/Pooling"
            }
        ]
    },
    "type": "array"
},
"processed_sample_set": {
    "description": "This property links a database object to the set of processed samples within it.",
    "items": {
        "$ref": "#/$defs/ProcessedSample"
    },
    "type": "array"
},
Looks like there are 3 uses of $ref in nmdc_runtime.util, and one in nmdc_runtime.api.endpoints.find, that need to be reworked (ideally routed through a new common helper function) to deal with the anyOf case.
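For illustration, a common helper along those lines might look like the minimal sketch below. The name get_class_names_from_items and the sample dictionaries are hypothetical, not taken from nmdc_runtime. It handles the two shapes seen in the output above, and simply skips any anyOf member that is not a plain $ref (an assumption about how such members should be treated).

```python
from typing import List


def get_class_names_from_items(items: dict) -> List[str]:
    """Return the class names referenced by an array property's "items" spec.

    Hypothetical helper: accepts either {"$ref": ...} or
    {"anyOf": [{"$ref": ...}, ...]}.
    """
    if "$ref" in items:
        refs = [items["$ref"]]
    elif isinstance(items.get("anyOf"), list):
        # Keep only the anyOf members that are plain references.
        refs = [member["$ref"] for member in items["anyOf"] if "$ref" in member]
    else:
        refs = []
    # "#/$defs/Pooling" -> "Pooling"
    return [ref.split("/")[-1] for ref in refs]


print(get_class_names_from_items({"$ref": "#/$defs/ProcessedSample"}))
print(get_class_names_from_items({"anyOf": [{"$ref": "#/$defs/Pooling"}]}))
```

Returning a list in both cases means callers would not need to know which of the two shapes a given collection uses.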
I drafted this "fixed" function:
from functools import lru_cache


@lru_cache
def get_type_collections():
    """Returns a dictionary mapping class names to Mongo collection names"""
    mappings = {}

    def get_class_name_from_ref(s: str):
        # "#/$defs/Pooling" -> "Pooling"
        return s.split("/")[-1]

    # nmdc_jsonschema is the module-level schema dict defined elsewhere in this file.
    for collection_name, spec in nmdc_jsonschema["properties"].items():
        if collection_name.endswith("_set"):
            items = spec["items"]
            if "$ref" in items:
                # Simple case: items reference a single class directly.
                ref = items["$ref"]
                class_name = get_class_name_from_ref(ref)
                mappings[class_name] = collection_name
            elif "anyOf" in items and isinstance(items["anyOf"], list):
                # anyOf case: map every referenced class to this collection.
                for item in items["anyOf"]:
                    ref = item["$ref"]
                    class_name = get_class_name_from_ref(ref)
                    mappings[class_name] = collection_name
    return mappings
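One thing the notebook output above suggests: some classes are listed under more than one collection (for example, Pooling appears under both pooling_set and planned_process_set), so a plain class-to-collection dict keeps only whichever collection is iterated last. If that turns out to matter to callers, a variant that keeps every collection could look like the sketch below. The function name get_class_name_to_collection_names, its schema parameter, and the trimmed-down sample schema are assumptions for illustration, not part of nmdc_runtime.

```python
from collections import defaultdict
from typing import Dict, List


def get_class_name_to_collection_names(schema: dict) -> Dict[str, List[str]]:
    """Map each class name to every "*_set" collection whose items reference it."""
    mappings = defaultdict(list)
    for collection_name, spec in schema["properties"].items():
        if not collection_name.endswith("_set"):
            continue
        items = spec.get("items", {})
        # Normalize both shapes to a list of {"$ref": ...} dicts.
        members = items["anyOf"] if isinstance(items.get("anyOf"), list) else [items]
        for member in members:
            ref = member.get("$ref")
            if ref:
                # "#/$defs/Pooling" -> "Pooling"
                mappings[ref.split("/")[-1]].append(collection_name)
    return dict(mappings)


# Trimmed-down sample in the shape of the nmdc-schema 9.1.0 output above.
sample_schema = {
    "properties": {
        "planned_process_set": {
            "items": {"anyOf": [{"$ref": "#/$defs/Pooling"}, {"$ref": "#/$defs/Extraction"}]},
            "type": "array",
        },
        "pooling_set": {"items": {"anyOf": [{"$ref": "#/$defs/Pooling"}]}, "type": "array"},
        "study_set": {"items": {"$ref": "#/$defs/Study"}, "type": "array"},
    }
}

result = get_class_name_to_collection_names(sample_schema)
print(result)
```

Whether a class should resolve to one collection or several is a design decision for the runtime; this sketch only makes the overlap visible.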
@eecavanna do you want to take a stab at this with a PR?
Sure, I'll open it in a couple mins.
Heads up: Looks like there is more code in that file that will be affected by the presence of anyOf.
Oops! You already pointed this out, in general, here.
Here's the JSON content that was submitted to the /metadata/json:submit endpoint when the initial error occurred:
I confirmed Dagster runs apply_metadata_in OK with that JSON content, as of the following (unmerged) commits, on my local machine:
get_type_collections function had been modified within the branch)
The fix for this issue was deployed to production as part of nmdc-runtime version v1.0.10.
@brynnz22 submitted a JSON file to the /metadata/json:submit endpoint. The corresponding run on Dagit failed:
Dagit shows the following error (here is a full stack trace):
Here's what the offending line of code contains:
https://github.com/microbiomedata/nmdc-runtime/blob/1eacc43922104be514d626ac4831114f4378d2e1/nmdc_runtime/util.py#L30-L36
nmdc_jsonschema is defined further down the file: https://github.com/microbiomedata/nmdc-runtime/blob/1eacc43922104be514d626ac4831114f4378d2e1/nmdc_runtime/util.py#L68
CC: @dwinston