Open aclum opened 1 year ago
Transferring to nmdc-schema per conversations with Mark. The suggestion was to set minimum cardinality in the schema. This needs to be coordinated with workflows so the code that generates submissions gets updated. @corilo @Michal-Babins @mbthornton-lbl
@eecavanna offered to run a check in python to see what slots are populated with empty lists so we can reach out to the folks that write these documents.
This API URL shows an example of a record with an asserted empty gold_sequencing_project_identifiers list:
{
"id": "emsl:739472",
"name": "Brodie_158_MeOH_R3_23Mar19_HESI_Neg",
"description": "High resolution MS spectra only",
"has_input": [
"igsn:IEWFS000S"
],
"has_output": [
"emsl:output_739472"
],
"part_of": [
"gold:Gs0135149"
],
"instrument_name": "21T_Agilent",
"omics_type": {
"has_raw_value": "Organic Matter Characterization"
},
"processing_institution": "EMSL",
"type": "nmdc:OmicsProcessing",
"gold_sequencing_project_identifiers": []
}
That doesn't make it into the YAML output of the pure-export script in its current state. I think the YAML serializer refuses to write keys with empty values, at least in its default configuration.
Therefore, the nmdc-schema repo isn't ready to check for this sort of thing right now.
@turbomam does this mean we should convert this back to an nmdc-runtime issue? Can the PyPI package check this?
> @eecavanna offered to run a check in python to see what slots are populated with empty lists so we can reach out to the folks that write these documents.
I want to clarify the requirements: Generate a list consisting of the id of every document (from every collection) that has a field (any field, at any level of nesting) whose value is an empty list. Is that correct?
Example output:

| collection | id |
|---|---|
| foo_set | foo:1234 |
| foo_set | abc123 |
| bar_set | 123 |
| ... | ... |
Are there any collections you'd be OK with the script ignoring? (The more data there is to process, the longer the script will take to run... but it might not be a difference any of us notices.)
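As a minimal sketch of that requirement (top-level fields only, for now), the check could look like the following. The names `db` (a pymongo Database handle) and `collection_names_to_scan` are assumptions, not actual code from this repo:

```python
def empty_list_fields(doc: dict) -> list:
    """Return the names of top-level fields whose value is an empty list."""
    return [key for key, value in doc.items()
            if isinstance(value, list) and not value]


def scan_collections(db, collection_names_to_scan):
    """Yield (collection_name, document_id) for each document that has at
    least one top-level field whose value is an empty list.

    `db` is assumed to be a pymongo Database handle.
    """
    for collection_name in collection_names_to_scan:
        for doc in db[collection_name].find():
            if empty_list_fields(doc):
                yield collection_name, doc.get("id", str(doc.get("_id")))
```

The `(collection_name, id)` pairs this yields correspond to the two columns in the example output above.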
> does this mean we should convert this back to an nmdc-runtime issue?

I wouldn't object.
> Can the PyPI package check this?
I don't think it would help, at least with the way I was trying to check. I was starting by using the nmdc-schema pure-export command to dump MongoDB contents. That apparently refuses to write False-like values, such as empty lists. The advantage of pure-export is that it wraps the MongoDB contents in the corresponding Database slots.
If we want to use LinkML validation to check for empty lists, I think it would be more helpful for somebody else to write a different dumper that includes the Database slot wrapping.
But we/I should check with other LinkML experts like @cmungall or @pkalita-lbl to see if they have any insights into LinkML's ability to recognize empty lists.
> Are there any collections you'd be OK with the script ignoring?
At the very minimum, don't bother checking any collection whose name isn't a Database slot.
pure-export has code that addresses the selection of dump-worthy collections, but
@aclum is this discussion relevant to @eecavanna's question?:
From what I've seen, this issue is limited to external identifier slots. Yes, this should be run against prod.
> At the very minimum, don't bother checking any collection whose name isn't a Database slot.
Thanks. I can use this snippet to determine the collections that are in both the schema and the database:
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict
# ...
# Make a list of names of the collections that are in the schema.
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()
collection_names_in_schema = nmdc_jsonschema["$defs"]["Database"]["properties"].keys()
# Make a list of names of the collections that are in the database.
# Note: `db` is a pymongo reference to the nmdc database
collection_names_in_database: list[str] = db.list_collection_names()
# Make a list of the collection names that are in both of those lists.
collection_names_to_scan = list(set(collection_names_in_schema).intersection(set(collection_names_in_database)))
Here's the list of collections I came up with using that snippet:
Collections to scan (23 collections):
study_set
biosample_set
metagenome_sequencing_activity_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set
activity_set
processed_sample_set
metagenome_assembly_set
extraction_set
metagenome_annotation_activity_set
nom_analysis_activity_set
metatranscriptome_activity_set
omics_processing_set
material_sample_set
pooling_set
metabolomics_analysis_activity_set
functional_annotation_agg
mags_activity_set
data_object_set
metaproteomics_analysis_activity_set
library_preparation_set
collecting_biosamples_from_site_set
field_research_site_set
Here are the numbers of documents in each of those collections (as of right now):
study_set (19 documents)
biosample_set (7594 documents)
metagenome_sequencing_activity_set (631 documents)
read_based_taxonomy_analysis_activity_set (3053 documents)
read_qc_analysis_activity_set (3114 documents)
activity_set (0 documents)
processed_sample_set (5750 documents)
metagenome_assembly_set (2940 documents)
extraction_set (2127 documents)
metagenome_annotation_activity_set (2645 documents)
nom_analysis_activity_set (1985 documents)
metatranscriptome_activity_set (55 documents)
omics_processing_set (6214 documents)
material_sample_set (0 documents)
pooling_set (1491 documents)
metabolomics_analysis_activity_set (209 documents)
functional_annotation_agg (11822821 documents)
mags_activity_set (2645 documents)
data_object_set (138120 documents)
metaproteomics_analysis_activity_set (52 documents)
library_preparation_set (2132 documents)
collecting_biosamples_from_site_set (0 documents)
field_research_site_set (110 documents)
That's awesome, @eecavanna. Could you please enhance your report by giving one example of an empty list from each collection? Ideally the enhanced report would list the id of the entity owning an empty list, and the slot that links that entity to the empty list, like nmdc:sty-99-123456; has_journal_retractions
I have a question that will influence the complexity of the search algorithm I use.
Are the empty lists you want to find, always in top-level slots; or are they ever in lower-level/nested slots?
Here's an example JSON object to illustrate what I mean by "top-level" slots versus "lower-level/nested" slots:
{
  "id": "foo:123",
  "list_e": [],  // <-- empty list in top-level slot
  "obj_e": {},
  "list_f": [ "bar", { "name": "Fred", "age": 123 }, [], 789 ],  // <-- empty list within lower-level/nested slot
  "obj_f": { "baz": [] }  // <-- empty list in lower-level/nested slot
}
In that example JSON object: id, list_e, obj_e, list_f, and obj_f are top-level slots; list_f[1].name and obj_f.baz are lower-level/nested slots.
If I only check top-level slots, the algorithm won't involve recursion. If I also check lower-level/nested slots, the algorithm will involve recursion.
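For concreteness, the recursive (nested) case might look like this sketch; it is illustrative only, not the script's actual code:

```python
def has_empty_list(value) -> bool:
    """Return True if `value` is, or contains at any depth, an empty list.

    Recurses into both dict values and list items, so it catches empty
    lists in lower-level/nested slots as well as top-level ones.
    """
    if isinstance(value, list):
        return len(value) == 0 or any(has_empty_list(item) for item in value)
    if isinstance(value, dict):
        return any(has_empty_list(v) for v in value.values())
    return False
```

By contrast, the non-recursive, top-level-only check is a single comprehension: `any(isinstance(v, list) and not v for v in doc.values())`.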
Could you please enhance your report by giving one example of an empty set from each collection?
Yes, I'll include that info.
I want to clarify that the lists of collections I posted above show all collections that are in both the schema and the database. It was not a report of collections having documents that contain empty lists. I posted that to share how much data the Python script would be searching through, after filtering out the irrelevant collections.
Are the empty lists you want to find, always in top-level slots; or are they ever in lower-level/nested slots?
Assuming it is the former (i.e. always in top-level slots): I have generated a report. It is a 7 MB CSV file with 108,850 rows of data in it. Here's a screenshot of the top of the file, to show its structure:
I spot checked two rows from the report and found they did, indeed, refer to top-level slots whose values were empty lists. Here's an example:
@turbomam and @aclum, I will send you the 7 MB report via Slack.
> Are the empty lists you want to find, always in top-level slots; or are they ever in lower-level/nested slots?
We need to check nested slots, too. This might be made more efficient by making a list of multivalued slots in advance. You could do that by making a SchemaView and iterating over all slots, checking whether multivalued is True for each.
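One way to build that list in advance is sketched below, against the JSON Schema artifact (as returned by `get_nmdc_jsonschema_dict`) rather than a `SchemaView`; the idea is the same: collect every property typed as an array. The helper name and the sample schema here are illustrative, not from the actual codebase:

```python
def multivalued_slots(class_def: dict) -> list:
    """Given one class's JSON Schema definition, return the names of its
    array-typed (i.e., multivalued) properties, sorted."""
    return sorted(
        name
        for name, prop in class_def.get("properties", {}).items()
        if prop.get("type") == "array"
    )


# Hypothetical usage against the real schema (names assumed):
#   from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict
#   defs = get_nmdc_jsonschema_dict()["$defs"]
#   multivalued_by_class = {cls: multivalued_slots(d) for cls, d in defs.items()}
```

With that map in hand, the scan could skip any slot that is not multivalued, rather than inspecting every value of every document.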
Count of document_id by collection_name and own_key_pointing_to_empty_list:

| collection_name | alternative_identifiers | gold_analysis_project_identifiers | gold_biosample_identifiers | gold_sequencing_project_identifiers | gold_study_identifiers | has_metabolite_quantifications | mags_list | part_of | Total Result |
|---|---|---|---|---|---|---|---|---|---|
| biosample_set | | | 4007 | | | | | | 4007 |
| data_object_set | 96203 | | | | | | | | 96203 |
| mags_activity_set | | | | | | | 2269 | | 2269 |
| metabolomics_analysis_activity_set | | | | | | 1 | | | 1 |
| metagenome_annotation_activity_set | | 2033 | | | | | | | 2033 |
| omics_processing_set | | | | 3815 | | | | | 3815 |
| read_based_taxonomy_analysis_activity_set | | | | | | | | 518 | 518 |
| study_set | | | | | 4 | | | | 4 |
| Total Result | 96203 | 2033 | 4007 | 3815 | 4 | 1 | 2269 | 518 | 108850 |
I updated the report so it checks nested slots, too (at arbitrary depth). I also updated it to include the path to each empty list. For example:
For this document...
{ a: { b: [ {}, { c: [] }, "foo" ] } }
...the path to the empty list would be reported as...
a.b[1].c
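A path-reporting search along those lines can be sketched as follows (illustrative only; the real script may differ):

```python
def find_empty_list_paths(value, path: str = "") -> list:
    """Return paths (e.g. 'a.b[1].c') to every empty list nested anywhere
    inside `value`. Dict keys are joined with '.'; list items with '[i]'."""
    paths = []
    if isinstance(value, list):
        if not value:
            paths.append(path)  # this list itself is empty
        for i, item in enumerate(value):
            paths.extend(find_empty_list_paths(item, f"{path}[{i}]"))
    elif isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            paths.extend(find_empty_list_paths(child, child_path))
    return paths
```

For the example document above, this returns ["a.b[1].c"].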
The resulting report (CSV file) was 80 MB and contained 709,679 rows, each of which contained a path to an empty list.
I spot checked two rows from the report and found they did, indeed, contain paths to empty lists. Here's an example:
@turbomam and @aclum, I will send you a 2.5 MB ZIP of the CSV file, via Slack.
Unique list of (collection, slot):

collection_name,path_to_empty_list
biosample_set,gold_biosample_identifiers
data_object_set,alternative_identifiers
mags_activity_set,mags_list
metabolomics_analysis_activity_set,has_metabolite_quantifications
metabolomics_analysis_activity_set,has_metabolite_quantifications[].alternative_identifiers
metagenome_annotation_activity_set,gold_analysis_project_identifiers
metaproteomics_analysis_activity_set,has_peptide_quantifications[].all_proteins
omics_processing_set,gold_sequencing_project_identifiers
read_based_taxonomy_analysis_activity_set,part_of
study_set,gold_study_identifiers
➕ @eecavanna generated the following Markdown table from the above CSV string:
| collection ( collection_name ) | slot ( path_to_empty_list ) |
|---|---|
| biosample_set | gold_biosample_identifiers |
| data_object_set | alternative_identifiers |
| mags_activity_set | mags_list |
| metabolomics_analysis_activity_set | has_metabolite_quantifications |
| metabolomics_analysis_activity_set | has_metabolite_quantifications[].alternative_identifiers |
| metagenome_annotation_activity_set | gold_analysis_project_identifiers |
| metaproteomics_analysis_activity_set | has_peptide_quantifications[].all_proteins |
| omics_processing_set | gold_sequencing_project_identifiers |
| read_based_taxonomy_analysis_activity_set | part_of |
| study_set | gold_study_identifiers |
Once the scripts that generate JSON for ingest into Mongo prod (the referenced issues) are updated, we can make a plan to have the Runtime API reject records with empty lists as invalid. Putting this in the backlog until then.
Processes that submit to Mongo continue to do this. The latest I've found this week are in ETL code; #373 has been in the backlog, and https://github.com/microbiomedata/nmdc_automation/issues/259
When doing the re-IDing work, I noticed in several places that we have documents with empty lists, i.e., many omics_processing_set records have gold_sequencing_project_identifiers with a list size of zero. Should the API reject records with empty lists, or continue to accept them?
@dwinston @shreddd @turbomam