microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Identify Mongo documents containing values that are empty lists (`[]`) #1306

Open aclum opened 1 year ago

aclum commented 1 year ago

While doing the re-IDing work, I noticed that in several places we have documents with empty lists, e.g. many omics_processing_set records have gold_sequencing_project_identifiers with a list size of zero. Should the API reject records with empty lists or continue to accept them?

@dwinston @shreddd @turbomam


aclum commented 1 year ago

Transferring to nmdc-schema per conversations with Mark. The suggestion was to set a minimum cardinality in the schema. This needs to be coordinated with workflows so that the code that generates submissions gets updated. @corilo @Michal-Babins @mbthornton-lbl

aclum commented 1 year ago

@eecavanna offered to run a check in Python to see what slots are populated with empty lists so we can reach out to the folks that write these documents.

turbomam commented 1 year ago

This API URL shows an example of a record with an asserted empty gold_sequencing_project_identifiers list:

https://api.microbiomedata.org/nmdcschema/ids/emsl%3A739472

turbomam commented 1 year ago
{
  "id": "emsl:739472",
  "name": "Brodie_158_MeOH_R3_23Mar19_HESI_Neg",
  "description": "High resolution MS spectra only",
  "has_input": [
    "igsn:IEWFS000S"
  ],
  "has_output": [
    "emsl:output_739472"
  ],
  "part_of": [
    "gold:Gs0135149"
  ],
  "instrument_name": "21T_Agilent",
  "omics_type": {
    "has_raw_value": "Organic Matter Characterization"
  },
  "processing_institution": "EMSL",
  "type": "nmdc:OmicsProcessing",
  "gold_sequencing_project_identifiers": []
}
turbomam commented 1 year ago

That doesn't make it into the YAML output of the pure-export script in its current state. I think the YAML serializer refuses to write keys with empty values, at least in its default configuration.

Therefore, the nmdc-schema repo isn't ready to check for this sort of thing right now.

aclum commented 1 year ago

@turbomam does this mean we should convert this back to an nmdc-runtime issue? Can the PyPI package check this?

eecavanna commented 1 year ago

> @eecavanna offered to run a check in Python to see what slots are populated with empty lists so we can reach out to the folks that write these documents.

I want to clarify the requirements: Generate a list consisting of the id of every document—from every collection—that has a field (any field, at any level of nesting) whose value is an empty list. Is that correct?

Example output:

collection  id
foo_set     foo:1234
foo_set     abc123
bar_set     123
...
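For illustration, a minimal pymongo sketch of that kind of top-level scan might look like the following; the connection URI, database name, and output format are placeholders rather than the actual script:

```python
from pymongo import MongoClient

# Placeholder connection details; the real URI and database name would differ.
client = MongoClient("mongodb://localhost:27017")
db = client["nmdc"]

for collection_name in db.list_collection_names():
    for doc in db[collection_name].find():
        # Report the document if any top-level field holds an empty list.
        if any(value == [] for value in doc.values()):
            print(collection_name, doc.get("id", doc["_id"]))
```

In practice the scan would likely be restricted to schema-described collections (see the snippet later in this thread) rather than every collection in the database.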

Are there any collections you'd be OK with the script ignoring? (The more data there is to process, the longer the script will take to run... but it might not be a difference any of us notices.)

turbomam commented 1 year ago

> does this mean we should convert this back to an nmdc-runtime issue?

I wouldn't object

> Can the PyPI package check this?

I don't think it would help, at least with the way I was trying to check. I was starting by using the nmdc-schema pure-export command to dump MongoDB contents. That apparently refuses to write False-like values, such as empty lists. The advantage of pure-export is that it wraps the MongoDB contents in the corresponding Database slots.

If we want to use LinkML validation to check for empty lists, I think it would be more helpful for somebody else to write a different dumper that includes the Database slot wrapping.

But we/I should check with other LinkML experts like @cmungall or @pkalita-lbl to see if they have any insights into LinkML's ability to recognize empty lists.

turbomam commented 1 year ago

> Are there any collections you'd be OK with the script ignoring?

At the very minimum, don't bother checking any collection whose name isn't a Database slot.

pure-export has code that addresses the selection of dump-worthy collections, but

@aclum is this discussion relevant to @eecavanna's question?:

aclum commented 1 year ago

From what I've seen this issue is limited to external identifier slots Yes, this should be run against prod.

eecavanna commented 1 year ago

> At the very minimum, don't bother checking any collection whose name isn't a Database slot.

Thanks. I can use this snippet to determine the collections that are in both the schema and the database:

from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict

# ...

# Make a list of names of the collections that are in the schema.
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()
collection_names_in_schema = nmdc_jsonschema["$defs"]["Database"]["properties"].keys()

# Make a list of names of the collections that are in the database.
# Note: `db` is a pymongo reference to the nmdc database
collection_names_in_database: list[str] = db.list_collection_names()

# Make a list of the collection names that are in both of those lists.
collection_names_to_scan = list(set(collection_names_in_schema).intersection(set(collection_names_in_database)))

Here's the list of collections I came up with using that snippet:

Collections to scan (23 collections):
study_set
biosample_set
metagenome_sequencing_activity_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set
activity_set
processed_sample_set
metagenome_assembly_set
extraction_set
metagenome_annotation_activity_set
nom_analysis_activity_set
metatranscriptome_activity_set
omics_processing_set
material_sample_set
pooling_set
metabolomics_analysis_activity_set
functional_annotation_agg
mags_activity_set
data_object_set
metaproteomics_analysis_activity_set
library_preparation_set
collecting_biosamples_from_site_set
field_research_site_set

Here are the numbers of documents in each of those collections (as of right now):

study_set (19 documents)
biosample_set (7594 documents)
metagenome_sequencing_activity_set (631 documents)
read_based_taxonomy_analysis_activity_set (3053 documents)
read_qc_analysis_activity_set (3114 documents)
activity_set (0 documents)
processed_sample_set (5750 documents)
metagenome_assembly_set (2940 documents)
extraction_set (2127 documents)
metagenome_annotation_activity_set (2645 documents)
nom_analysis_activity_set (1985 documents)
metatranscriptome_activity_set (55 documents)
omics_processing_set (6214 documents)
material_sample_set (0 documents)
pooling_set (1491 documents)
metabolomics_analysis_activity_set (209 documents)
functional_annotation_agg (11822821 documents)
mags_activity_set (2645 documents)
data_object_set (138120 documents)
metaproteomics_analysis_activity_set (52 documents)
library_preparation_set (2132 documents)
collecting_biosamples_from_site_set (0 documents)
field_research_site_set (110 documents)
turbomam commented 1 year ago

That's awesome, @eecavanna. Could you please enhance your report by giving one example of an empty list from each collection? Ideally, the enhanced report would list the id of the entity owning an empty list and the slot that links that entity to the empty list, like

nmdc:sty-99-123456; has_journal_retractions

eecavanna commented 1 year ago

I have a question that will influence the complexity of the search algorithm I use.

Are the empty lists you want to find always in top-level slots, or are they ever in lower-level/nested slots?

Here's an example JSON object to illustrate what I mean by "top-level" slots versus "lower-level/nested" slots:

{
  "id": "foo:123",
  "list_e": [],  // <-- empty list in top-level slot
  "obj_e": {},
  "list_f": [ "bar", { "name": "Fred", "age": 123 }, [], 789 ],  // <-- empty list within lower-level/nested slot
  "obj_f": { "baz": [] }  // <-- empty list in lower-level/nested slot
}

In that example JSON object:

If I only check top-level slots, the algorithm won't involve recursion. If I also check lower-level/nested slots, the algorithm will involve recursion.
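For example, the nested-slot variant would need a recursive helper along these lines (a hedged sketch, not the actual report code; names are illustrative):

```python
def contains_empty_list(value) -> bool:
    """Return True if `value` is, or contains at any depth, an empty list."""
    if value == []:
        return True
    if isinstance(value, dict):
        return any(contains_empty_list(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_empty_list(v) for v in value)
    return False
```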

eecavanna commented 1 year ago

> Could you please enhance your report by giving one example of an empty list from each collection?

Yes, I'll include that info.

I want to clarify that the lists of collections I posted above show all collections that are in both the schema and the database. It was not a report of collections having documents that contain empty lists. I posted that to share how much data the Python script would be searching through, after filtering out the irrelevant collections.

eecavanna commented 1 year ago

> Are the empty lists you want to find always in top-level slots, or are they ever in lower-level/nested slots?

Assuming it is the former (i.e. always in top-level slots): I have generated a report. It is a 7 MB CSV file with 108,850 rows of data in it. Here's a screenshot of the top of the file, to show its structure:

[screenshot: top rows of the CSV report, showing its structure]

I spot checked two rows from the report and found they did, indeed, refer to top-level slots whose values were empty lists. Here's an example:

[screenshot: a spot-checked row and the corresponding document]

@turbomam and @aclum, I will send you the 7 MB report via Slack.

turbomam commented 1 year ago

> Are the empty lists you want to find always in top-level slots, or are they ever in lower-level/nested slots?

We need to check nested slots, too. This might be made more efficient by making a list of multivalued slots in advance. You could do that by making a SchemaView and iterating over all slots, checking whether multivalued is True for each.
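A hedged sketch of that SchemaView approach (the schema file path is a placeholder; `SchemaView`, `all_slots()`, and the `multivalued` metaslot come from linkml-runtime):

```python
from linkml_runtime import SchemaView

# Placeholder path; point this at the nmdc-schema YAML file.
schema_view = SchemaView("path/to/nmdc.yaml")

# Only multivalued slots can legally hold lists, so only these slot names
# need to be considered when looking for empty-list values.
multivalued_slot_names = {
    slot_name
    for slot_name, slot in schema_view.all_slots().items()
    if slot.multivalued
}
```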

turbomam commented 1 year ago
Count of document_id by collection_name and own_key_pointing_to_empty_list:

| collection_name | own_key_pointing_to_empty_list | count |
| --- | --- | --- |
| biosample_set | gold_biosample_identifiers | 4007 |
| data_object_set | alternative_identifiers | 96203 |
| mags_activity_set | mags_list | 2269 |
| metabolomics_analysis_activity_set | has_metabolite_quantifications | 1 |
| metagenome_annotation_activity_set | gold_analysis_project_identifiers | 2033 |
| omics_processing_set | gold_sequencing_project_identifiers | 3815 |
| read_based_taxonomy_analysis_activity_set | part_of | 518 |
| study_set | gold_study_identifiers | 4 |
| Total Result | | 108850 |
eecavanna commented 1 year ago

I updated the report so it checks nested slots, too (at unlimited depth). I also updated it to include the path to each empty list. For example:

For this document...

{ a: { b: [ {}, { c: [] }, "foo" ] } }

...the path to the empty list would be reported as...

a.b[1].c
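A hedged sketch of the kind of recursive walk that could produce such paths (function and argument names are illustrative, not the actual report code):

```python
def paths_to_empty_lists(value, path: str = "") -> list[str]:
    """Return a path like 'a.b[1].c' for every empty list nested in `value`."""
    if value == []:
        return [path]
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            paths.extend(paths_to_empty_lists(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            paths.extend(paths_to_empty_lists(child, f"{path}[{index}]"))
    return paths
```

For the document above, `paths_to_empty_lists({"a": {"b": [{}, {"c": []}, "foo"]}})` returns `['a.b[1].c']`.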

The resulting report (CSV file) was 80 MB and contained 709,679 rows, each of which contained a path to an empty list.

I spot checked two rows from the report and found they did, indeed, contain paths to empty lists. Here's an example:

[screenshots: a spot-checked row and the corresponding document]

@turbomam and @aclum, I will send you a 2.5 MB ZIP of the CSV file, via Slack.

aclum commented 1 year ago

Unique list of collection, slot:

collection_name,path_to_empty_list
biosample_set,gold_biosample_identifiers
data_object_set,alternative_identifiers
mags_activity_set,mags_list
metabolomics_analysis_activity_set,has_metabolite_quantifications
metabolomics_analysis_activity_set,has_metabolite_quantifications[].alternative_identifiers
metagenome_annotation_activity_set,gold_analysis_project_identifiers
metaproteomics_analysis_activity_set,has_peptide_quantifications[].all_proteins
omics_processing_set,gold_sequencing_project_identifiers
read_based_taxonomy_analysis_activity_set,part_of
study_set,gold_study_identifiers


➕ @eecavanna generated the following Markdown table from the above CSV string:

| Unique list of collection (collection_name) | slot (path_to_empty_list) |
| --- | --- |
| biosample_set | gold_biosample_identifiers |
| data_object_set | alternative_identifiers |
| mags_activity_set | mags_list |
| metabolomics_analysis_activity_set | has_metabolite_quantifications |
| metabolomics_analysis_activity_set | has_metabolite_quantifications[].alternative_identifiers |
| metagenome_annotation_activity_set | gold_analysis_project_identifiers |
| metaproteomics_analysis_activity_set | has_peptide_quantifications[].all_proteins |
| omics_processing_set | gold_sequencing_project_identifiers |
| read_based_taxonomy_analysis_activity_set | part_of |
| study_set | gold_study_identifiers |
aclum commented 1 year ago

Once the scripts that generate JSON for ingest into Mongo prod (the referenced issues) are updated, we can make a plan to have the Runtime API reject records with empty lists as invalid. Putting this in the backlog until then.

aclum commented 2 months ago

Processes that submit to Mongo continue to do this. The latest I've found this week are in ETL code; #373 has been in the backlog, and there is also https://github.com/microbiomedata/nmdc_automation/issues/259.