microbiomedata / nmdc_automation

Prototype automation

update scripts making nmdc json documents to exclude key if an empty array for sequencing workflows #24

Closed aclum closed 8 months ago

aclum commented 11 months ago

Is your feature request related to a problem? Please describe.
Best practice for mongo is to leave a key out if the list of values is empty. We've noticed documents in mongo which use keys whose values are an empty list.

Describe the solution you'd like
Update the JSON creation code to exclude keys if the value is an empty array. The following keys have empty arrays in mongo prod:

collection_name,path_to_empty_list
data_object_set,alternative_identifiers
mags_activity_set,mags_list
metagenome_annotation_activity_set,gold_analysis_project_identifiers
read_based_taxonomy_analysis_activity_set,part_of

part_of can be deprecated for the read-based taxonomy workflow execution activity: it is not being populated and is redundant with was_informed_by, which is what the data portal requires.

Describe alternatives you've considered
Leave data in mongo as is.

Acceptance Criteria
JSON documents created by sequencing workflows have no keys where the value is an empty array.

Who will use this feature/enhancement? Internal staff and possibly external users.
When will they use it? When querying the API or using tools like Studio 3T/Compass or pymongo.
How will they use it? Queries and data manipulation will be easier.
How will they test it to make sure it's working? Eric can test this with code he wrote to check for empty arrays in https://github.com/microbiomedata/nmdc-schema/issues/1306
Is the request achievable? During one sprint? Yes, yes.
What is your definition of done for this request? See acceptance criteria. Data could be cleaned up in mongo prod after the scripts are updated; deleting these keys with empty lists is out of scope for this ticket.
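For illustration, the acceptance criterion could be checked with something like the sketch below. This is not Eric's code from the linked nmdc-schema issue; the function name `empty_list_paths` and the sample document are hypothetical, and only the Python standard library is assumed.

```python
from typing import Any, Iterator


def empty_list_paths(value: Any, path: str = "") -> Iterator[str]:
    """Yield dotted paths to every empty list nested inside value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from empty_list_paths(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        if not value:
            yield path
        for child in value:
            # List items share the parent's path (no index is appended).
            yield from empty_list_paths(child, path)


# Hypothetical sample document for the sketch.
doc = {
    "id": "nmdc:wfmgan-11-4sc85678.1",
    "has_input": ["nmdc:dobj-11-5eb6v689"],
    "has_failure_categorization": [],
    "gold_analysis_project_identifiers": [],
}
print(list(empty_list_paths(doc)))
```

A document meets the criterion when the generator yields nothing for it.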

mbthornton-lbl commented 10 months ago

The underlying cause was incorrect serialization of the output; we are now using the correct serialization from linkml_runtime.dumpers.
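To make the intended effect concrete, here is a standard-library sketch of dropping empty values before a record is written out. It is not the linkml_runtime implementation and not the code used in this repo; the helper name `prune_empty` is hypothetical, and the sample record simply combines fields mentioned in this thread.

```python
import json
from typing import Any


def prune_empty(value: Any) -> Any:
    """Return a copy of value without dict keys whose value is None, [] or {}."""
    if isinstance(value, dict):
        pruned = {k: prune_empty(v) for k, v in value.items()}
        return {
            k: v
            for k, v in pruned.items()
            if v is not None and v != [] and v != {}
        }
    if isinstance(value, list):
        return [prune_empty(v) for v in value]
    return value


# Hypothetical record for the sketch; note the legitimate zero value.
record = {
    "id": "nmdc:dobj-11-75skzn36",
    "file_size_bytes": 0,
    "alternative_identifiers": [],
    "type": "nmdc:DataObject",
}
print(json.dumps(prune_empty(record)))
```

The comparison is deliberately against `None`, `[]` and `{}` rather than general falsiness, so a real value such as `"file_size_bytes": 0` is kept.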

mbthornton-lbl commented 10 months ago

Example after correcting serialization:

[
    {
        "data_object_set": [
            {
                "id": "nmdc:dobj-11-k7vny888",
                "name": "9422.8.132674.GTTTCG.fastq.gz",
                "description": "Raw sequencer read data",
                "file_size_bytes": 2861414297,
                "data_object_type": "Metagenome Raw Reads",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-019yes10",
                "name": "nmdc_wfrqc-11-zma0ys31.1_filtered.fastq.gz",
                "description": "Filtered Reads for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 2571324879,
                "md5_checksum": "7bf778baef033d36f118f8591256d6ef",
                "data_object_type": "Filtered Sequencing Reads",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrqc-11-zma0ys31.1/nmdc_wfrqc-11-zma0ys31.1_filtered.fastq.gz",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-hty12n62",
                "name": "nmdc_wfrqc-11-zma0ys31.1_filterStats.txt",
                "description": "Filtered Stats for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 290,
                "md5_checksum": "b99ce8adc125c95f0bfdadf36a3f6848",
                "data_object_type": "QC Statistics",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrqc-11-zma0ys31.1/nmdc_wfrqc-11-zma0ys31.1_filterStats.txt",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-gast3j11",
                "name": "nmdc_wfmgas-11-3jvymb63.1_contigs.fna",
                "description": "Assembled contigs fasta for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 91134523,
                "md5_checksum": "b96c8e7796616a8eefe473bff2c62e52",
                "data_object_type": "Assembly Contigs",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_contigs.fna",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-bkza5366",
                "name": "nmdc_wfmgas-11-3jvymb63.1_scaffolds.fna",
                "description": "Assembled scaffold fasta for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 90622585,
                "md5_checksum": "6ca496a8b9b298278ad2b4010a7c8cb2",
                "data_object_type": "Assembly Scaffolds",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_scaffolds.fna",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-v9xfxp70",
                "name": "nmdc_wfmgas-11-3jvymb63.1_covstats.txt",
                "description": "Metagenome Contig Coverage Stats for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 14431055,
                "md5_checksum": "19782102f68575b03b7c12dd3d48e840",
                "data_object_type": "Assembly Coverage Stats",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_covstats.txt",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-dz2mw103",
                "name": "nmdc_wfmgas-11-3jvymb63.1_assembly.agp",
                "description": "Assembled AGP file for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 14581247,
                "md5_checksum": "419b294106e3fca4a06d18fd3c8e9181",
                "data_object_type": "Assembly AGP",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_assembly.agp",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-75skzn36",
                "name": "nmdc_wfmgas-11-3jvymb63.1_pairedMapped_sorted.bam",
                "description": "Metagenome Alignment BAM file for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 0,
                "md5_checksum": "d41d8cd98f00b204e9800998ecf8427e",
                "data_object_type": "Assembly Coverage BAM",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_pairedMapped_sorted.bam",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-ppa5pg23",
                "name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_report.tsv",
                "description": "Gottcha2 TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 13174,
                "md5_checksum": "bc7c1bda004aab357c8f6cf5a42242f9",
                "data_object_type": "GOTTCHA2 Classification Report",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_report.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-0yn4b055",
                "name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_report_full.tsv",
                "description": "Gottcha2 full TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 1035818,
                "md5_checksum": "9481434cadd0d6c154e2ec4c11ef0e04",
                "data_object_type": "GOTTCHA2 Report Full",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_report_full.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-ty0z3p61",
                "name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_krona.html",
                "description": "Gottcha2 Krona HTML report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 262669,
                "md5_checksum": "6b5bc6ce7f11c1336a5f85a98fc18541",
                "data_object_type": "GOTTCHA2 Krona Plot",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_krona.html",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-e6h68y35",
                "name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_classification.tsv",
                "description": "Centrifuge classification TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 2189843623,
                "md5_checksum": "933c71bbc2f4a2e84d50f0d3864cf940",
                "data_object_type": "Centrifuge Taxonomic Classification",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_classification.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-chgp8k25",
                "name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_report.tsv",
                "description": "Centrifuge TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 260134,
                "md5_checksum": "1a208e2519770ef50740ac39f1b9ba9a",
                "data_object_type": "Centrifuge Classification Report",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_report.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-0wbjqw24",
                "name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_krona.html",
                "description": "Centrifuge Krona HTML report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 2343980,
                "md5_checksum": "f112a3840464ae7a9cf4a3bf295edd5c",
                "data_object_type": "Centrifuge Krona Plot",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_krona.html",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-xteq6n75",
                "name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_classification.tsv",
                "description": "Kraken classification TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 1785563917,
                "md5_checksum": "7ca01ea379f0baed96f87d1435925f95",
                "data_object_type": "Kraken2 Taxonomic Classification",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_classification.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-1n5y1278",
                "name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_report.tsv",
                "description": "Kraken2 TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 699896,
                "md5_checksum": "c85f2f2b4a518c4adb23970448a5cb45",
                "data_object_type": "Kraken2 Classification Report",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_report.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-rtjb8n73",
                "name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_krona.html",
                "description": "Kraken2 Krona HTML report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 4221977,
                "md5_checksum": "94ee1bc2dc74830a21d5c3471d6cf223",
                "data_object_type": "Kraken2 Krona Plot",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_krona.html",
                "type": "nmdc:DataObject"
            }
        ],
        "metagenome_assembly_set": [
            {
                "id": "nmdc:wfmgas-11-3jvymb63.1",
                "name": "Metagenome Assembly Activity for nmdc:omprc-11-bn8jcq58",
                "started_at_time": "2021-10-11T02:28:26Z",
                "ended_at_time": "2021-10-11T04:56:04+00:00",
                "was_informed_by": "nmdc:omprc-11-bn8jcq58",
                "execution_resource": "NERSC-Cori",
                "git_url": "https://github.com/microbiomedata/metaAssembly",
                "has_input": [
                    "nmdc:dobj-11-019yes10"
                ],
                "has_output": [
                    "nmdc:dobj-11-gast3j11",
                    "nmdc:dobj-11-bkza5366",
                    "nmdc:dobj-11-v9xfxp70",
                    "nmdc:dobj-11-dz2mw103",
                    "nmdc:dobj-11-75skzn36"
                ],
                "type": "nmdc:MetagenomeAssembly",
                "part_of": [
                    "nmdc:omprc-11-bn8jcq58"
                ],
                "version": "v1.0.3",
                "asm_score": 6.577,
                "scaffolds": 169645,
                "scaf_logsum": 215363,
                "scaf_powsum": 24422,
                "scaf_max": 68135,
                "scaf_bp": 83496490,
                "scaf_n50": 45550,
                "scaf_n90": 141870,
                "scaf_l50": 470,
                "scaf_l90": 290,
                "scaf_n_gt50k": 1,
                "scaf_l_gt50k": 68135,
                "scaf_pct_gt50k": 0.08160224,
                "contigs": 169784,
                "contig_bp": 83494920,
                "ctg_n50": 45584,
                "ctg_l50": 470,
                "ctg_n90": 141996,
                "ctg_l90": 290,
                "ctg_logsum": 214373,
                "ctg_powsum": 24284,
                "ctg_max": 68135,
                "gap_pct": 0.00188,
                "gc_std": 0.11726,
                "gc_avg": 0.46001
            }
        ],
        "omics_processing_set": [
            {
                "id": "nmdc:omprc-11-bn8jcq58",
                "name": "Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T2_23-Sept-14",
                "description": "Sterilized sand packs were incubated back in the ground and collected at time point T2.",
                "has_input": [
                    "nmdc:bsm-11-qq8s6x03"
                ],
                "add_date": "2015-05-28",
                "gold_sequencing_project_identifiers": [
                    "gold:Gp0115663"
                ],
                "has_output": [
                    "nmdc:dobj-11-k7vny888"
                ],
                "mod_date": "2021-06-15",
                "ncbi_project_name": "Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T2_23-Sept-14",
                "omics_type": {
                    "has_raw_value": "Metagenome"
                },
                "part_of": [
                    "nmdc:sty-11-aygzgv51"
                ],
                "principal_investigator": {
                    "has_raw_value": "James Stegen"
                },
                "processing_institution": "JGI",
                "type": "nmdc:OmicsProcessing"
            }
        ],
        "read_qc_analysis_activity_set": [
            {
                "id": "nmdc:wfrqc-11-zma0ys31.1",
                "name": "Read QC Activity for nmdc:omprc-11-bn8jcq58",
                "started_at_time": "2021-10-11T02:28:26Z",
                "ended_at_time": "2021-10-11T04:56:04+00:00",
                "was_informed_by": "nmdc:omprc-11-bn8jcq58",
                "execution_resource": "NERSC-Cori",
                "git_url": "https://github.com/microbiomedata/ReadsQC",
                "has_input": [
                    "nmdc:dobj-11-k7vny888"
                ],
                "has_output": [
                    "nmdc:dobj-11-019yes10",
                    "nmdc:dobj-11-hty12n62"
                ],
                "type": "nmdc:ReadQcAnalysisActivity",
                "part_of": [
                    "nmdc:omprc-11-bn8jcq58"
                ],
                "version": "v1.0.8",
                "input_read_count": 32238374,
                "output_read_count": 30774080,
                "input_read_bases": 4867994474,
                "output_read_bases": 4608772924
            }
        ],
        "read_based_taxonomy_analysis_activity_set": [
            {
                "id": "nmdc:wfrbt-11-e79d5x03.1",
                "name": "Readbased Taxonomy Analysis Activity for nmdc:omprc-11-bn8jcq58",
                "started_at_time": "2021-10-11T02:28:26Z",
                "ended_at_time": "2021-10-11T04:56:04+00:00",
                "was_informed_by": "nmdc:omprc-11-bn8jcq58",
                "execution_resource": "NERSC-Cori",
                "git_url": "https://github.com/microbiomedata/ReadbasedAnalysis",
                "has_input": [
                    "nmdc:dobj-11-019yes10"
                ],
                "has_output": [
                    "nmdc:dobj-11-ppa5pg23",
                    "nmdc:dobj-11-0yn4b055",
                    "nmdc:dobj-11-ty0z3p61",
                    "nmdc:dobj-11-e6h68y35",
                    "nmdc:dobj-11-chgp8k25",
                    "nmdc:dobj-11-0wbjqw24",
                    "nmdc:dobj-11-xteq6n75",
                    "nmdc:dobj-11-1n5y1278",
                    "nmdc:dobj-11-rtjb8n73"
                ],
                "type": "nmdc:ReadBasedTaxonomyAnalysisActivity",
                "part_of": [
                    "nmdc:omprc-11-bn8jcq58"
                ],
                "version": "v1.0.5"
            }
        ]
    }
]

mbthornton-lbl commented 10 months ago

@aclum We did not create a separate PR for this issue, but it is fixed, and I believe it can be closed.

aclum commented 10 months ago

looks good, this can be closed.

aclum commented 8 months ago

I'm seeing new records in mongo that have null values, e.g.:

{
    "_id": {
        "$oid": "65c46ec8bbacb81f5f775562"
    },
    "id": "nmdc:wfmgan-11-4sc85678.1",
    "name": "Metagenome Annotation Analysis Activity for nmdc:wfmgan-11-4sc85678.1",
    "started_at_time": "2024-02-07T22:56:21.682913+00:00",
    "ended_at_time": "2024-02-08T06:03:41.175440+00:00",
    "was_informed_by": "nmdc:omprc-11-9mvz7z22",
    "used": null,
    "execution_resource": "NERSC-Perlmutter",
    "git_url": "https://github.com/microbiomedata/mg_annotation",
    "has_input": [
        "nmdc:dobj-11-5eb6v689"
    ],
    "type": "nmdc:MetagenomeAnnotationActivity",
    "has_output": [
        "nmdc:dobj-11-y3f47w18",
        "nmdc:dobj-11-e2r3ge57",
        "nmdc:dobj-11-7k80qv75",
        "nmdc:dobj-11-0z5rhk53",
        "nmdc:dobj-11-feses595",
        "nmdc:dobj-11-sb28nx57",
        "nmdc:dobj-11-vn9pwz37",
        "nmdc:dobj-11-9x2zaf16",
        "nmdc:dobj-11-9prnyr33",
        "nmdc:dobj-11-xx1tb938",
        "nmdc:dobj-11-72e7f129",
        "nmdc:dobj-11-y563v150",
        "nmdc:dobj-11-yawesx56",
        "nmdc:dobj-11-za36h087",
        "nmdc:dobj-11-vjfcne42",
        "nmdc:dobj-11-bsq83730",
        "nmdc:dobj-11-x2m4k008",
        "nmdc:dobj-11-jre1qx13",
        "nmdc:dobj-11-9kpz9641",
        "nmdc:dobj-11-k62fk420",
        "nmdc:dobj-11-ac70cp72",
        "nmdc:dobj-11-ntsb3x16",
        "nmdc:dobj-11-pcckzg89"
    ],
    "part_of": [
        "nmdc:omprc-11-9mvz7z22"
    ],
    "version": "v1.0.4",
    "qc_status": null,
    "qc_comment": null,
    "has_failure_categorization": [],
    "gold_analysis_project_identifiers": []
}
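If records like the one above were ever cleaned up in place, the update document could be built along these lines. This is only a sketch: the helper name `unset_update_for` is hypothetical, it looks at top-level keys only, and the pymongo call mentioned in the comment is an assumption about how it might be applied, not something this repo does.

```python
from typing import Any, Dict


def unset_update_for(doc: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Build a MongoDB $unset update for top-level keys that are null or []."""
    offending = [k for k, v in doc.items() if v is None or v == []]
    # An empty dict means there is nothing to unset for this document.
    return {"$unset": {k: "" for k in offending}} if offending else {}


# Abbreviated copy of the record above.
activity = {
    "_id": {"$oid": "65c46ec8bbacb81f5f775562"},
    "id": "nmdc:wfmgan-11-4sc85678.1",
    "used": None,
    "has_input": ["nmdc:dobj-11-5eb6v689"],
    "qc_status": None,
    "qc_comment": None,
    "has_failure_categorization": [],
    "gold_analysis_project_identifiers": [],
}
update = unset_update_for(activity)
print(update)
# Assumed usage with pymongo, skipped when update is empty:
#   collection.update_one({"id": activity["id"]}, update)
```

Skipping the call for an empty result matters because an update document with no operators is rejected.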

aclum commented 8 months ago

Closing per Michael's recommendation in favor of #55.