microbiomedata / nmdc_automation

Prototype automation

Missing keys in activity records #251

Closed aclum closed 1 month ago

aclum commented 2 months ago

Some keys in the activity records are not being populated in Mongo. The example below is for MAGs, but this appears to be a general problem.

For example, binned_contig_num and mags_list are missing for the MagsAnalysisActivity.

The last records that contain this information in the Mongo prod documents are from February 2024.

Shane said this change is likely related to the shadow schema classes and pointed to this for loop: https://github.com/microbiomedata/nmdc_automation/blob/acee08ecf776c0c0a6de07549f3[…]28e2c0ac02c41/nmdc_automation/workflow_automation/watch_nmdc.py

In discussions with Michael, the first place to look is the create_activity_record function: https://github.com/microbiomedata/nmdc_automation/blob/52917d816ee710c036855a8273657341d1e644d3/nmdc_automation/workflow_automation/wfutils.py#L306

For an example MAGs workflow, see /pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/9492a397-eb30-472b-9d3b-b44b676f4652/call-finish_mags/execution. The code should check the stats file nmdc_wfmag-11-g7msr323.1_mags_stats.json in that directory. In this case, binned_contig_num should exist in the record with a value of 22281.
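
A minimal sketch of that check, assuming the stats file is a flat JSON object whose top-level keys match the record keys. The helper names (`load_stats`, `find_problems`) are hypothetical and are not part of nmdc_automation; only the key names and the 22281 value come from this issue.

```python
import json

# Keys listed for MAGs in workflows.yaml (see the comment below in this thread).
EXPECTED_KEYS = (
    "binned_contig_num",
    "input_contig_num",
    "low_depth_contig_num",
    "mags_list",
    "too_short_contig_num",
    "unbinned_contig_num",
)


def load_stats(path):
    """Read a *_mags_stats.json file; assumes it holds one flat JSON object."""
    with open(path) as fh:
        return json.load(fh)


def find_problems(stats, record, keys=EXPECTED_KEYS):
    """Report keys that are in the stats file but absent or different in the record."""
    problems = {}
    for key in keys:
        if key not in stats:
            continue  # nothing to compare against
        if key not in record:
            problems[key] = "missing from record"
        elif record[key] != stats[key]:
            problems[key] = "value differs"
    return problems
```

Calling `find_problems(load_stats(path), record)` with a record fetched from Mongo should flag binned_contig_num as missing if the record was written without it.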

cc @scanon

mbthornton-lbl commented 1 month ago

@aclum @scanon Where do we find the mag_stats.json file?

For MAGs, workflows.yaml specifies this:

    Workflow Execution:
      name: "Metagenome Assembled Genomes Analysis for {id}"
      type: nmdc:MagsAnalysis
      binned_contig_num: "{outputs.final_stats_json.binned_contig_num}"
      input_contig_num: "{outputs.final_stats_json.input_contig_num}"
      low_depth_contig_num: "{outputs.final_stats_json.low_depth_contig_num}"
      mags_list: "{outputs.final_stats_json.mags_list}"
      too_short_contig_num: "{outputs.final_stats_json.too_short_contig_num}"
      unbinned_contig_num: "{outputs.final_stats_json.unbinned_contig_num}"

The metaMAGs workflow has this in its Output files:

|-- project_name_mags_stats.json

Where do we find the stats.json data? I assume that the WorkflowExecution entries like |-- project_name_mags_stats.json map to things in the outputs records in the metadata returned by Cromwell from the /metadata endpoint.
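
To make that assumption concrete, here is a rough sketch of how a `{outputs.final_stats_json.<key>}` entry could be resolved once the path of the stats file is known. This is not the code in wfutils.py. The function names and the `output_paths` mapping (output name to file path) are hypothetical, and how that path is actually obtained from the Cromwell metadata is exactly the open question here.

```python
import json
import re

# Matches a value that is exactly "{outputs.<output_name>.<key>}".
PLACEHOLDER = re.compile(r"^\{outputs\.(\w+)\.(\w+)\}$")


def resolve_field(template, output_paths):
    """Return the value a placeholder points at, or the template unchanged.

    output_paths is an assumed mapping such as
    {"final_stats_json": "/path/to/..._mags_stats.json"}.
    """
    if not isinstance(template, str):
        return template
    match = PLACEHOLDER.match(template)
    if match is None:
        return template  # e.g. the name and type entries
    output_name, key = match.groups()
    with open(output_paths[output_name]) as fh:
        return json.load(fh)[key]


def resolve_fields(spec, output_paths):
    """Resolve every entry of a Workflow Execution block."""
    return {field: resolve_field(value, output_paths) for field, value in spec.items()}
```

Entries that are not a whole-string `{outputs...}` placeholder, such as `name` with its `{id}`, are passed through untouched in this sketch. A missing output name or key raises `KeyError` instead of silently dropping the field, which would make the behavior reported in this issue easier to spot.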

It would be helpful to have examples of the stats.json file and of a Cromwell metadata API response.

ssarrafan commented 1 month ago

@aclum @scanon @mbthornton-lbl Any chance this can be closed this sprint?