microbiomedata / nmdc-aggregator

Scripts that periodically aggregate data related to KEGG search

update `generate_functional_agg.py` to also include `metatranscriptome_annotation_set` #8

Closed (aclum closed this 2 months ago)

aclum commented 4 months ago

This PR adds new Database slots for metatranscriptomes. We need to make sure the metatranscriptome annotation records are included in the KEGG aggregation results.

cc @eecavanna

depends on https://github.com/microbiomedata/nmdc_automation/issues/195

eecavanna commented 4 months ago

Hi @aclum, the Issue description says "This PR". Did you mean to link to a PR here?

eecavanna commented 4 months ago

I can help with the "update image on Spin" (a.k.a. build and publish a new container image to GHCR and configure a Spin workload to run it) portions of this task.

aclum commented 4 months ago

Yes, sorry, link updated in the description.

aclum commented 3 months ago

@mbthornton-lbl to start on this this sprint and pair/hand off to @eecavanna for the SPIN portion.

Example for a unit test:

Example GFF:

nmdc:wfmtan-11-5rqhd817.1_0000001   Prodigal v2.6.3_patched CDS 2931    5588    340.0   +   0   ID=nmdc:wfmgan-11-5rqhd817.1_0000001_2931_5588;translation_table=11;start_type=ATG;product=O-antigen biosynthesis protein;product_source=KO:K20444;cath_funfam=3.20.20.80,3.90.550.10;cog=COG0463;ko=KO:K20444;ec_number=EC:2.4.1.-;pfam=PF00535,PF02836;superfamily=51445,53448
nmdc:wfmgat-11-5rqhd817.1_0000001   Prodigal v2.6.3_patched CDS 5585    7381    320.4   +   0   ID=nmdc:wfmgan-11-5rqhd817.1_0000001_5585_7381;translation_table=11;start_type=ATG;product=ATP-binding cassette, subfamily B, bacterial;product_source=KO:K06147;cath_funfam=1.20.1560.10,3.40.50.300;cog=COG1132;ko=KO:K06147;pfam=PF00005,PF00664;smart=SM00382;superfamily=52540,90123

The example GFF above is expected to insert two documents into functional_annotation_agg: one for K20444 and one for K06147. Expected new Mongo records:

[{"metagenome_annotation_id":"nmdc:wfmgan-11-5rqhd817.1",
"gene_function_id":"KEGG.ORTHOLOGY:K20444",
"count":1},
{"metagenome_annotation_id":"nmdc:wfmgan-11-5rqhd817.1",
"gene_function_id":"KEGG.ORTHOLOGY:K06147",
"count":1}
]
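The expected records above could be checked with a unit test along these lines. This is only a minimal sketch, not the code in `generate_functional_agg.py`: the function name `count_kegg_terms`, the assumption that KO terms are read from the `ko=` attribute of a tab-separated GFF line, and the abbreviated attribute strings in `EXAMPLE_GFF` are all illustrative assumptions.

```python
from collections import Counter


def count_kegg_terms(gff_lines, annotation_id):
    """Count KO terms found in GFF lines and return aggregation-style records.

    Hypothetical helper for illustration; the real aggregator may obtain
    its annotations differently.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        # The attributes column is the last tab-separated field.
        attributes = line.split("\t")[-1]
        for attribute in attributes.split(";"):
            key, _, value = attribute.partition("=")
            if key == "ko":
                for term in value.split(","):
                    # Convert the "KO:" prefix to the "KEGG.ORTHOLOGY:" prefix.
                    counts[term.replace("KO:", "KEGG.ORTHOLOGY:")] += 1
    # Counter keeps insertion order, so records follow first appearance in the GFF.
    return [
        {
            "metagenome_annotation_id": annotation_id,
            "gene_function_id": term,
            "count": count,
        }
        for term, count in counts.items()
    ]


# Abbreviated versions of the two example GFF lines (attributes shortened).
EXAMPLE_GFF = [
    "\t".join([
        "nmdc:wfmtan-11-5rqhd817.1_0000001", "Prodigal v2.6.3_patched", "CDS",
        "2931", "5588", "340.0", "+", "0",
        "ID=nmdc:wfmgan-11-5rqhd817.1_0000001_2931_5588;product_source=KO:K20444;ko=KO:K20444",
    ]),
    "\t".join([
        "nmdc:wfmgat-11-5rqhd817.1_0000001", "Prodigal v2.6.3_patched", "CDS",
        "5585", "7381", "320.4", "+", "0",
        "ID=nmdc:wfmgan-11-5rqhd817.1_0000001_5585_7381;product_source=KO:K06147;ko=KO:K06147",
    ]),
]

records = count_kegg_terms(EXAMPLE_GFF, "nmdc:wfmgan-11-5rqhd817.1")
```

Only the `ko=` attribute is counted here, so the matching `product_source=KO:...` value on each line does not double the count.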

An example functional test would be to use json:submit to submit a metatranscriptome_annotation_set record and the corresponding data_object_set records to runtime:dev, including a data_object_set record with a data_object_type of Functional Annotation GFF that contains KEGG terms.

eecavanna commented 3 months ago

Hi @mbthornton-lbl, once you're ready to test this out in the development environment on Spin, you can @-mention me in a comment here that says that; in response to which I'll go through these steps to create a GitHub Release of this repo (which will automatically create and publish a new container image to GitHub Container Registry) and then update the associated workload in the development environment so that it runs that new container image; at which point I'll hand things back off to you to test it in that environment. I think that's the longest sentence I've written all year!

aclum commented 2 months ago

@eecavanna emptied out mongo dev last night in https://github.com/microbiomedata/infra-admin/issues/120, and it repopulated today with 25 million records, which looks correct. There was an issue with the dev mongo to dev postgres ingest, but @naglepuff restarted that just a few minutes ago. I'd like to make sure this works properly before we apply this to production.

aclum commented 2 months ago

@naglepuff said the ingest on dev went smoothly so we are good to update the image in SPIN prod. @chienchi if you don't have permission to do that please coordinate with @eecavanna

eecavanna commented 2 months ago

Hi @chienchi, the process is exactly the same in the production environment (namespace: nmdc) as in the development environment (namespace: nmdc-dev), except:

I assume whoever created those deployments named them differently by mistake.

Notes for future reference

chienchi commented 2 months ago

Thanks for the instructions. I have deployed the new version to the production environment.

[image]

eecavanna commented 2 months ago

Great! Looks good to me (thanks for including a screenshot).

In that case, I will do the task in this follow-on ticket (https://github.com/microbiomedata/infra-admin/issues/123), which is to empty out the functional_annotation_agg collection in the production Mongo database. I'll leave it to y'all to close this ticket when you want to.