Open aclum opened 3 months ago
Thanks for summarizing the situation and laying out the acceptance criteria.
I took a look at this today. Here are my English translations of all the database queries performed within generate_functional_agg.py, specifically.
Query 1: "Get all the distinct metagenome_annotation_id values among all documents in the functional_annotation_agg collection."
Query 2: "For each document in the metagenome_annotation_activity_set collection..."
Query 3: "Insert these documents into the data_object_set collection."
Query 4: "Get the document having this id value, from the data_object_set collection."
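As a rough illustration, the four operations listed above (in the order given) could look like this with pymongo-style calls. This is only a sketch and not the code in generate_functional_agg.py: the function names are hypothetical, `db` is assumed to be a pymongo `Database` (or anything exposing the same `distinct`, `find`, `insert_many`, and `find_one` methods), and collection and field names are passed in rather than hard-coded.

```python
def get_aggregated_ids(db, agg_collection, id_field):
    """Distinct values of `id_field` across all documents in the aggregation collection."""
    return set(db[agg_collection].distinct(id_field))


def iter_activities(db, activity_collection):
    """Iterate over every document in the activity collection."""
    return db[activity_collection].find({})


def insert_documents(db, target_collection, documents):
    """Insert a batch of documents and return how many were passed along."""
    documents = list(documents)
    if documents:  # pymongo's insert_many rejects an empty list, so skip it
        db[target_collection].insert_many(documents)
    return len(documents)


def get_data_object(db, data_object_id):
    """Fetch the data object with this `id` value, or None if there is none."""
    return db["data_object_set"].find_one({"id": data_object_id})
```

For example, `get_aggregated_ids(db, "functional_annotation_agg", "metagenome_annotation_id")` would correspond to the first translation above.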
Finally, here are the aliases that appear in the list of queries above.
Similarly, here are my English translations of all the database queries performed within generate_metap_agg.py. They mirror the ones in the other file (i.e. same operations, different operands).
Query 1: "Get all the distinct metaproteomic_analysis_id values among all documents in the metap_gene_function_aggregation collection."
Query 2: "For each document in the metaproteomics_analysis_activity_set collection..."
Query 3: "Insert these documents into the metap_gene_function_aggregation collection."
Query 4: "Get the document having this id value, from the data_object_set collection."
Finally, here are the aliases that appear in the list of queries above.
At this point, I'm wondering whether the Runtime API already provides the endpoints necessary for performing those operations. If it does, I think this is ready for implementation.
Query 4 inserts into the aggregation tables (functional_annotation_agg and metap_gene_function_aggregation), not data_object_set.
The blocking ticket linked in the description, https://github.com/microbiomedata/nmdc-runtime/issues/611, prevents us from using json:submit to add documents via the API. We could possibly use queries:run instead; I haven't tested that, but it would be nice to use an endpoint with more validation. Additionally, metap_gene_function_aggregation is not defined in the schema, so I believe none of the existing endpoints can be used for it at this time.
> Query 4 inserts into the aggregation tables (functional_annotation_agg and metap_gene_function_aggregation), not data_object_set.
I think you are referring to the query I referred to as "Query 3." In both files, the query I referred to as "Query 4" is a find_one and not an insertion.
The numbering I used was arbitrary (my objective was to catalog the queries, not so much to convey the algorithm) and might not match the order in which the queries are performed.
I'll add a topic to the agenda for tomorrow's infrastructure meeting, about addressing the things (in the Runtime) that are—or may be—blocking this.
Justification: In order to migrate runtime to the cloud for increased stability, we need to transition code that interacts with mongo directly to API queries.
blocked by: https://github.com/microbiomedata/nmdc-runtime/issues/611 - resolved, we can use json:submit now to enter these records.
Acceptance criteria: both generate_functional_agg.py and generate_metap_agg.py generate a request body which is submitted to a runtime API endpoint instead of using pymongo insert statements.
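One way to picture the acceptance criteria: each script collects its aggregation rows into a JSON body and hands that to the Runtime API instead of calling pymongo. The sketch below only builds the request and does not send it. The `/metadata/json:submit` path, the `{collection_name: [documents]}` body shape, the bearer-token header, and both function names are assumptions made for illustration, so they should be checked against the Runtime's API documentation.

```python
import json
import urllib.request


def build_submit_body(collection_name, documents):
    """Group documents under their collection name (assumed body shape)."""
    return {collection_name: list(documents)}


def build_submit_request(base_url, token, body):
    """Build, but do not send, a POST request carrying the JSON body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/metadata/json:submit",  # assumed path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )
```

A caller would pass the result to `urllib.request.urlopen(...)` (or use an HTTP client of their choice) where the scripts currently call pymongo's insert methods.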
cc @sanjaypjana @eecavanna @shreddd @mbthornton-lbl
Subtasks:
26 @kheal will tackle using the API while refactoring the MetaP aggregations