update aggregation scripts to use API to submit instead of pymongo

aclum commented 3 months ago

Justification: To migrate the Runtime to the cloud for increased stability, we need to transition code that interacts with Mongo directly to API queries.

blocked by: https://github.com/microbiomedata/nmdc-runtime/issues/611 - resolved, we can use json:submit now to enter these records.

Acceptance criteria: both generate_functional_agg.py and generate_metap_agg.py generate a request body which is submitted to a Runtime API endpoint instead of using pymongo insert statements.
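As a rough illustration of the acceptance criteria, here is a minimal sketch of building a request body and preparing it for submission over HTTP rather than calling a pymongo insert. Everything beyond the collection name and the json:submit name is an assumption: the base URL is a placeholder, the `/metadata/json:submit` path and the `{collection: [documents]}` body shape should be checked against the Runtime API docs, and `build_submit_body` / `build_submit_request` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: placeholder host; substitute the real Runtime API base URL.
API_BASE_URL = "https://runtime.example.org"


def build_submit_body(collection_name, documents):
    """Wrap aggregation documents in a {collection: [documents]} body.

    Assumption: the submit endpoint accepts documents keyed by collection name.
    """
    return {collection_name: list(documents)}


def build_submit_request(body, token, base_url=API_BASE_URL):
    """Prepare (but do not send) a POST request carrying the JSON body."""
    return urllib.request.Request(
        # Assumption: endpoint path; confirm against the Runtime API docs.
        url=f"{base_url}/metadata/json:submit",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

A script would then replace its pymongo insert call with something like `urllib.request.urlopen(build_submit_request(body, token))`, and check the response status before treating the records as stored.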

cc @sanjaypjana @eecavanna @shreddd @mbthornton-lbl

Subtasks:

26 @kheal will tackle using the API while refactoring the MetaP aggregations

eecavanna commented 3 months ago

Thanks for summarizing the situation and laying out the acceptance criteria.

I took a look at this today. Here are my English translations of all the database queries performed within generate_functional_agg.py, specifically.

Query 1

"Get all the distinct metagenome_annotation_id values among all documents in the functional_annotation_agg collection."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_functional_agg.py#L120

Query 2

"For each document in the metagenome_annotation_activity_set collection..."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_functional_agg.py#L121

Query 3

"Insert these documents into the data_object_set collection."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_functional_agg.py#L134

Query 4

"Get the document having this id value, from the data_object_set collection."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_functional_agg.py#L90

Finally, here are the aliases that appear in the list of queries above.

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_functional_agg.py#L52-L54
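To make the read-only queries above (Queries 1, 2, and 4) concrete, here is a sketch that expresses them as MongoDB command documents, the kind of body a queries:run-style endpoint might accept. The `distinct` and `find` command shapes are standard MongoDB database commands. Whether the Runtime API accepts each of them is an assumption that has not been tested here, and the helper names are hypothetical.

```python
def distinct_command(collection, key):
    """Query 1: distinct values of `key` across all documents in `collection`."""
    return {"distinct": collection, "key": key}


def find_all_command(collection):
    """Query 2: iterate over every document in `collection`."""
    return {"find": collection, "filter": {}}


def find_one_command(collection, id_value):
    """Query 4: fetch the single document whose `id` equals `id_value`."""
    return {"find": collection, "filter": {"id": id_value}, "limit": 1}


# Command documents for generate_functional_agg.py, using the collection
# and field names from the query descriptions above.
QUERY_1 = distinct_command("functional_annotation_agg", "metagenome_annotation_id")
QUERY_2 = find_all_command("metagenome_annotation_activity_set")
```

Query 3 is an insertion, so it is left out here; it would go through a submit-style endpoint rather than a read command.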

eecavanna commented 3 months ago

Similarly, here are my English translations of all the database queries performed within generate_metap_agg.py. They mirror the ones in the other file (i.e. same operations, different operands).

Query 1

"Get all the distinct metaproteomic_analysis_id values among all documents in the metap_gene_function_aggregation collection."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_metap_agg.py#L162

Query 2

"For each document in the metaproteomics_analysis_activity_set collection..."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_metap_agg.py#L165

Query 3

"Insert these documents into the metap_gene_function_aggregation collection."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_metap_agg.py#L186

Query 4

"Get the document having this id value, from the data_object_set collection."

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_metap_agg.py#L92

Finally, here are the aliases that appear in the list of queries above.

https://github.com/microbiomedata/nmdc-aggregator/blob/3abf6ed3df57ebb220f5d17f0c430283937c7181/generate_metap_agg.py#L56-L58
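Since the two files perform the same operations on different operands, one way to picture the overlap is a small parameter object per script. This is only a sketch: `AggregationTarget` and `describe_queries` are hypothetical names, and it assumes the insert in both files targets the aggregation collection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregationTarget:
    """Operands that differ between the two aggregation scripts."""
    agg_collection: str       # collection receiving the aggregation records
    id_field: str             # field identifying the source activity
    activity_collection: str  # collection of activities to aggregate


FUNCTIONAL = AggregationTarget(
    agg_collection="functional_annotation_agg",
    id_field="metagenome_annotation_id",
    activity_collection="metagenome_annotation_activity_set",
)

METAP = AggregationTarget(
    agg_collection="metap_gene_function_aggregation",
    id_field="metaproteomic_analysis_id",
    activity_collection="metaproteomics_analysis_activity_set",
)


def describe_queries(target):
    """List the four operations for one script, in the numbering used above."""
    return [
        f"distinct {target.id_field} in {target.agg_collection}",
        f"find all in {target.activity_collection}",
        # Assumption: inserts go to the aggregation collection in both files.
        f"insert into {target.agg_collection}",
        "find one by id in data_object_set",
    ]
```

With this shape, any API-based replacement for the four queries would only need to be written once and called with `FUNCTIONAL` or `METAP`.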

eecavanna commented 3 months ago

At this point, I'm wondering whether the Runtime API already provides the endpoints necessary for performing those operations. If it does, I think this is ready for implementation.

aclum commented 3 months ago

Query 4 inserts into the aggregation tables (functional_annotation_agg and metap_gene_function_aggregation), not data_object_set.

The blocking ticket linked in the description, https://github.com/microbiomedata/nmdc-runtime/issues/611, prevents us from using json:submit to add documents via the API. We might be able to use queries:run instead; I haven't tested that, but it would be nice to use an endpoint that has more validation. Additionally, metap_gene_function_aggregation is not defined in the schema, so I believe this rules out using any existing endpoint at this time.

eecavanna commented 3 months ago

> query 4 inserts into the aggregation tables (functional_annotation_agg and metap_gene_function_aggregation) not data_object_set.

I think you are referring to the query I referred to as "Query 3." In both files, the query I referred to as "Query 4" is a find_one and not an insertion.

The numbering I used was arbitrary (my objective was to catalog the queries, not so much to convey the algorithm) and might not match the order in which the queries are performed.

eecavanna commented 3 months ago

I'll add a topic to the agenda for tomorrow's infrastructure meeting, about addressing the things (in the Runtime) that are—or may be—blocking this.
