microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Cannot determine origin of Mongo index #769

Open eecavanna opened 2 days ago

eecavanna commented 2 days ago

As of schema 11.0.3, the functional_annotation_agg collection has a compound index on the field pair (gene_function_id, metagenome_annotation_id). I searched every repo in our GitHub org and found no code that creates that index, so I'm assuming someone created it manually at some point.

This led me to wonder (things like):

I know the Runtime creates some; for example:

https://github.com/microbiomedata/nmdc-runtime/blob/17cf31332aee0852d20137aec7b8b2d3398caed0/nmdc_runtime/api/main.py#L299

https://github.com/microbiomedata/nmdc-runtime/blob/17cf31332aee0852d20137aec7b8b2d3398caed0/nmdc_runtime/site/ops.py#L1132

Note: The specific index I mentioned above will cease to exist within a few days (as part of the migration from schema 11.0.3 to 11.1.0). I'm using it here to exemplify a concept.
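One way to spot indexes like this one (present in the database but not created by any code) is to compare what the database reports against what the code declares. The sketch below assumes the input has the shape returned by pymongo's `Collection.index_information()`, a dict mapping index name to a spec whose `"key"` entry is a list of `(field, direction)` pairs. The `DECLARED` list and the database name in the usage comment are hypothetical placeholders, not the Runtime's actual declarations.

```python
from typing import Iterable, List, Mapping, Sequence, Tuple

KeyPattern = Sequence[Tuple[str, int]]


def undeclared_indexes(
    index_info: Mapping[str, Mapping],
    declared: Iterable[KeyPattern],
) -> List[str]:
    """Return the names of indexes whose key pattern is not in `declared`.

    `index_info` is expected to look like the result of pymongo's
    `Collection.index_information()`. The default `_id_` index is skipped
    because MongoDB creates it automatically.
    """
    declared_patterns = {tuple(tuple(pair) for pair in keys) for keys in declared}
    extras = []
    for name, spec in index_info.items():
        if name == "_id_":
            continue
        pattern = tuple((field, direction) for field, direction in spec["key"])
        if pattern not in declared_patterns:
            extras.append(name)
    return sorted(extras)


# Hypothetical declaration, for illustration only.
DECLARED = [[("metagenome_annotation_id", 1)]]

# Usage against a live database (requires pymongo; database name is assumed):
#   from pymongo import MongoClient
#   coll = MongoClient()["nmdc"]["functional_annotation_agg"]
#   print(undeclared_indexes(coll.index_information(), DECLARED))
```

Anything the function returns is a candidate for "created manually", though it cannot say who created it or when.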

dwinston commented 1 day ago

It was me. I wanted to create a unique=True compound index to check whether that compound key could serve as a surrogate-yet-still-semantic primary key in lieu of id, which that collection didn't have, as part of my exploration to address https://github.com/microbiomedata/nmdc-runtime/issues/414. I ultimately settled on using MongoDB's native (sortable, unique) _id field to implement pagination for that collection.

Indeed, the compound index is not used in any production capacity.
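For readers unfamiliar with paginating on `_id`, here is a minimal sketch of the general technique (often called keyset pagination): each page asks for documents whose `_id` is greater than the last one seen, sorted by `_id`. This is not the Runtime's actual implementation; the function names are hypothetical, and `collection` is assumed to be a pymongo `Collection`.

```python
from typing import Any, Dict, List, Optional, Tuple


def page_filter(last_id: Optional[Any] = None) -> Dict[str, Any]:
    """Build the query filter for the page that follows `last_id`.

    With no `last_id`, the filter is empty and the first page is returned.
    """
    if last_id is None:
        return {}
    return {"_id": {"$gt": last_id}}


def fetch_page(
    collection: Any, page_size: int, last_id: Optional[Any] = None
) -> Tuple[List[dict], Optional[Any]]:
    """Fetch one page of documents ordered by `_id`.

    Returns the documents and the `_id` to pass as `last_id` for the next
    page (None when the page is empty, i.e. there is nothing left).
    """
    cursor = collection.find(page_filter(last_id)).sort("_id", 1).limit(page_size)
    docs = list(cursor)
    next_last_id = docs[-1]["_id"] if docs else None
    return docs, next_last_id
```

Because `_id` is unique and always indexed, this approach needs no additional index, which fits the remark that the compound index is unused in production.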

eecavanna commented 1 day ago

Thanks, @dwinston. In terms of how it was created, was it created manually via mongosh (or equivalent) or was it created by some code that exists in the repo?

dwinston commented 1 day ago

I should add that, at the time of that work on #414, the functional_annotation_agg collection was too large to include in my workflow for keeping a local cache of production schema collections for sandboxed local db experimentation during development. The temptation to "experiment" on the production database (even "non-destructively", as is the case with index creation) has since been remedied. :)

dwinston commented 1 day ago

@eecavanna it was created manually via a direct (py)mongo command using privileged credentials (perhaps in a Jupyter notebook where I was prototyping, perhaps via the Studio 3T GUI; I forget).
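The exact command is not recorded in this thread. In pymongo, a call of roughly the shape `collection.create_index([("gene_function_id", 1), ("metagenome_annotation_id", 1)], unique=True)` would build such an index, and it fails if any two documents share the same pair of values. The underlying question (is the pair unique?) can also be answered read-only, for example over a local copy or sample of the documents. The helper below is a hypothetical sketch of that check in plain Python.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Sequence, Tuple


def duplicate_keys(
    docs: Iterable[Mapping[str, Any]], fields: Sequence[str]
) -> List[Tuple[Any, ...]]:
    """Return every combination of `fields` values that occurs more than once.

    A missing field is treated as None, similar to how a MongoDB unique
    index treats a missing field as null. An empty result means the fields
    could serve as a unique compound key for these documents.
    """
    counts = Counter(tuple(doc.get(field) for field in fields) for doc in docs)
    return [key for key, count in counts.items() if count > 1]


# Illustrative documents; the values are made up.
sample = [
    {"gene_function_id": "KEGG.ORTHOLOGY:K00001", "metagenome_annotation_id": "a1"},
    {"gene_function_id": "KEGG.ORTHOLOGY:K00001", "metagenome_annotation_id": "a2"},
    {"gene_function_id": "KEGG.ORTHOLOGY:K00002", "metagenome_annotation_id": "a1"},
]
pair = ["gene_function_id", "metagenome_annotation_id"]
print(duplicate_keys(sample, pair))  # an empty list means no duplicates found
```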

eecavanna commented 1 day ago

OK, I understand now—thanks!

dwinston commented 1 day ago

While it's top-of-mind for me: the Runtime declares the collection-qualified slots to index at https://github.com/microbiomedata/nmdc-runtime/blob/c4c4a8d08f88c7fed71d693c7d45c7cea4854db9/nmdc_runtime/api/models/util.py#L85. That declaration feeds https://github.com/microbiomedata/nmdc-runtime/blob/c4c4a8d08f88c7fed71d693c7d45c7cea4854db9/nmdc_runtime/api/main.py#L351, which runs on Runtime API init: https://github.com/microbiomedata/nmdc-runtime/blob/c4c4a8d08f88c7fed71d693c7d45c7cea4854db9/nmdc_runtime/api/main.py#L390
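As a rough illustration of that pattern (a declaration of slots per collection, turned into index creation at startup), here is a minimal sketch. It is not a copy of the linked code: `SLOTS_TO_INDEX`, `index_specs`, and `ensure_declared_indexes` are hypothetical names, the collection and slot names are examples, and `db` is assumed to be a pymongo `Database`.

```python
from typing import Any, Dict, List, Set, Tuple

# Hypothetical declaration: collection name -> slots that should be indexed.
SLOTS_TO_INDEX: Dict[str, Set[str]] = {
    "biosample_set": {"id", "part_of"},
    "study_set": {"id"},
}

IndexSpec = Tuple[str, List[Tuple[str, int]]]


def index_specs(slots_to_index: Dict[str, Set[str]]) -> List[IndexSpec]:
    """Expand the declaration into (collection, key pattern) pairs.

    Each slot becomes a single-field ascending index. The output is sorted
    so that startup behaves the same way on every run.
    """
    return [
        (collection_name, [(slot, 1)])
        for collection_name in sorted(slots_to_index)
        for slot in sorted(slots_to_index[collection_name])
    ]


def ensure_declared_indexes(db: Any, slots_to_index: Dict[str, Set[str]]) -> int:
    """Create every declared index and return how many were requested.

    pymongo's `create_index` does nothing when an identical index already
    exists, so calling this on every startup is safe.
    """
    specs = index_specs(slots_to_index)
    for collection_name, keys in specs:
        db[collection_name].create_index(keys)
    return len(specs)
```

Keeping the declaration in code means every index created this way can be traced back to a commit, which is exactly what was missing for the compound index discussed above.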