microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Minter: Using multiple minter instances and discarding their databases can result in ID collisions #615

Open eecavanna opened 1 month ago

eecavanna commented 1 month ago

Background

We use a minter to generate IDs for Mongo documents. The minter is part of the Runtime. We have multiple environments (e.g. production, development, Berkeley), each of which has its own Runtime and, therefore, minter. Each minter keeps track of the IDs it has generated (i.e. consumed) in its Mongo database.

Because the minter is coupled to the Runtime (the former is part of the latter), team members who want to mint IDs for classes that aren't defined in the schema currently used by the production Runtime (e.g. when writing database migration scripts) sometimes mint those IDs in a non-production environment and later insert the documents having those IDs into the production database.

Problem

There are two problems:

  1. We routinely discard some of our non-production databases (e.g. development, Berkeley), replacing them with updated dumps of the production database. For example, we replace the development database as part of the standard monthly release process. As a result, the minters in those environments lose track of the IDs they have generated, so they may later generate an ID that they have already generated before.

  2. Each minter is configured with an environment variable named MINTING_SERVICE_ID, whose value is incorporated into the IDs generated by that minter. I assume the person who designed the minter intended for people to populate it with a string that is not used in any other environment, but I don't think anything prevents someone from using the same MINTING_SERVICE_ID value in multiple environments (i.e. the values of MINTING_SERVICE_ID are not themselves "minted" by a single authority). If someone were to do so, two minters could generate identical IDs (since those minters would be using different Mongo databases to keep track of the IDs they have generated).
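
Both failure modes can be reproduced with a toy model. The sketch below is not the Runtime's minter: the `ToyMinter` class, the ID format, and the use of a Python `set` as a stand-in for each environment's Mongo database are all assumptions made for illustration. The only property it shares with the description above is that each minter checks uniqueness against its own database alone.

```python
class ToyMinter:
    """Hypothetical minter that checks uniqueness only against its own database."""

    def __init__(self, service_id, db):
        self.service_id = service_id  # stand-in for MINTING_SERVICE_ID
        self.db = db                  # stand-in for this environment's Mongo database

    def mint(self, typecode):
        # Walk a deterministic candidate sequence until an ID unknown to *this* database is found.
        n = 1
        while True:
            candidate = f"nmdc:{typecode}-{self.service_id}-{n:06d}"  # made-up ID format
            if candidate not in self.db:
                self.db.add(candidate)
                return candidate
            n += 1


# Problem 1: the development database is discarded and replaced with a production dump.
prod_dump = set()                       # production never saw the ID minted in development
dev = ToyMinter("11", set())
id_before_reset = dev.mint("bsm")       # ID is later inserted into production by hand
dev.db = set(prod_dump)                 # monthly release: dev database replaced by the dump
id_after_reset = dev.mint("bsm")        # the minter has forgotten the first ID

# Problem 2: two environments are configured with the same MINTING_SERVICE_ID.
berkeley = ToyMinter("12", set())
other_env = ToyMinter("12", set())      # same service ID, separate database
id_from_berkeley = berkeley.mint("bsm")
id_from_other_env = other_env.mint("bsm")

print(id_before_reset == id_after_reset)        # True: collision after the reset
print(id_from_berkeley == id_from_other_env)    # True: collision across environments
```

In both cases, each minter behaves correctly with respect to the database it can see; the collision only shows up once the documents are combined in the production database.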

Task

Related

https://github.com/microbiomedata/nmdc-runtime/issues/484 - A ticket about extracting the Runtime into a standalone service

CC: @dwinston , @PeopleMakeCulture

dwinston commented 1 month ago

Given these problems, I think it increasingly makes sense to extract some minter functionality into a single standalone service.

I see a couple of potential routes we can take:

  1. De-couple (a) syntactic construction of IDs from (b) uniqueness-ensuring persistence of IDs. That is, continue to have each runtime instance generate IDs as a function of the schema installed for the instance, but require runtime instances to synchronously register the IDs they mint with a singular ID registry service in order to ensure global uniqueness and persistence. In this world, the registry service can assign MINTING_SERVICE_ID values for each runtime instance across all environments.

  2. Migrate all minting functionality to a singular central service (as described by #484).
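
As a rough illustration of route 1, here is a minimal in-memory sketch. The names `IdRegistry`, `RegisteringMinter`, `assign_service_id`, and `register`, as well as the ID format, are hypothetical and not part of the Runtime. A real registry would presumably be a network service backed by a durable store that enforces uniqueness; a lock-protected `set` stands in for that here.

```python
import threading


class IdRegistry:
    """Hypothetical central registry: the single authority for service IDs and minted IDs."""

    def __init__(self):
        self._ids = set()
        self._service_ids = {}
        self._lock = threading.Lock()

    def assign_service_id(self, environment):
        # Each environment gets exactly one service ID, and no two environments share one.
        with self._lock:
            if environment not in self._service_ids:
                self._service_ids[environment] = f"{len(self._service_ids) + 1:02d}"
            return self._service_ids[environment]

    def register(self, new_id):
        # Atomic check-and-insert: returns False if the ID was already registered.
        with self._lock:
            if new_id in self._ids:
                return False
            self._ids.add(new_id)
            return True


class RegisteringMinter:
    """Hypothetical Runtime-side minter: builds IDs locally, registers them centrally."""

    def __init__(self, environment, registry):
        self.registry = registry
        self.service_id = registry.assign_service_id(environment)

    def mint(self, typecode):
        # (a) syntactic construction happens here; (b) uniqueness is decided by the registry.
        n = 1
        while True:
            candidate = f"nmdc:{typecode}-{self.service_id}-{n:06d}"  # made-up ID format
            if self.registry.register(candidate):
                return candidate
            n += 1


registry = IdRegistry()
dev = RegisteringMinter("development", registry)
first = dev.mint("bsm")

# Discarding the development database no longer matters: the minter holds no ID state of its own.
dev = RegisteringMinter("development", registry)
second = dev.mint("bsm")

berkeley = RegisteringMinter("berkeley", registry)
third = berkeley.mint("bsm")

print(first, second, third)
```

Because the registration call is synchronous, a Runtime instance in this sketch cannot hand out an ID until the registry has accepted it, which is what makes the uniqueness guarantee global rather than per-environment.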