Currently, the DCR Monitor / DCR App works as follows:
There are a few problems with this design.
Config and CloudProvider have become very shallow and wide modules that essentially every other module depends on.
CloudProvider also mixes different levels of abstraction.
The current Monitor is a cron job that runs every minute, but it has no knowledge of the Jobs currently in the database. Instead, it "infers" each job by querying GCP for live instances and parsing their parameters. This design is error-prone and hard to understand.
Because the monitor does not have access to the database, it has to call the UpdateJob endpoint in the API so that the API can look the job up again and update its record. This ends up mixing internal and external endpoints.
As a result, KanikoService, JobService, and CloudProvider currently do all the heavy lifting while being tightly coupled to one another through their dependencies. This is not very extensible.
Thus, I propose the following changes (See Proposed in the diagram):
Change the monitor into a "reconciler" that is responsible for coordinating image builds, job submission, and job state updates. The user-facing JobService will be responsible only for direct interaction with the user.
Both JobReconciler and JobService will share the same database as the source of truth.
JobReconciler will hold a TEEBackend and an ImageBuilder, which are interfaces for managing TEE instances and building images. This makes the TEE backend and the image builder extensible, and makes the code easier to test through dependency injection.
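
To make the proposed split more concrete, here is a minimal sketch of how the reconciler and its two interfaces could fit together. Only the names JobReconciler, TEEBackend, and ImageBuilder come from the proposal; the method names (build, launch, status), the job states, the status strings, and the in-memory dict standing in for the shared database are all assumptions made for illustration, not the actual DCR code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Protocol


class JobState(Enum):
    # Hypothetical job states; the real schema may differ.
    PENDING = "pending"
    IMAGE_READY = "image_ready"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    id: str
    state: JobState = JobState.PENDING
    image: Optional[str] = None
    instance: Optional[str] = None


class ImageBuilder(Protocol):
    """Builds the image for a job (for example, a Kaniko-backed implementation)."""

    def build(self, job_id: str) -> str: ...


class TEEBackend(Protocol):
    """Manages TEE instances (for example, a GCP-backed implementation)."""

    def launch(self, job_id: str, image: str) -> str: ...

    def status(self, instance: str) -> str: ...


class JobReconciler:
    """Advances each job by one step per pass, using the job store as the source of truth."""

    def __init__(self, jobs: Dict[str, Job], builder: ImageBuilder, backend: TEEBackend) -> None:
        self.jobs = jobs  # stand-in for the database shared with JobService
        self.builder = builder
        self.backend = backend

    def reconcile_once(self) -> None:
        for job in self.jobs.values():
            if job.state is JobState.PENDING:
                job.image = self.builder.build(job.id)
                job.state = JobState.IMAGE_READY
            elif job.state is JobState.IMAGE_READY:
                job.instance = self.backend.launch(job.id, job.image)
                job.state = JobState.RUNNING
            elif job.state is JobState.RUNNING:
                status = self.backend.status(job.instance)
                if status == "exited_ok":
                    job.state = JobState.SUCCEEDED
                elif status == "exited_error":
                    job.state = JobState.FAILED


class FakeImageBuilder:
    """Test double: returns a made-up image reference without building anything."""

    def build(self, job_id: str) -> str:
        return f"registry.example/{job_id}:latest"


class FakeTEEBackend:
    """Test double: records launches and reports every instance as finished."""

    def __init__(self) -> None:
        self.launched: Dict[str, str] = {}

    def launch(self, job_id: str, image: str) -> str:
        instance = f"instance-{job_id}"
        self.launched[instance] = image
        return instance

    def status(self, instance: str) -> str:
        return "exited_ok"


if __name__ == "__main__":
    jobs = {"j1": Job(id="j1")}
    reconciler = JobReconciler(jobs, FakeImageBuilder(), FakeTEEBackend())
    for _ in range(3):
        reconciler.reconcile_once()
        print(jobs["j1"].state)
```

Because the reconciler reads jobs from the store instead of inferring them from live instances, the fakes are enough to exercise the full state machine without touching GCP or an image registry.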