isomerpages / isomercms-backend

A static website builder and host for the Singapore Government
5 stars 1 forks source link

feat(monitoring): add scheduler functionality #1383

Closed kishore03109 closed 3 months ago

kishore03109 commented 5 months ago

Problem

This is the second part of the monitoring feature that we want to build. This PR only cares about adding a scheduler + the related infra needed for this to function. this will make the monitor run once every 5 mins, for oncalls to pick any related alarms from this.

Adding the alarms is done in the downstream PR .

Solution

Using bullmq to conveniently create a queue, a worker and a repeatable job over multiple instances. We do some level of exponential backoff retries since it is a nice to have and easy to implement. The original /site-up code has since been refactored to return an err or a ok, depending on whether the configuration is ideal. Unfortunately, this caused quite a number of edge cases to pop up. Due to the nature of this, a more loose check of whether the isomer logo is present is being used to determine if a site is up. Even with this loose check, we have a workplacelearning.gov.sg who have modified their site to not have the Isomer logo. Have used gb to code white list this weird site. Potentially, if tomorrow we have an alarm of a site going down, but this is expected to prolong, we can go to growthbook and change the config for this to be whitelisted.

Breaking Changes

Tests

Screenshot 2024-05-21 at 11 20 59 AM

on deployment, assert that you see these logs. it is ok for there to be multiple instances of this log (it directly corresponds to the number of instances that we have) since bullmq is smart enough to only create one queue, and one repeatable job over multiple instances.

Deploy Notes

corresponding infra pr should be deployed to production and only then should the redis host value be populated into the 1pw for production.

Additionally, post approval of this pr, add two alarms, one for Error running monitoring service and another for Monitoring service has failed. These are errors when the job has failed to be initalised, and when there is a new error.

New environment variables:

New dependencies:

kishore03109 commented 5 months ago

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @kishore03109 and the rest of your teammates on Graphite Graphite

seaerchin commented 5 months ago

test failure from a missing keyCdn.apiKey! could i trouble you to fix this @kishore03109

kishore03109 commented 4 months ago

test failure from a missing keyCdn.apiKey! could i trouble you to fix this @kishore03109

added from downstream pr

kishore03109 commented 4 months ago

@seaerchin

this seems like a single producer, potentially multiple consumer case but it should be ok since we don't care about ordering.

hmm wait, no this is why we are using bullmq, right? it implements some sort of locking mechanism to ensure that only one consumer (instance) is doing the job at one time?

seaerchin commented 4 months ago

@seaerchin

this seems like a single producer, potentially multiple consumer case but it should be ok since we don't care about ordering.

hmm wait, no this is why we are using bullmq, right? it implements some sort of locking mechanism to ensure that only one consumer (instance) is doing the job at one time?

how bullmq works is (according to their docs, at least) that a worker must continually inform the queue that it's still working on the job. if this doesn't happen (thread is always busy on worker), then the job will be stalled and it's possible to have another worker working on it.

but for clarification, this isn't what i'm concerned about! my comment was wrt out of order consumption of jobs. i think in this case, we don't really care since the job is just to monitor all the site and ordering doesn't really matter

kishore03109 commented 3 months ago

Merge activity