isomerpages / isomercms-backend

A static website builder and host for the Singapore Government
5 stars 1 forks source link

Feat/add back repair form lock #1179

Closed alexanderleegs closed 7 months ago

alexanderleegs commented 7 months ago

This PR adds back the lock for site repair form. Previously we removed the lock due to the issues we were facing on prod - this was most likely due to the lack of response to forms causing unexpected 4xx errors, in addition to the large number of repos we were repairing at the same time. This PR adds in the response to forms so that we can avoid that issue.

In addition, the time for locking has been set at 15 minutes instead of a scaling amount as discussed offline - this is because repos are being locked and repaired concurrently and will unlock when their respective operation completes without having to wait for the entire set to complete.

alexanderleegs commented 7 months ago

I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors - as we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.

For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails

kishore03109 commented 7 months ago

I was unable to verify this, but we originally observed that the env health was degraded due to a high number of errors - as we were running this off-peak, it's possible that this was the only request going through, which caused the env health issues.

may i get a bit more context on why we couldnt verify this ya? did we look through the web.out logs during this time? maybe this might be to the multiple retries option by forms, and can be solved by just returning early?

For long running repos, it might indeed be a problem - i think this is a trade-off to consider! By having a longer lock time, it increases the risk of the user being locked out of their repo for a longer time in the case where the dynamodb query to unlock the repo fails

could we be more intentional about this? ie we take the worst case time + 5 mins? i am assuming we have the logic to delete the lock once the operation is done, this is in case something goes wrong and it takes a lot longer to resolve this?

also running of ideas here, could it be that the way the form is set up, it struggles to modify the content if there is a lock? my understanding here is that it is a no, since the lock logic is configured at a middleware level, and as thus cannot be affected by the lock logic

alexanderleegs commented 7 months ago

may i get a bit more context on why we couldnt verify this ya? did we look through the web.out logs during this time? maybe this might be to the multiple retries option by forms, and can be solved by just returning early?

There were no errors in the logs at the time! This was probably due to our logs only being for application level, whereas the error was mostly due to gateway timeout (due to us not returning a response to forms)

could we be more intentional about this? ie we take the worst case time + 5 mins? i am assuming we have the logic to delete the lock once the operation is done, this is in case something goes wrong and it takes a lot longer to resolve this?

Yup this is in an unhappy state where we fail to release the lock - normally once the operation is done the lock is released. We don't have an accurate time for the worst case scenario at the moment, I thought 15 minutes is a reasonable estimate for a single repo

also running of ideas here, could it be that the way the form is set up, it struggles to modify the content if there is a lock? my understanding here is that it is a no, since the lock logic is configured at a middleware level, and as thus cannot be affected by the lock logic

There's a check for lock status at the start of the process - in the case that it's unable to acquire the lock, the repo is immediately goes to the failure state, which is handled gracefully and is represented in the email sent to users