"Maintenance Mode" for user accounts

kennsippell commented 9 months ago

Is your feature request related to a problem? Please describe.

Loss of Health Data - Today, there is a risk of data loss any time a user manager: 1) moves a user's area to a different spot in the hierarchy, 2) disables a user account, or 3) replaces a user with another. You can see data loss happening in live projects in issues like https://github.com/medic/config-pih/issues/719, where hundreds of denied replications occurred over a month-long period.

Burdensome Human Coordination - In our documentation for move-contacts, we require that "users must be encouraged to clear cache and resync!" to avoid this sort of data loss. Users need to do this before the move-contacts command is executed, and coordinating these sorts of activities with users/devs is very time-consuming.

Also - when you run multiple move-contacts commands, you can take down a server like in https://github.com/medic/config-muso/issues/932 where the server was down for 12 days. This makes coordination even more difficult. For Uganda eCHIS where the entire nation is on one instance, how do you ensure that everybody who is moving contacts is talking to everybody else?

These programmatic steps required to do user management safely are becoming increasingly difficult at scale. Without better tooling, project teams do not have time to coordinate these activities and have no option but to accept the risk of data loss.

Describe the solution you'd like

We are creating automation to improve user management scenarios with cht-user-management. A noteworthy example on the roadmap is a UI and cloud-based execution of move-contacts commands, which aims to execute those commands safely. https://github.com/medic/cht-user-management/issues/12

This issue tracks a request to create some sort of "maintenance mode" for user accounts which will allow automation to perform operations on them without data loss.

Something like:

1. Automation can set a flag on a user to "put into maintenance mode"
2. Next time the user syncs their data, the user is automatically logged out after the sync completes successfully
3. All data is cleared from the user's device
4. User cannot log in, and should see an error like "Your account is in maintenance mode"
5. The user's account is flagged so automation knows the user has synced (account maintenance is now safe).
6. In the example above, this is when move-contacts could be safely executed.
7. Automation removes the flag keeping the account in maintenance mode
8. User can now log in. Maybe with their original credentials, maybe with a resent magic link, etc.
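
One way to picture the lifecycle proposed above is as a small state machine. The following is only a sketch of the idea: the state names, event names, and helper functions are hypothetical and do not correspond to anything that exists in cht-core today.

```python
from enum import Enum


class MaintenanceState(Enum):
    # Hypothetical states for the proposed feature (not an existing CHT API).
    ACTIVE = "active"                  # normal operation
    FLAGGED = "flagged"                # automation set the flag; user has not synced yet
    IN_MAINTENANCE = "in_maintenance"  # user synced and was logged out; device data cleared


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (MaintenanceState.ACTIVE, "flag_set"): MaintenanceState.FLAGGED,
    (MaintenanceState.FLAGGED, "sync_completed"): MaintenanceState.IN_MAINTENANCE,
    (MaintenanceState.IN_MAINTENANCE, "flag_removed"): MaintenanceState.ACTIVE,
}


def next_state(state: MaintenanceState, event: str) -> MaintenanceState:
    """Apply an event, rejecting anything the workflow does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.name}") from None


def login_blocked(state: MaintenanceState) -> bool:
    """Assumption: login is refused only once the user has synced and been logged out."""
    return state is MaintenanceState.IN_MAINTENANCE


def safe_for_maintenance(state: MaintenanceState) -> bool:
    """Operations like move-contacts are safe only after the final sync has completed."""
    return state is MaintenanceState.IN_MAINTENANCE
```

The point of the intermediate FLAGGED state is that automation must wait for the device's final sync before touching the account; setting the flag alone does not make maintenance safe.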

jkuester commented 9 months ago

Thanks to @mrjones-plip's prompting, I took a closer look at the feasibility of leveraging the existing create_user_for_contacts transition as a sort of "maintenance mode" for a user (leveraging the user replace functionality). The TLDR is that it seems very possible to leverage this functionality to solve many of the most challenging aspects of the workflow described above!

The fundamental principle behind this approach is that, to an end user, there is not much difference between putting their user in "maintenance mode" (so they cannot log in) and then taking them back out again, versus just disabling their original user and creating a new one for them (besides the obvious difference of not being able to re-use credentials). The tricky part of both scenarios is making sure we don't lose data when initially logging the user out (and this is where the create_user_for_contacts functionality comes in).

Just now I ran the following exercise (and the same should work on any >=4.1.0 CHT instance):

Enable create_user_for_contacts transition with documented configuration for app_url, token_login, and transitions (no need to configure any replace_forms).
Log in as an offline user (chw_a) associated with a contact (contact_a).
Make chw_a's device go offline so it is no longer syncing to the server.
On chw_a's device, add new contacts/reports that do not exist on the server.

On the server (e.g. via Fauxton), update contact_a's contact doc to have:

    "user_for_contact": {
        "replace": {
            "chw_a": {
                "status": "PENDING",
                "replacement_contact_id": "contact_a"
            }
        }
    },

Bring the device for chw_a back online and sync.
- Immediately upon completion of the sync, the device updates the user_for_contact status from PENDING to READY, syncs that change to the contact, and logs out the user on the device.
The status change on the contact triggers Sentinel to change the password for chw_a (to a random value). This invalidates all existing sessions for that user and prevents any more data from being synced. Then Sentinel will create a new user (chw_b) associated with whatever contact was set for replacement_contact_id (in this case it would still just be contact_a).

Once the status on the server's copy of contact_a reaches COMPLETE, you can be confident that all of the user's data was synced and the user is now logged out. This would be the point where you could safely perform move-contacts operations that would affect the user. Once those operations are complete, you can provide the CHW with the credentials for the new user. When they log in as the new user, they will do a fresh sync of data from the server.
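
For automation, the two touch points in this exercise are writing the PENDING entry and later checking for COMPLETE. Here is a minimal sketch of both as pure functions over a contact doc held as a Python dict. The function names are hypothetical, and how the doc is fetched from and saved to the server (Fauxton was used above) is deliberately left out.

```python
import copy
from typing import Optional


def mark_pending(contact_doc: dict, username: str, replacement_contact_id: str) -> dict:
    """Return a copy of the contact doc with the replace entry for `username` set to PENDING."""
    doc = copy.deepcopy(contact_doc)  # do not mutate the caller's copy
    replace = doc.setdefault("user_for_contact", {}).setdefault("replace", {})
    replace[username] = {
        "status": "PENDING",
        "replacement_contact_id": replacement_contact_id,
    }
    return doc


def replace_status(contact_doc: dict, username: str) -> Optional[str]:
    """Read user_for_contact.replace.<username>.status, or None if it is absent."""
    return (
        contact_doc.get("user_for_contact", {})
        .get("replace", {})
        .get(username, {})
        .get("status")
    )


def is_safe_for_maintenance(contact_doc: dict, username: str) -> bool:
    """True once the server's copy of the contact shows COMPLETE for this user."""
    return replace_status(contact_doc, username) == "COMPLETE"


# Usage with the names from the exercise above.
contact_a = {"_id": "contact_a", "name": "CHW A"}
flagged = mark_pending(contact_a, "chw_a", "contact_a")
print(replace_status(flagged, "chw_a"))
print(is_safe_for_maintenance(flagged, "chw_a"))
```

Checking for exactly COMPLETE, rather than "anything other than PENDING", matters because the intermediate READY status only means the device has finished syncing, not that Sentinel has finished its part.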

Caveats:

The main rough edge in this process is that once Sentinel finishes writing the new user, it will trigger an outbound message to the phone number on the replacement_contact_id contact containing the new token login link. Really, you want to send this link only once all your move-contacts operations are complete (so the user cannot inadvertently log in too early). If you have no SMS gateway configured, then the message will not actually be delivered. Another workaround would be to set a custom (dummy) value for the phone number on the replacement_contact_id contact so the message would not be delivered to the CHW.
While it would be technically possible to update the password (or generate a token login link) for the chw_a user and rehabilitate that user once all the move-contacts operations are done, there are a few challenges to this. There is no way (at this point) to prevent the creation of the new user (chw_b), so extra users would be added one way or another. Also, if the CHW tries to log back into chw_a on their original device (without uninstalling the app or clearing the data), the data on the device from chw_a will still exist and a fresh sync will not be done. So, the safest approach is just switching to a new user.

✅ Automation can set a flag on a user to "put into maintenance mode" _(set user_for_contact.replace...status = 'PENDING')_
✅ Next time the user syncs their data, the user is automatically logged out after the sync completes successfully
☑ All data is cleared from the user's device (data is technically not cleared, but a new user would trigger a fresh sync)
☑ User cannot login, should see an error like "Your account is in maintenance mode" (Currently no nice messages, but user would be automatically logged out.)
✅ The user's account is flagged so automation knows the user has synced (account maintenance is now safe). _(Can watch for user_for_contact.replace...status = 'COMPLETE')_
✅ In the example above, this is when move-contacts could be safely executed.
☑ Automation removes the flag keeping the account in maintenance mode (Just switching to new user)
☑ User can now login. ~~Maybe with their original credentials~~, maybe with a resent magic link, etc.
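
Pulling the checklist together, automation would roughly: set PENDING, wait for COMPLETE, and only then run the maintenance operation. The sketch below shows that ordering with the server interactions injected as callables. Everything here is hypothetical (function name, parameters, polling defaults); it is not part of cht-core or cht-user-management.

```python
import time
from typing import Callable, Optional


def run_with_maintenance(
    set_pending: Callable[[], None],
    get_status: Callable[[], Optional[str]],
    do_maintenance: Callable[[], None],
    poll_seconds: float = 60.0,
    max_polls: int = 1440,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Flag the user, wait until their status is COMPLETE, then run maintenance.

    Returns True if maintenance ran, or False if the user never reached
    COMPLETE within max_polls checks (for example, the device never synced).
    """
    set_pending()
    for _ in range(max_polls):
        if get_status() == "COMPLETE":
            # The user has synced and been logged out, so it is now safe
            # to run operations such as move-contacts.
            do_maintenance()
            return True
        sleep(poll_seconds)
    return False
```

Giving up after a bounded number of polls is a design choice for the sketch: a CHW's device may stay offline for a long time, and the automation needs some way to report that maintenance has not happened yet rather than waiting forever.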

Obviously none of this is an ideal workflow and it does not address any of the problems at the heart of move-contacts being so painful (looking to https://github.com/medic/cht-core/issues/6543 to maybe offer a glimmer of hope in that regard). But, it is functionality that already exists today in the CHT that I think could be leveraged to build a viable "maintenance mode" workflow.

@mrjones-plip please add any additional comments/questions that I have missed!

mrjones-plip commented 9 months ago

@jkuester - thanks so much for the deep dive on if my harebrained idea might work! I have nothing more to add.

@kennsippell - let me know if you'd like some help prototyping any of this!

kennsippell commented 9 months ago

Thanks guys. I'll check out this very interesting proposal.

medic / cht-core

"Maintenance Mode" for user accounts #8860