hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

Automate creating dev DB contents in h and lms #1045

Closed: seanh closed this 5 years ago

seanh commented 5 years ago

Belongs to EPIC: https://github.com/hypothesis/product-backlog/issues/1044

Problem: Currently you have to manually create a lot of stuff (users, groups, authclients, ...) in your h and lms DBs in order to get those two apps working together in dev.

Requirements:

See also:

hmstepanek commented 5 years ago

I did some research on this yesterday. You can attach a volume to each Postgres container and it will output its data into the volume (i.e. a mounted local directory). This could then be committed somewhere that users could pull it down from. As mentioned above, the issue with this is the committed directory containing sensitive information, although it might be possible to put it in a private or encrypted repo like we use for other sensitive information. The sensitive things here are secrets used to connect/authenticate to external services like LMS or Google.

As mentioned above, you also have to consider how migrations would work, since this would be a base data set that devs might build upon locally. While their data would persist locally, handling updates to the base data set and merging those changes in might require some more thought. I'd assert, however, that the base data set isn't likely to change very often. One solution that comes to mind would be to have migration scripts set up for the base data set just like we do for production, but that would require a lot of maintenance overhead.

As a first step, in order to scope this down to something less complicated and doable, I'd recommend not handling migrations of the base data set. Having a default data set is a step up from what we have now, and I think we should start there and iterate on it.

seanh commented 5 years ago

I'd like us to consider an alternative suggestion -- add a make devdata command, backed by a Python script, that inserts the development data into the DB if it doesn't exist already, as in the diff below.

The script should be written to create the development data (users, groups, organizations, authclients, ...) if it doesn't already exist, or update it to have the right values if it's incorrect, while not touching any other data in the DB.

This approach seems to have a lot of advantages:

I'm having a hard time thinking of any downsides to this approach.

Below is a rough diff that works as a proof-of-concept for this, adding some dev users. See https://github.com/hypothesis/h/pull/4963 for a more complete example.

commit 03a4aced59b11bd1a0f8d7cc34b761d7c7e15efb (HEAD -> devdata)
Author: Sean Hammond <git@snhmnd.fastmail.com>
Date:   Tue Aug 27 11:17:36 2019 +0100

    Add devdata command

diff --git a/Makefile b/Makefile
index bbfbe75ee..eb99280a4 100644
--- a/Makefile
+++ b/Makefile
@@ -4,6 +4,7 @@ help:
    @echo 'make services          Run the services that `make dev` requires'
    @echo "                       (Postgres, Elasticsearch, etc) in Docker Compose"
    @echo "make dev               Run the app in the development server"
+   @echo "make devdata           Insert standard development data into the DB"
    @echo "make shell             Launch a Python shell in the dev environment"
    @echo "make sql               Connect to the dev database with a psql shell"
    @echo "make lint              Run the code linter(s) and print any warnings"
@@ -37,6 +38,10 @@ services:
 dev: build/manifest.json python
    tox -q -e py36-dev

+.PHONY: devdata
+devdata: python
+   @tox -q -e py36-dev -- sh bin/hypothesis --dev devdata
+
 .PHONY: shell
 shell: python
    tox -q -e py36-dev -- sh bin/hypothesis --dev shell
diff --git a/h/cli/__init__.py b/h/cli/__init__.py
index 7e5434625..4f2a5e47e 100644
--- a/h/cli/__init__.py
+++ b/h/cli/__init__.py
@@ -17,6 +17,7 @@ SUBCOMMANDS = (
     "h.cli.commands.annotation_id.annotation_id",
     "h.cli.commands.authclient.authclient",
     "h.cli.commands.celery.celery",
+    "h.cli.commands.devdata.devdata",
     "h.cli.commands.devserver.devserver",
     "h.cli.commands.init.init",
     "h.cli.commands.initdb.initdb",
diff --git a/h/cli/commands/devdata.py b/h/cli/commands/devdata.py
new file mode 100644
index 000000000..2504ac5b7
--- /dev/null
+++ b/h/cli/commands/devdata.py
@@ -0,0 +1,58 @@
+import click
+
+from h import models
+
+
+def create_all_dev_data(request):
+    """Create all the dev data that doesn't exist yet."""
+    factory = DevDataFactory(request)
+
+    factory.user("user")
+    factory.user("admin", admin=True)
+    factory.user("staff", staff=True)
+
+    request.tm.commit()
+
+
+class DevDataFactory(object):
+    """A class that creates development data if it doesn't already exist."""
+
+    def __init__(self, request):
+        self.request = request
+        self.db = self.request.db
+        self.authority = "hypothes.is"
+
+    def user(self, username, admin=False, staff=False):
+        """Return the User with the given username and authority. Create it if necessary."""
+
+        def existing_user():
+            """Return the existing user with the given username and authority, or None."""
+            return (
+                self.db.query(models.User)
+                .filter_by(username=username, authority=self.authority)
+                .one_or_none()
+            )
+
+        def new_user():
+            """Create and return a new user with the given username and authority."""
+            signup_service = self.request.find_service(name="user_signup")
+            user = signup_service.signup(
+                username=username,
+                authority=self.authority,
+                admin=admin,
+                staff=staff,
+                email=f"{username}@example.com",
+                password="pass",
+                require_activation=False,
+            )
+            click.echo(f"Created user {user}")
+            return user
+
+        return existing_user() or new_user()
+
+
+@click.command()
+@click.pass_context
+def devdata(ctx):
+    request = ctx.obj["bootstrap"]()
+    create_all_dev_data(request)
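
The diff above only covers the "create it if it doesn't already exist" half of the proposal. As a rough sketch of the "update it to have the right values if it's incorrect" half, the user step could be written as an upsert along these lines (hypothetical code, not part of the diff above or the linked PRs; upsert_dev_user is a made-up helper name, and the service calls simply mirror the ones in the diff):

import click

from h import models


def upsert_dev_user(request, username, admin=False, staff=False, authority="hypothes.is"):
    """Create the named dev user if missing, or fix its flags if they've drifted.

    Only the row matching (username, authority) is ever touched, so any custom
    data a developer has added to their local DB is left alone.
    """
    user = (
        request.db.query(models.User)
        .filter_by(username=username, authority=authority)
        .one_or_none()
    )

    if user is None:
        # Same creation path as new_user() in the diff above.
        signup_service = request.find_service(name="user_signup")
        user = signup_service.signup(
            username=username,
            authority=authority,
            admin=admin,
            staff=staff,
            email=f"{username}@example.com",
            password="pass",
            require_activation=False,
        )
        click.echo(f"Created user {user}")
    elif user.admin != admin or user.staff != staff:
        # The "update" half: correct the standard dev user's flags in place.
        user.admin = admin
        user.staff = staff
        click.echo(f"Updated user {user}")

    return user
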
hmstepanek commented 5 years ago

One downside I see to a script is that now we are expecting devs to maintain their own version of this script, which doesn't seem to fit into existing developer workflows. I can't speak for everyone, but for me at least most of the data I add is not created through scripting but by adding data via expected user paths (UI in the client, etc.), and my biggest complaint is that I have to start from scratch every time I kill my Docker container. Docker volumes output all of my data directly into a directory, so even if I kill all my Docker containers the data is still there and I don't have to start from scratch again. In the workflow you are describing, it sounds like the only way to do this would be to modify a Python script, which means less time spent developing and more time maintaining and managing my dev data set. You also have to consider the overhead of adding all that data every time you run make dev.

We can have make dev run make devdata automatically so in fact it'll be completely automatic: developers never need to do anything, and just always have the correct dev data in their DB

Generally the script can be re-run safely whenever you want and it'll just create or update the dev data, or do nothing if there's nothing to do, and never touch your other data

I'm having trouble picturing this in practice. It sounds like this script is trying to be a database migration script without following database migration practices. The script would have to be able to determine what data is there and what data needs to be updated, and I think this would lead to a complicated and fragile implementation. Either you keep it simple and remove all the data each time and start from scratch, or you attempt to handle data modifications outside of the script's control and potentially fail in unexpected ways. If it does remove all the developer's local data each time make dev is run and sets it back to the default data set, it would effectively force the dev to maintain their own version of this script if they want anything but the default data set.

For private data a Python script can clone a private GitHub repo containing a JSON file to a temporary directory

This feels very invisible and fragile to me. Are there permissions issues the user could run into here? What about credential caching for GitHub? What if the user is working offline?

Docker has this concept of secrets, which are files you can add as a requirement to the docker-compose file, where the user provides the secrets. If a file isn't provided, docker-compose will fail, notifying you that you must provide the file containing the secret. It seems like it might be a good fit here. Of course, you would then need some script to take these secrets and put them in the appropriate place in the DB, but you'd only have to do that once when initializing the DB.

robertknight commented 5 years ago

One downside I see to a script is that now we are expecting devs to maintain their own version of this script which doesn't seem to fit into existing developer workflows.

I don't follow. Why would developers be expected to maintain their own version of this script? My understanding is that the point of this script is to populate the database with a standard, useful set of entities for various common testing scenarios. If the developer wants to create any additional entities of their own which are not part of the standard set, that's beyond the scope of this issue and they can do it in whatever way suits them.

hmstepanek commented 5 years ago

I don't follow. Why would developers be expected to maintain their own version of this script? My understanding is that the point of this script is to populate the database with a standard, useful set of entities for various common testing scenarios.

@robertknight

We can have make dev run make devdata automatically so in fact it'll be completely automatic: developers never need to do anything, and just always have the correct dev data in their DB

^ This makes it sound like the script is going to be run every time the user runs make dev so it would have the potential to remove local changes the developer has made.

hmstepanek commented 5 years ago

I'm thinking of some use cases here:

For LMS, when working on things like the error shown when the LMS URL is not correct on the server, for example, I actually go into the DB and edit the LMS URL. I don't add a new entry, as that would require me to then go into LMS and register that new entry. Instead, I just modify an existing one I'm already using. This would most likely not be compatible with using this script.

LMS is a pain because there are so many registrations that need to happen across the different services: H, LMS, and the given LMS instance all have to have data synchronized across them. It would be nice if we made something like what I have going locally, where the data persists across all three of these services and H, LMS, and Canvas all use the same secrets, tokens, etc., so I don't have to keep re-registering them. Would this script run across all the services, or would there be one script for each database?

hmstepanek commented 5 years ago

I think I'm approaching this more from the angle of persistence of data than automatic creation of data. Certainly creating the data in the first place is annoying, but what's more annoying to me is that once it's created it can be lost again if the database container is killed.

I'd like to understand more about how this script is intended to be used and what it handles and what it doesn't. Just based on the above bullet points it sounds like it does many things:

  1. It adds all the data tables to the db
  2. It adds an initial dataset
  3. It re-adds missing/broken data between make dev runs.
  4. It performs migrations of the initial dataset.
  5. It does this all without touching additional data the user has created.

Questions: How would we implement 3 and 5? Would this support the following workflow:

For LMS, when working on things like the error shown when the LMS URL is not correct on the server, for example, I actually go into the DB and edit the LMS URL. I don't add a new entry, as that would require me to then go into LMS and register that new entry. Instead, I just modify an existing one I'm already using.

A script to add the initial dataset is relatively simple and small in scope but a script that does 3, 4, and 5 seems like it could get pretty complicated and fragile pretty fast.

seanh commented 5 years ago

I think I'm approaching this more from a persistence of data than from an automatic creation of data

That's a separate problem. This card is titled Automate creating dev DB contents.

Can we create a separate card for being able to save and then later reload a snapshot of your entire DB? That can already be done by using pg_dump to dump the contents of your DB to a file and then pg_restore to reload it. I'd recommend that approach: it's Postgres's standard way of backing up and restoring DB contents, and it handles backup consistency in the face of concurrent usage, portability, loading backups into a different version of Postgres, partial restores, etc. The Postgres people have solved the problem of backing up and restoring Postgres DBs and we'd be well advised not to second-guess them. We could add convenience aliases like make dbdump and make dbrestore. But let's please discuss it on a separate card as it's a separate issue.
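
For illustration, those make dbdump / make dbrestore aliases could be thin wrappers around pg_dump and pg_restore, along these lines (a sketch only; the connection URL and dump filename are assumptions, not taken from h's Makefile):

import os
import subprocess

# Assumed defaults for illustration -- point DATABASE_URL at your own dev DB.
DB_URL = os.environ.get("DATABASE_URL", "postgresql://postgres@localhost/postgres")
DUMP_FILE = "dev.dump"


def dbdump():
    """Save the whole dev DB to a file using pg_dump's custom format."""
    subprocess.run(["pg_dump", "--format=custom", "--file", DUMP_FILE, DB_URL], check=True)


def dbrestore():
    """Reload a previously saved dump, dropping and recreating existing objects."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--dbname", DB_URL, DUMP_FILE],
        check=True,
    )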

seanh commented 5 years ago

I've pushed a proof-of-concept to a pre-closed PR, to help answer some of the questions raised about a script-based approach: https://github.com/hypothesis/h/pull/5698/

This only creates three users; it doesn't create all the users, groups, memberships, organizations, authclients, etc. that are needed. But I think it should be enough to demonstrate the concept.

Some notes on this POC:

seanh commented 5 years ago

If I understand the Docker volume-based solution correctly, it basically involves someone setting up their local DB with all the right data manually and then saving a Docker image of that data, which everyone else will then pull down. That sounds to me like it's not going to work well with existing custom data in developers' DBs (you would lose all your data each time you load or re-load the dev data), and also like updating the standard dev data set is going to be very awkward (how would you create a PR that adds one more auth client to the standard set? Once merged, how would the rest of the team insert the new auth client without losing existing custom data in their local DBs?). I think it would be equivalent to the existing https://github.com/hypothesis/devdata solution (which is based on SQL dumps): no one uses that because they can't insert the dev data into their DB without destroying their existing data, and updating the SQL dump to add to or modify the standard dev data is awkward.

I think the key properties of the script-based approach are that you can always just run make devdata without thinking, and it will create or update the standard dev data as necessary and never touch any custom data. This is necessary in order to get the team to actually use it, IMO. And updating the standard set of dev data is also easy: just change the Python script (or the JSON values in the private repo) and send a PR.
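
As a sketch of the private-repo part (the repo URL and JSON filename below are placeholders for illustration, not the actual private repo), the script could fetch the sensitive values like this:

import json
import subprocess
import tempfile
from pathlib import Path

# Placeholder URL -- the real private repo's name and location are not shown here.
PRIVATE_DEVDATA_REPO = "git@github.com:hypothesis/devdata-private.git"


def load_private_devdata():
    """Clone the private devdata repo to a temp dir and return its JSON values."""
    with tempfile.TemporaryDirectory() as tmpdir:
        subprocess.run(["git", "clone", "--depth=1", PRIVATE_DEVDATA_REPO, tmpdir], check=True)
        return json.loads((Path(tmpdir) / "devdata.json").read_text())

Because the clone lives in a TemporaryDirectory, it's cleaned up as soon as the function returns, so no secrets are left sitting in the working tree.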