hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

Automate creating dev DB contents in h and lms #1045

Closed: seanh closed this 5 years ago

seanh commented 5 years ago

Belongs to EPIC: https://github.com/hypothesis/product-backlog/issues/1044

Problem: Currently you have to manually create a lot of stuff (users, groups, authclients, ...) in your h and lms DBs in order to get those two apps working together in dev.

Requirements:

See also:

hmstepanek commented 5 years ago

I did some research on this yesterday. You can attach a volume to each Postgres container and it will output its data into the volume (i.e. a mounted local directory). This could then be committed somewhere that users could pull it down from. As mentioned above, the issue with this is the committed directory containing sensitive information, although it might be possible to put it in a private or encrypted repo like we use for other sensitive information. The sensitive things here are secrets used to connect/authenticate to external services like LMS or Google.

As mentioned above, you also have to consider how migrations would work, since this would be a base data set that devs might build upon locally. While their data would persist locally, handling updates to the base data set and merging those changes in might require some more thought. I'd assert, however, that the base data set isn't likely to change very often. One solution that comes to mind would be to have migration scripts set up for the base data set just like we do for production, but that would require a lot of maintenance overhead.

As a first step, in order to scope this down to something less complicated and doable, I'd recommend not handling migrations of the base data set. Having a default data set is a step up from what we have now, and I think we should start there and iterate on it.

seanh commented 5 years ago

I'd like us to consider an alternative suggestion -- add a make devdata command, backed by a Python script, that inserts the development data into the DB if it doesn't exist already, as in the diff below.

The script should be written to create the development data (users, groups, organizations, authclients, ...) if it doesn't already exist, or update it to have the right values if it's incorrect, while not touching any other data in the DB.

This approach seems to have a lot of advantages:

I'm having a hard time thinking of any downsides to this approach.

Below is a rough diff that works as a proof-of-concept for this, adding some dev users. See https://github.com/hypothesis/h/pull/4963 for a more complete example.

commit 03a4aced59b11bd1a0f8d7cc34b761d7c7e15efb (HEAD -> devdata)
Author: Sean Hammond <git@snhmnd.fastmail.com>
Date:   Tue Aug 27 11:17:36 2019 +0100

    Add devdata command

diff --git a/Makefile b/Makefile
index bbfbe75ee..eb99280a4 100644
--- a/Makefile
+++ b/Makefile
@@ -4,6 +4,7 @@ help:
    @echo 'make services          Run the services that `make dev` requires'
    @echo "                       (Postgres, Elasticsearch, etc) in Docker Compose"
    @echo "make dev               Run the app in the development server"
+   @echo "make devdata           Insert standard development data into the DB"
    @echo "make shell             Launch a Python shell in the dev environment"
    @echo "make sql               Connect to the dev database with a psql shell"
    @echo "make lint              Run the code linter(s) and print any warnings"
@@ -37,6 +38,10 @@ services:
 dev: build/manifest.json python
    tox -q -e py36-dev

+.PHONY: devdata
+devdata: python
+   @tox -q -e py36-dev -- sh bin/hypothesis --dev devdata
+
 .PHONY: shell
 shell: python
    tox -q -e py36-dev -- sh bin/hypothesis --dev shell
diff --git a/h/cli/__init__.py b/h/cli/__init__.py
index 7e5434625..4f2a5e47e 100644
--- a/h/cli/__init__.py
+++ b/h/cli/__init__.py
@@ -17,6 +17,7 @@ SUBCOMMANDS = (
     "h.cli.commands.annotation_id.annotation_id",
     "h.cli.commands.authclient.authclient",
     "h.cli.commands.celery.celery",
+    "h.cli.commands.devdata.devdata",
     "h.cli.commands.devserver.devserver",
     "h.cli.commands.init.init",
     "h.cli.commands.initdb.initdb",
diff --git a/h/cli/commands/devdata.py b/h/cli/commands/devdata.py
new file mode 100644
index 000000000..2504ac5b7
--- /dev/null
+++ b/h/cli/commands/devdata.py
@@ -0,0 +1,58 @@
+import click
+
+from h import models
+
+
+def create_all_dev_data(request):
+    """Create all the dev data that doesn't exist yet."""
+    factory = DevDataFactory(request)
+
+    factory.user("user")
+    factory.user("admin", admin=True)
+    factory.user("staff", staff=True)
+
+    request.tm.commit()
+
+
+class DevDataFactory(object):
+    """A class that creates development data if it doesn't already exist."""
+
+    def __init__(self, request):
+        self.request = request
+        self.db = self.request.db
+        self.authority = "hypothes.is"
+
+    def user(self, username, admin=False, staff=False):
+        """Return the User with the given username and authority. Create it if necessary."""
+
+        def existing_user():
+            """Return the existing user with the given username and authority, or None."""
+            return (
+                self.db.query(models.User)
+                .filter_by(username=username, authority=self.authority)
+                .one_or_none()
+            )
+
+        def new_user():
+            """Create and return a new user with the given username and authority."""
+            signup_service = self.request.find_service(name="user_signup")
+            user = signup_service.signup(
+                username=username,
+                authority=self.authority,
+                admin=admin,
+                staff=staff,
+                email=f"{username}@example.com",
+                password="pass",
+                require_activation=False,
+            )
+            click.echo(f"Created user {user}")
+            return user
+
+        return existing_user() or new_user()
+
+
+@click.command()
+@click.pass_context
+def devdata(ctx):
+    request = ctx.obj["bootstrap"]()
+    create_all_dev_data(request)
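
The diff above only covers the "create it if it doesn't already exist" half of the proposal. As a rough sketch of the "update it to have the right values if it's incorrect" half, the user step could be written as an upsert along these lines (hypothetical code, not part of the diff above or the linked PRs; upsert_dev_user is a made-up helper name, and the service calls simply mirror the ones in the diff):

import click

from h import models


def upsert_dev_user(request, username, admin=False, staff=False, authority="hypothes.is"):
    """Create the named dev user if missing, or fix its flags if they've drifted.

    Only the row matching (username, authority) is ever touched, so any custom
    data a developer has added to their local DB is left alone.
    """
    user = (
        request.db.query(models.User)
        .filter_by(username=username, authority=authority)
        .one_or_none()
    )

    if user is None:
        # Same creation path as new_user() in the diff above.
        signup_service = request.find_service(name="user_signup")
        user = signup_service.signup(
            username=username,
            authority=authority,
            admin=admin,
            staff=staff,
            email=f"{username}@example.com",
            password="pass",
            require_activation=False,
        )
        click.echo(f"Created user {user}")
    elif user.admin != admin or user.staff != staff:
        # The "update" half: correct the standard dev user's flags in place.
        user.admin = admin
        user.staff = staff
        click.echo(f"Updated user {user}")

    return user
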
hmstepanek commented 5 years ago

One downside I see to a script is that now we are expecting devs to maintain their own version of this script, which doesn't seem to fit into existing developer workflows. I can't speak for everyone, but for me at least most of the data I add is not created through scripting but by adding data via expected user paths (UI in the client, etc.), and my biggest complaint is that I have to start from scratch every time I kill my Docker container. Docker volumes output all of my data directly into a directory, so even if I kill all my Docker containers the data is still there and I don't have to start from scratch again. In the workflow you are describing, it sounds like the only way to do this would be to modify a Python script, which means less time spent developing and more time maintaining and managing my dev data set. You also have to consider the overhead of adding all that data every time you run make dev.

We can have make dev run make devdata automatically so in fact it'll be completely automatic: developers never need to do anything, and just always have the correct dev data in their DB

Generally the script can be re-run safely whenever you want and it'll just create or update the dev data, or do nothing if there's nothing to do, and never touch your other data

I'm having trouble picturing this in practice. It sounds like this script is trying to be a database migration script without following database migration practices. The script would have to be able to determine what data is there and what data needs to be updated, and I think this would lead to a complicated and fragile implementation. Either you keep it simple and remove all the data each time and start from scratch, or you attempt to handle data modifications outside of the script's control and potentially fail in unexpected ways. If it does remove all the developer's local data each time make dev is run and sets it back to the default data set, it would effectively force the dev to maintain their own version of this script if they want anything but the default data set.

For private data a Python script can clone a private GitHub repo containing a JSON file to a temporary directory

This feels very invisible and fragile to me. Are there permissions issues the user could run into here? What about credential caching for GitHub? What if the user is working offline?

Docker has this concept of secrets, which are files you can add as a requirement to the docker-compose file, where the user provides the secrets. If a file isn't provided, docker-compose will fail, notifying you that you must provide the file containing the secret. It seems like it might be a good fit here. Of course, you would then need some script to take these secrets and put them in the appropriate place in the DB, but you'd only have to do that once when initializing the DB.

robertknight commented 5 years ago

One downside I see to a script is that now we are expecting devs to maintain their own version of this script which doesn't seem to fit into existing developer workflows.

I don't follow. Why would developers be expected to maintain their own version of this script? My understanding is that the point of this script is to populate the database with a standard, useful set of entities for various common testing scenarios. If the developer wants to create any additional entities of their own which are not part of the standard set, that's beyond the scope of this issue and they can do it in whatever way suits them.

hmstepanek commented 5 years ago

I don't follow. Why would developers be expected to maintain their own version of this script? My understanding is that the point of this script is to populate the database with a standard, useful set of entities for various common testing scenarios.

@robertknight

We can have make dev run make devdata automatically so in fact it'll be completely automatic: developers never need to do anything, and just always have the correct dev data in their DB

^ This makes it sound like the script is going to be run every time the user runs make dev so it would have the potential to remove local changes the developer has made.

hmstepanek commented 5 years ago

I'm thinking of some use cases here:

For LMS, when working on things like the error shown when the LMS URL is not correct on the server, for example, I actually go into the DB and edit the LMS URL. I don't add a new entry, as that would require me to then go into LMS and register that new entry. Instead, I just modify an existing one I'm already using. This would most likely not be compatible with using this script.

LMS is a pain because there are so many registrations that need to happen across the different services: H, LMS, and the given LMS instance all have to have data synchronized across them. It would be nice if we made something like what I have going locally, where the data persists across all three of these services and H, LMS, and Canvas all use the same secrets, tokens, etc., so I don't have to keep re-registering them. Would this script run across all the services, or would there be one script for each database?

hmstepanek commented 5 years ago

I think I'm approaching this more from the angle of persistence of data than automatic creation of data. Certainly creating the data in the first place is annoying, but what's more annoying to me is that once it's created it can be lost again if the database container is killed.

I'd like to understand more about how this script is intended to be used and what it handles and what it doesn't. Just based on the above bullet points it sounds like it does many things:

  1. It adds all the data tables to the db
  2. It adds an initial dataset
  3. It re-adds missing/broken data between make dev runs.
  4. It performs migrations of the initial dataset.
  5. It does this all without touching additional data the user has created.

Questions: How would we implement 3 and 5? Would this support the following workflow:

For LMS, when working on things like the error shown when the LMS URL is not correct on the server, for example, I actually go into the DB and edit the LMS URL. I don't add a new entry, as that would require me to then go into LMS and register that new entry. Instead, I just modify an existing one I'm already using.

A script to add the initial dataset is relatively simple and small in scope but a script that does 3, 4, and 5 seems like it could get pretty complicated and fragile pretty fast.

seanh commented 5 years ago

I think I'm approaching this more from a persistence of data than from an automatic creation of data

That's a separate problem. This card is titled Automate creating dev DB contents.

Can we create a separate card for being able to save and then later reload a snapshot of your entire DB? That can already be done by using pg_dump to dump the contents of your DB to a file and then pg_restore to reload it. I'd recommend that approach: it's Postgres's standard way of backing up and restoring DB contents, and it handles backup consistency in the face of concurrent usage, portability, loading backups into a different version of Postgres, partial restores, etc. The Postgres people have solved the problem of backing up and restoring Postgres DBs and we'd be well advised not to second-guess them. We could add convenience aliases like make dbdump and make dbrestore. But let's please discuss it on a separate card as it's a separate issue.
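
For illustration, those make dbdump / make dbrestore aliases could be thin wrappers around pg_dump and pg_restore, along these lines (a sketch only; the connection URL and dump filename are assumptions, not taken from h's Makefile):

import os
import subprocess

# Assumed defaults for illustration -- point DATABASE_URL at your own dev DB.
DB_URL = os.environ.get("DATABASE_URL", "postgresql://postgres@localhost/postgres")
DUMP_FILE = "dev.dump"


def dbdump():
    """Save the whole dev DB to a file using pg_dump's custom format."""
    subprocess.run(["pg_dump", "--format=custom", "--file", DUMP_FILE, DB_URL], check=True)


def dbrestore():
    """Reload a previously saved dump, dropping and recreating existing objects."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--dbname", DB_URL, DUMP_FILE],
        check=True,
    )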

seanh commented 5 years ago

I've pushed a proof-of-concept to a pre-closed PR, to help answer some of the questions raised about a script-based approach: https://github.com/hypothesis/h/pull/5698/

This only creates three users; it doesn't create all the users, groups, memberships, organizations, authclients, etc. that are needed. But I think it should be enough to demonstrate the concept.

Some notes on this POC:

seanh commented 5 years ago

If I understand the Docker volume-based solution correctly, it basically involves someone setting up their local DB with all the right data manually and then saving a Docker image of that data, which everyone else will then pull down. That sounds to me like it's not going to work well with existing custom data in developers' DBs (you would lose all your data each time you load or re-load the dev data), and also like updating the standard dev data set is going to be very awkward (how would you create a PR that adds one more auth client to the standard set? Once merged, how would the rest of the team insert the new auth client without losing existing custom data in their local DBs?). I think it would be equivalent to the existing https://github.com/hypothesis/devdata solution (which is based on SQL dumps): no one uses that because they can't insert the dev data into their DB without destroying their existing data, and updating the SQL dump to add to or modify the standard dev data is awkward.

I think the key properties of the script-based approach are that you can always just run make devdata without thinking, and it will create or update the standard dev data as necessary and never touch any custom data. This is necessary in order to get the team to actually use it, IMO. And updating the standard set of dev data is also easy: just change the Python script (or the JSON values in the private repo) and send a PR.
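
As a sketch of the private-repo part (the repo URL and JSON filename below are placeholders for illustration, not the actual private repo), the script could fetch the sensitive values like this:

import json
import subprocess
import tempfile
from pathlib import Path

# Placeholder URL -- the real private repo's name and location are not shown here.
PRIVATE_DEVDATA_REPO = "git@github.com:hypothesis/devdata-private.git"


def load_private_devdata():
    """Clone the private devdata repo to a temp dir and return its JSON values."""
    with tempfile.TemporaryDirectory() as tmpdir:
        subprocess.run(["git", "clone", "--depth=1", PRIVATE_DEVDATA_REPO, tmpdir], check=True)
        return json.loads((Path(tmpdir) / "devdata.json").read_text())

Because the clone lives in a TemporaryDirectory, it's cleaned up as soon as the function returns, so no secrets are left sitting in the working tree.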