Document how to create initial data for tables

fyliu commented 2 years ago

Overview

Since some of our database tables need to be pre-populated with data, we need to decide on a way to do it and document it.

Action Items

[x] Research ways to add initial data
[x] Decide on a way that makes sense for us
[x] Document it
[ ] remove dependency in #144 and move it into New Issue Review

Resources/Instructions

3 options for django model initial data set up is a pretty comprehensive guide to the ways to create initial data
django dumpdata script generates json from existing data
django data migrations for reference on how to write migration files
dumpscript from django-extensions generates python code from existing data
~~django-seed only generates random data, so it's not useful to us~~
exportjson.js blog post and script

In any solution, we need to be able to connect it to django's migration system or in a script so it can be auto-inserted on clean build.

Data-creation code directly inside the migration file. OR just use as standalone script.
1. From spreadsheet
  1. Use App Script to generate json file
  2. Use a script to generate the python code for adding the data
2. From existing table data
  1. Use dumpscript to generate the python code for adding the data
Load SQL script from django migration file Place the sql script in <app>/sql/<app>.sql
1. From spreadsheet
  1. Use App Script to generate json file
  2. Use a script to generate the SQL commands from json
2. From existing table data
  1. Use DBeaver to generate insert statements
Load fixture from django migration file Place the migration file in <app>/fixtures/<app>.json
1. From spreadsheet
  1. Use App Script to generate json file
  2. Use a script to rewrite json in django migration format
2. From existing table data
  1. Use dumpdata to generate json from data already in the db

fyliu commented 2 years ago

I've looked into options 1 and 3ii. They are both viable options and I think the fixture option is the better one since we can also use it for testing, although with the django_db decorator will automatically apply all the migrations, so all the options will work in that case. We don't have any non-django_db tests but we might eventually. So the fixture option is still preferable if other factors are equal.

In a call with @ExperimentsInHonesty, I learned about the possibility of generating json from G sheets using an app script. I later tried option 3i. It also works well. The advantage of it is that it may be a more direct path from the initial spreadsheet to the final migration file. The options that dumps data from existing tables have the extra step of first inserting the data into the system through the admin site. That's a potentially step to introduce user errors.

To compare the work involved between from spreadsheet and from table data

both
- same migrations file setup
from sheet
- more direct, one less step for potential copy/paste error
- there's a single copy/paste and save as json file
- need extra script to reshape json from spreadsheet to the proper format
from table data
- need to add data into the db first, from admin site, which is relatively user-friendly
- the amount of this manual work depends on how many rows need to be inserted. Most tables are very simple. SDG-related table rows can be a chore.
- need a script or manual work to remove automatic fields like uuid

Now that it's laid out, starting from sheets data has more opportunities for automation, so it feels more future-proof.

The next step will be to evaluate if the option 3i route I've done will be adequate for our needs.

problem: one thing I noticed is that not providing created_at and updated_at values causes an error for a non-nullable field. Making them nullable gets around this but then there's no timestamps in those rows.

fyliu commented 2 years ago

Another possibility

This one imports the data from a CSV file. The code is custom to each model but it can likely be made more generic.

fyliu commented 2 years ago

To answer the null issue from 2 comments ago, loading from fixture is meant to not activate the auto timestamp behavior defined in the model. The load from fixture functionality is meant to be for dumping and loading data that was already in the database at one point and not for loading initial data. See this django ticket for details about why they won't change this behavior.

So we now have to choose:

Generate the fixtures from spreadsheet and allow the timestamps fields to be null. It may not be a disadvantage to not have it for the pre-populated data.
Manually insert the data into django and use dumpdata like it's intended to be used. This will write the timestamps from the db.
Generate python code from spreadsheet so that it'll go through the normal route to adding data to the db and have the timestamps.

fyliu commented 2 years ago

This is documented here and linked from the wiki.

The strategy is to export the JSON from the spreadsheet, convert it into a python script that can insert the data, then create a migration file that will call the script when it's processed.

The advantages of using a python script as opposed to importing the data in JSON form using loaddata:

it will generate timestamps where as loaddata needs the JSON file to contain timestamps for each record
loaddata requires the data to be in the database and dumped using dumpdata
manually (via script) creating the JSON data dump is possible, but it won't have the necessary timestamps
Making timestamps optional will get around the requirement, but that's not good practice
manually (via script) adding timestamps is possible, but it would look weird to have some data that's way before the website is launched. The JSON file would need to be saved to revision control, which is also weird.

fyliu commented 2 years ago

I finished this and then realized there's a better way to do this. The initial data shouldn't be made into migrations if this project is supposed to be generic. They should be left as individual scripts to be run on initial database setup by us, and customized by other organizations for their needs.

fyliu commented 1 year ago

I have a draft PR #141 for this. It's a draft because it's dependent on #140 to be merged.

140
141

fyliu commented 1 year ago

I modified the PR to not depend on sphinx so we can move this forward.

fyliu commented 1 year ago

There's no model created that has a need for initial data. This makes it more difficult to review the database insertion part of #141. I will go implement #24.

fyliu commented 1 year ago

Looks like #24 has the same structure as #35 and we don't have a direction for working on #35 yet. So I can't work on #24.

The other thing I can do is pull the SOC_Major table issue out of the ice box, but I would rather not do that since it's going to be prioritized in the far future.

Blocker: I need to implement a table with initial data, and there's no good one to do. This is blocking on #35.

35

fyliu commented 10 months ago

This issue was implemented by PR #226

226

hackforla / peopledepot