hackoregon / backend-examplar-2018

an example dockerized geo-aware django backend
MIT License
4 stars 5 forks

What is the data loading procedure? #21

Closed adpeters closed 6 years ago

adpeters commented 6 years ago

One thing I don't see in this exemplar is any method of actually loading data from a source. Maybe I'm missing it, but what is our general approach to this?

Building off of last year's housing backend, my current method is to create a loader.py file that takes data from whatever sources you have and puts it into the database via the Django ORM. This file runs on startup of the api container, but you could also run it manually.

So I'm thinking development-docker-entrypoint.sh should have something like the following:

```shell
echo "Load data"
./manage.py shell --command="import data.loader"

echo "Create Postgres backup file"
pg_dump -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -p "$POSTGRES_PORT" "$POSTGRES_NAME" > "./Backups/$POSTGRES_NAME.sql"
```

What do people think of this process?
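For concreteness, here is a minimal sketch of what a data/loader.py in this pattern might look like. The CSV columns, the `parse_rows` helper, and the `Permit` model are all hypothetical, not from this repo; the actual ORM call is shown in comments since it only runs inside a configured Django project.

```python
# Hypothetical sketch of a data/loader.py: normalize a raw CSV source
# into keyword dicts, then hand them to the Django ORM.
import csv
import io


def parse_rows(csv_text):
    """Turn raw CSV text into a list of cleaned field dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"address": row["address"].strip(), "units": int(row["units"])}
        for row in reader
    ]


# Inside the Django shell (./manage.py shell --command="import data.loader"),
# the parsed rows would be loaded roughly like:
#
#   from api.models import Permit  # illustrative model name
#   Permit.objects.bulk_create(Permit(**fields) for fields in parse_rows(raw))
```

Keeping the parsing pure like this also makes it easy to unit-test the loader without a database.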

znmeb commented 6 years ago
  1. The db container has automatic restores - you put a .backup, .sql or .sql.gz backup into a directory when you build the image and it gets restored when the image first runs in a container. See https://github.com/hackoregon/data-science-pet-containers#using-automatic-database-restores for the details.
  2. In the deployed apps, we've got a formal process:
    • The team creates backup files.
    • The team data manager loads them to AWS - S3, IIRC.
    • The DevOps team restores them to the PostgreSQL server instance.
  3. To facilitate testing, there's an image in data-science-pet-containers that looks much like our AWS instance. See https://github.com/hackoregon/data-science-pet-containers#amazon
adpeters commented 6 years ago

@znmeb Okay, thanks, that makes sense, but where do the .backup or .sql files come from? They have to come from an existing database that has been populated with data, and I don't think an example of that process is included in this repo.

znmeb commented 6 years ago

@adpeters The database backups are artifacts that the teams make via iteration. For example, Transportation Systems started with two raw data sets - a CSV file of TriMet ridership data and a Microsoft Access .mdb file of ODOT crash data. The backups we're currently using were made with two custom ETL scripts I wrote; they're examples in the Data Science Pet Containers repository.

As we develop the stories, we'll probably modify these databases - add columns, for example - and may add new ones. This exemplar is a tool for facilitating that process.

adpeters commented 6 years ago

Okay, so do we want to include an example of this process in this repo, such as your example above, or my example using the Django ORM? Or at least an explanation of how it might be done?

Or do we want the data importing to be a completely separate process that this api backend does not have any say in?

znmeb commented 6 years ago

@adpeters If you're volunteering to write up your example, I won't stop you. ;-) I've got a fairly full documentation task list already.

BrianHGrant commented 6 years ago

The Django app will have read-only database credentials unless you have a very specific use case and have cleared it with DevOps.

Loading static JSON files would be different, and I did want to provide an example of this.

So the assumption should be: data loading is a separate process, and the app reads from read-only views.
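As a sketch of that separation, the load side could expose a view and grant the API role SELECT-only access. All names here (view, table, role) are illustrative, not from this repo:

```sql
-- The ETL/load process owns the raw tables; the API reads through a view.
CREATE VIEW crash_summary AS
    SELECT id, crash_date, severity
    FROM raw_crashes;

-- The API's non-superuser role gets SELECT-only access.
GRANT SELECT ON crash_summary TO transportation_api;
```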


znmeb commented 6 years ago

@BrianHGrant Currently, isn't the API connecting as postgres, even though the design is read-only with one non-superuser per team? I'm cleaning up that piece of the database container at the moment, so I want to make sure I'm getting it right.

BrianHGrant commented 6 years ago

Yeah, it may be; I did a pretty quick pass on the database config side. Ideally, though, the example would connect as a read-only user. On the API side, we are using read-only views and non-managed models.
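A non-managed model over such a view might look like this (a sketch only, assuming a configured Django project; the model, field, and table names are hypothetical):

```python
from django.db import models


class CrashSummary(models.Model):
    """Read-only model over a Postgres view; Django never writes to it."""
    crash_date = models.DateField()
    severity = models.IntegerField()

    class Meta:
        managed = False              # makemigrations will not create/alter it
        db_table = "crash_summary"   # points at the view, not a real table
```

With `managed = False`, migrations skip the table entirely, so the schema stays owned by the data-loading side.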


znmeb commented 6 years ago

OK - we'll just have to document the setup and let the teams build on it. We can catch non-working scenarios later.

adpeters commented 6 years ago

Okay, I want to be clear on what our preferred/intended method for data importing is. Do we want it to be part of a project's "api" repository, or should it come from a separate repository, with the api repository just getting a Postgres .sql file? This is all in development, of course; staging/production will always just get the .sql restore file.

It looks to me like transportation is doing it the latter way, so somebody looking in their api repo would not see where the data comes from, whereas last year's housing team did the former and has the data import included in the api repo.

Could we run into issues if we are declaring a database schema through the Django ORM and expecting data read out of the database to fit the Django models, while also creating our own schema via SQL for the data import? Couldn't the two different schema definitions lead to inconsistencies?

znmeb commented 6 years ago

That's a good question - Transportation Systems started with two datasets - a CSV file and a Microsoft Access database. I imported them to PostGIS, made backups, and documented the processing required. I'm not sure what the other teams are doing overall, but I know Disaster Resilience is building a PostGIS database and I've restore-tested their backups.

However we do things, they need to be documented and reproducible / repeatable, so one way or another we'd get inconsistencies ironed out in the final product.

BrianHGrant commented 6 years ago

As different teams will have different data loading procedures, it may make sense for some teams to include this within their backend repos.

To keep this example app fairly clean and easy to understand for basic needs, while allowing further collaboration around common tasks, I would recommend we look at something like a backend-extensions repo.

We could then separate these reproducible tasks into separate Django apps, with instructions on how to incorporate them and a requirements file outlining dependencies.

Eventually this provides a framework for turning these into Python/PyPI modules, though getting to that point is a bit out of scope at this point in the project season.


znmeb commented 6 years ago

If we've answered the question, can we close this? We've got quite a few "discussion issues" on this repo that might be better handled in Slack or in team meetings. ;-)

bhgrant8 commented 6 years ago

In summary: going to close this ticket as out of scope for this particular repo.