hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Research solution for uploading and distributing raw data #2

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 6 years ago

Project teams are obtaining raw data from public organizations and don't have a central means of storing and distributing these files to project members.

The DevOps team has been asked to create a solution that provides a central storage and distribution location for all five current project teams.

Assume that:

znmeb commented 6 years ago

Another assumption - there will be mixed human and automated access required. For example, the raw data Transportation Systems has is in Google Drive, but when we stand up a PostGIS container it will need to download a database dump file without a person having to do anything.
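
A minimal sketch of what that unattended step might look like, assuming an illustrative dump URL, database name, and a Debian-based PostGIS container (none of these names reflect the team's actual setup):

# Hypothetical container startup snippet: fetch a dump with no human in the loop, then restore it.
# DUMP_URL, database name, and user are placeholders.
set -euo pipefail
DUMP_URL="https://example-bucket.s3.amazonaws.com/transportation/latest.backup"
wget -q -O /tmp/latest.backup "$DUMP_URL"
pg_restore --clean --if-exists -U postgres -d transportation /tmp/latest.backup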

MikeTheCanuck commented 6 years ago

@znmeb tell us more about this - this sounds like a unique scenario for the transportation-systems team. Perhaps this is something we'll have to tackle once we get the basic scenario here addressed.

znmeb commented 6 years ago

It may be unique now but I expect everything will be integrated once we know more about the data. We now have raw data samples for all the inputs we're expecting, and I think there are at most two quarterly updates coming.

Our plan is that everything will be in PostgreSQL / PostGIS databases when it's deployed, and the most convenient way to manage those is with database dumps and restore processes.

The current workflow for API development is:

  1. Someone gives us a dataset in some format
  2. A team member (often myself) imports it into PostgreSQL / PostGIS, creates a database dump and uploads the dump to Google Drive.
  3. An API developer stands up a Docker Compose local environment with a GeoDjango container and a PostGIS container and restores the database dump into PostGIS (a minimal sketch of steps 2 and 3 follows this list).
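
A minimal sketch of steps 2 and 3, assuming an illustrative database name and a compose service that exposes PostgreSQL on localhost:5432 (the real service names and credentials will differ):

# Step 2: export the imported database as a custom-format dump (names are placeholders)
pg_dump -Fc -U postgres -d transportation -f transportation.backup

# Step 3: bring up the local stack and restore the dump into the PostGIS container
docker-compose up -d
pg_restore --create --clean --if-exists -h localhost -p 5432 -U postgres -d postgres transportation.backup
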
MikeTheCanuck commented 6 years ago

OK, thanks for the additional details @znmeb.

The process for getting data into databases is being discussed in #3.

iant01 commented 6 years ago

Will try to have a sample demo or upload examples of access rules and lifecycle rules using tag values. Here is a Word doc to review and discuss at Thursday's meeting.

Archive Proposals-1.docx

znmeb commented 6 years ago

I have a largish (10 - 20 GB) test case you can use ;-) It needs to be available to developers / data scientists now and for deployment possibly in the future, although I think the deployed dataset will be smaller.

iant01 commented 6 years ago

Example of lifecycle rules. Lifecycle examples.docx
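
Since the .docx is not viewable inline, here is a rough sketch of the kind of tag-driven lifecycle rule under discussion; the bucket name, tag, and transition window are illustrative, and the authoritative rules are in the attached document:

# Illustrative only: transition objects tagged archive=true to Glacier after 30 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-tagged-objects",
      "Filter": { "Tag": { "Key": "archive", "Value": "true" } },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-hacko-data \
  --lifecycle-configuration file://lifecycle.json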

iant01 commented 6 years ago

Plan S3 Archive (Phase One):

Plan S3 Archive (Phase Two):

znmeb commented 6 years ago

@iant01 Does Centrl Office have better upload bandwidth than the puny 6 megabits per second Comcast deigns to grant me? I have a 6 GB upload and it takes hours at my home speed.

DingoEatingFuzz commented 6 years ago

@znmeb certainly better than 6mbps

[image: screenshot of the office connection speed]

znmeb commented 6 years ago

Holy cow! That must cost them a fortune!

bhgrant8 commented 6 years ago

@MikeTheCanuck, @iant01, @DingoEatingFuzz set up an IAM profile for me last evening. I have uploaded data both through the AWS console and the AWS CLI. Our 4.2 GB upload did time out in the console, so I definitely agree that when approaching even 4 GB you will want to go through the CLI.

A few team members have verified they could download via direct link. @znmeb has tested programmatically through wget as well, which is how our local environment has been set up to pull data from another source. So it looks like this should work for transportation-systems' needs. Thanks for the great work!
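
For reference, a hedged example of the CLI path that avoids the console timeout; the bucket, prefix, and file names are placeholders:

# The AWS CLI automatically switches to multipart upload for large files
aws s3 cp ridership.backup s3://example-hacko-data/transportation-systems/ridership.backup

# Team members can then pull the object via its direct link, e.g. with wget
wget https://example-hacko-data.s3.amazonaws.com/transportation-systems/ridership.backup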

znmeb commented 6 years ago

P.S.: I can add awscli to the Docker images easily - it turns out wget isn't installed by default either. ;-)
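
A minimal sketch of adding those tools during an image build, assuming a Debian/Ubuntu base image with pip available:

# Run during the image build (e.g. inside a Dockerfile RUN step); package names assume Debian/Ubuntu
apt-get update && apt-get install -y --no-install-recommends wget
pip install --no-cache-dir awscli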

bhgrant8 commented 6 years ago

So to follow up on the above, it appears that only .csv files are given read access for distribution to team members.

Is it possible to add some additional extensions (a sketch of the kind of policy change this implies follows the list):

  .mdb - for Access-based databases
  .backup - we have created basic pg_dumps of the data, which are currently downloaded manually from our drive but then loaded into our local dev db containers programmatically. As per discussion with @DingoEatingFuzz, we hope to move towards a "single command" type environment to boot up for local dev. Having folks start with an initial db backup, rather than recreating the full ETL steps to go from raw data to an initial database, would be the preferred method for transportation-systems. This initial dump would not need to be deployed to a live db or even archived.
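
One way such per-extension read access might be expressed is a bucket policy keyed on object suffix; this is only a sketch with an illustrative bucket name, not the actual DevOps configuration:

# Illustrative bucket policy granting read access for .csv, .mdb, and .backup objects
cat > read-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadForSelectedExtensions",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::example-hacko-data/*.csv",
        "arn:aws:s3:::example-hacko-data/*.mdb",
        "arn:aws:s3:::example-hacko-data/*.backup"
      ]
    }
  ]
}
EOF
aws s3api put-bucket-policy --bucket example-hacko-data --policy file://read-policy.json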

znmeb commented 6 years ago

The best test case we have for this currently is https://github.com/hackoregon/transportation-systems-ridership-analysis

  1. It follows the Cookiecutter Data Science conventions - raw data and interim data can be synced with S3 and are not maintained in Git. In this case the backups are considered interim data.
  2. Everything is in code and can be driven by a Makefile.

The Cookiecutter convention specifies a data directory with sub-directories data/raw and data/interim that are synced via aws s3 using commands of the form:

aws s3 sync s3://$(BUCKET)/data/ data/ # download
aws s3 sync data/ s3://$(BUCKET)/data/ # upload

So the workflow for a new person coming in is:

  1. Clone the repo.
  2. Install awscli and get the credentials.
  3. Install make.
  4. Do make sync_data_from_s3.

When they make a new interim or result dataset, they type make sync_data_to_s3.
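
For context, the sync targets referenced above are roughly what the Cookiecutter Data Science template generates; the BUCKET value here is a placeholder:

# Sketch of the relevant Makefile targets; recipes must be tab-indented, BUCKET is a placeholder
BUCKET = example-hacko-data/ridership-analysis

sync_data_from_s3:
	aws s3 sync s3://$(BUCKET)/data/ data/

sync_data_to_s3:
	aws s3 sync data/ s3://$(BUCKET)/data/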

Once we get this one all integrated I'm planning to do the same for the crash data. That one's a bit more complicated because we have a lot more people currently working with the dataset, so there's more risk of impacting people's workflows. And the raw data is an Access .mdb file.

If at all possible I want to get the ridership workflow closed out this week - we have new data scientists coming in next week to onboard.

bhgrant8 commented 6 years ago

Hey, thanks, it looks like we can pull the .backup and .mdb files. This should be enough for the moment. Going to add the Slack convo from the transportation-systems channel for some additional context:

brian_g [4:16 PM] @znmeb so the conversation w/ devops has been that a) only data managers will have write access to our s3 buckets, and that b) read access for the rest of the team would be through the direct links. so in reference to your question on the github issue, and the fact that working with access files might cause some issues for folks, could we get by with users starting from the backup files for most of the team vs. the raw data? and then use makefiles/sql scripts locally to keep us in sync from that point w/o syncing back and forth to aws in the interim.

znmeb [5:31 PM] We can do whatever makes sense - the only thing dictating syncing intermediate data up to S3 is the Cookiecutter philosophy. :wink: Incidentally, there is a utility called rclone that can sync to just about every public cloud out there - Dropbox, Microsoft OneDrive, Google Drive, etc. If we want to do cloud data sharing the only issue is cost / credentials. The Cookiecutter philosophy makes a lot of sense to me though. Collaborators all have the same data to work with on their laptops / workstations. And of course I have no clue what the other teams' data sizes are - S3 may not be viable for the Disaster Resilience team from what I've seen.

brian_g [7:54 PM] yeah the back and forth transfer could be a concern, though I am not sure of what the costs were last year when we were developing against a shared database on an ec2 instance.

the other thing is that versioning and the like is not the same in s3 as it is in github. people syncing a bunch of changes and writing new versions or writing over each other could be a bit of a headache if we need to roll back. alternatively it would quickly raise costs if we are all sharing different versions of interim data. honestly, i also doubt we are going to be able to gain much traction requesting to open write credentials to a large number of people in a volunteer project.

my train of thought was if we all start with the common dump of the data, then modifications to this original file could be done through sql files/makefiles in conjunction with data science. these change files would be relatively small and could be checked into git/approved and run on each person's machine. we could then upload some benchmarked versions of the db to s3, which could be downloaded by anyone on the team through a direct link. this would allow for some review process along with some ability to re-sync.

as you said the question is cost/credentials combined with a question of the best way to collaborate around data.

znmeb [8:31 PM] Let me take another look at dat - that was billed as "Git for data". For now, just declaring the two .backup files we have as raw data and putting them in S3 for all to download should be enough. I've never used Cookiecutter before so I don't have a lot invested in it like I do PostGIS and Docker. https://datproject.org/ - we had somebody from the dat project working with us last year but nothing got built.

Michael Lange [8:35 PM] dat is pretty awesome, but I know that not every government-associated entity is cool with p2p tech yet. I used dat to transfer 80 GB+ from my machine to GCP and it went without a hitch.

znmeb [8:51 PM] I was thinking just within Hack Oregon for collaboration, not between Hack Oregon and the agencies. But somebody would need to manage the server - isn't there a "tracker" like for BitTorrent? One way or another you've got a box on AWS holding the "ground truth".

DingoEatingFuzz commented 6 years ago

I extracted @iant01's extended archival plan to a new issue #49. With that, I believe this issue is done.