hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Research solution for uploading and distributing raw data #2

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 6 years ago

Project teams are obtaining raw data from public organizations and don't have a central means of storing and distributing these files to project members.

The DevOps team has been asked to create a solution that provides a central storage and distribution location for all five current project teams.

Assume that:

znmeb commented 6 years ago

Another assumption - there will be mixed human and automated access required. For example, the raw data Transportation Systems has is in Google Drive, but when we stand up a PostGIS container it will need to download a database dump file without a person having to do anything.
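
A minimal sketch of what that unattended step might look like, assuming an illustrative dump URL, database name, and a Debian-based PostGIS container (none of these names reflect the team's actual setup):

# Hypothetical container startup snippet: fetch a dump with no human in the loop, then restore it.
# DUMP_URL, database name, and user are placeholders.
set -euo pipefail
DUMP_URL="https://example-bucket.s3.amazonaws.com/transportation/latest.backup"
wget -q -O /tmp/latest.backup "$DUMP_URL"
pg_restore --clean --if-exists -U postgres -d transportation /tmp/latest.backup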

MikeTheCanuck commented 6 years ago

@znmeb tell us more about this - this sounds like a unique scenario for the transportation-systems team. Perhaps this is something we'll have to tackle once we get the basic scenario here addressed.

znmeb commented 6 years ago

It may be unique now but I expect everything will be integrated once we know more about the data. We now have raw data samples for all the inputs we're expecting, and I think there are at most two quarterly updates coming.

Our plan is that everything will be in PostgreSQL / PostGIS databases when it's deployed, and the most convenient way to manage those is with database dumps and restore processes.

The current workflow for API development is:

  1. Someone gives us a dataset in some format
  2. A team member (often myself) imports it into PostgreSQL / PostGIS, creates a database dump and uploads the dump to Google Drive.
  3. An API developer stands up a Docker Compose local environment with a GeoDjango container and a PostGIS container and restores the database dump into PostGIS (a minimal sketch of steps 2 and 3 follows this list).
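
A minimal sketch of steps 2 and 3, assuming an illustrative database name and a compose service that exposes PostgreSQL on localhost:5432 (the real service names and credentials will differ):

# Step 2: export the imported database as a custom-format dump (names are placeholders)
pg_dump -Fc -U postgres -d transportation -f transportation.backup

# Step 3: bring up the local stack and restore the dump into the PostGIS container
docker-compose up -d
pg_restore --create --clean --if-exists -h localhost -p 5432 -U postgres -d postgres transportation.backup
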
MikeTheCanuck commented 6 years ago

OK, thanks for the additional details @znmeb.

The process for getting data into databases is being discussed in #3.

iant01 commented 6 years ago

Will try to have a sample demo or upload examples of access rules and lifecycle rules using tag values. Here is a Word doc to review and discuss at Thursday's meeting.

Archive Proposals-1.docx

znmeb commented 6 years ago

I have a largish (10 - 20 GB) test case you can use ;-) It needs to be available to developers / data scientists now and for deployment possibly in the future, although I think the deployed dataset will be smaller.

iant01 commented 6 years ago

Example of lifecycle rules. Lifecycle examples.docx
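
Since the .docx is not viewable inline, here is a rough sketch of the kind of tag-driven lifecycle rule under discussion; the bucket name, tag, and transition window are illustrative, and the authoritative rules are in the attached document:

# Illustrative only: transition objects tagged archive=true to Glacier after 30 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-tagged-objects",
      "Filter": { "Tag": { "Key": "archive", "Value": "true" } },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-hacko-data \
  --lifecycle-configuration file://lifecycle.json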

iant01 commented 6 years ago

Plan S3 Archive (Phase One):

Plan S3 Archive (Phase Two):

znmeb commented 6 years ago

@iant01 Does Centrl Office have better upload bandwidth than the puny 6 megabits per second Comcast deigns to grant me? I have a 6 GB upload and it takes hours at my home speed.

DingoEatingFuzz commented 6 years ago

@znmeb certainly better than 6mbps

[image: screenshot of the office connection speed]

znmeb commented 6 years ago

Holy cow! That must cost them a fortune!

bhgrant8 commented 6 years ago

@MikeTheCanuck, @iant01, @DingoEatingFuzz set up an IAM profile for me last evening. I have uploaded data both through the AWS console and the AWS CLI. Our 4.2 GB upload did time out in the console, so I definitely agree that when approaching even 4 GB you will want to go through the CLI.

A few team members have verified they could download via direct link. @znmeb has tested programmatically through wget as well, which is how our local environment has been set up to pull data from another source. So it looks like this should work for transportation-systems' needs. Thanks for the great work!
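
For reference, a hedged example of the CLI path that avoids the console timeout; the bucket, prefix, and file names are placeholders:

# The AWS CLI automatically switches to multipart upload for large files
aws s3 cp ridership.backup s3://example-hacko-data/transportation-systems/ridership.backup

# Team members can then pull the object via its direct link, e.g. with wget
wget https://example-hacko-data.s3.amazonaws.com/transportation-systems/ridership.backup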

znmeb commented 6 years ago

P.S.: I can add awscli to the Docker images easily - it turns out wget isn't installed by default either. ;-)
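
A minimal sketch of adding those tools during an image build, assuming a Debian/Ubuntu base image with pip available:

# Run during the image build (e.g. inside a Dockerfile RUN step); package names assume Debian/Ubuntu
apt-get update && apt-get install -y --no-install-recommends wget
pip install --no-cache-dir awscli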

bhgrant8 commented 6 years ago

So to follow up on the above, it appears that only .csv files are given read access for distribution to team members.

Is it possible to add some additional extensions (a sketch of the kind of policy change this implies follows the list):

  .mdb - for Access-based databases
  .backup - we have created basic pg_dumps of the data, which are currently downloaded manually from our drive but then loaded into our local dev db containers programmatically. As per discussion with @DingoEatingFuzz, we hope to move towards a "single command" type environment to boot up for local dev. Having folks start with an initial db backup, rather than recreating the full ETL steps to go from raw data to an initial database, would be the preferred method for transportation-systems. This initial dump would not need to be deployed to a live db or even archived.
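
One way such per-extension read access might be expressed is a bucket policy keyed on object suffix; this is only a sketch with an illustrative bucket name, not the actual DevOps configuration:

# Illustrative bucket policy granting read access for .csv, .mdb, and .backup objects
cat > read-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadForSelectedExtensions",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::example-hacko-data/*.csv",
        "arn:aws:s3:::example-hacko-data/*.mdb",
        "arn:aws:s3:::example-hacko-data/*.backup"
      ]
    }
  ]
}
EOF
aws s3api put-bucket-policy --bucket example-hacko-data --policy file://read-policy.json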

znmeb commented 6 years ago

The best test case we have for this currently is https://github.com/hackoregon/transportation-systems-ridership-analysis

  1. It follows the Cookiecutter Data Science conventions - raw data and interim data can be synced with S3 and are not maintained in Git. In this case the backups are considered interim data.
  2. Everything is in code and can be driven by a Makefile.

The Cookiecutter convention specifies a data directory with sub-directories data/raw and data/interim that are synced via aws s3 using commands of the form:

aws s3 sync s3://$(BUCKET)/data/ data/ # download
aws s3 sync data/ s3://$(BUCKET)/data/ # upload

So the workflow for a new person coming in is:

  1. Clone the repo.
  2. Install awscli and get the credentials.
  3. Install make.
  4. Do make sync_data_from_s3.

When they make a new interim or result dataset, they type make sync_data_to_s3.
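
For context, the sync targets referenced above are roughly what the Cookiecutter Data Science template generates; the BUCKET value here is a placeholder:

# Sketch of the relevant Makefile targets; recipes must be tab-indented, BUCKET is a placeholder
BUCKET = example-hacko-data/ridership-analysis

sync_data_from_s3:
	aws s3 sync s3://$(BUCKET)/data/ data/

sync_data_to_s3:
	aws s3 sync data/ s3://$(BUCKET)/data/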

Once we get this one all integrated I'm planning to do the same for the crash data. That one's a bit more complicated because we have a lot more people currently working with the dataset, so there's more risk of impacting people's workflows. And the raw data is an Access .mdb file.

If at all possible I want to get the ridership workflow closed out this week - we have new data scientists coming in next week to onboard.

bhgrant8 commented 6 years ago

Hey, thanks, it looks like we can pull the .backup and .mdb files. This should be enough for the moment. Going to add the Slack convo from the transportation-systems channel for some additional context:

brian_g [4:16 PM] @znmeb so the conversation w/ devops has been that a) only data managers will have write access to our s3 buckets, and that b) read access for the rest of the team would be through the direct links. so in reference to your question on the github issue, and the fact that working with access files might cause some issues for folks, could we get by with users starting from the backup files for most of the team vs. the raw data? and then use makefiles/sql scripts locally to keep us in sync from that point w/o syncing back and forth to aws in the interim.

znmeb [5:31 PM] We can do whatever makes sense - the only thing dictating syncing intermediate data up to S3 is the Cookiecutter philosophy. :wink: Incidentally, there is a utility called rclone that can sync to just about every public cloud out there - Dropbox, Microsoft OneDrive, Google Drive, etc. If we want to do cloud data sharing the only issue is cost / credentials. The Cookiecutter philosophy makes a lot of sense to me though. Collaborators all have the same data to work with on their laptops / workstations. And of course I have no clue what the other teams' data sizes are - S3 may not be viable for the Disaster Resilience team from what I've seen.

brian_g [7:54 PM] yeah the back and forth transfer could be a concern, though I am not sure of what the costs were last year when we were developing against a shared database on an ec2 instance.

the other thing is that versioning and the like is not the same in s3 as it is in github. people syncing a bunch of changes and writing new versions or writing over each other could be a bit of a headache if we need to roll back. alternatively it would quickly raise costs if we are all sharing different versions of interim data. honestly, i also doubt we are going to be able to gain much traction requesting to open write credentials to a large number of people in a volunteer project.

my train of thought was if we all start with the common dump of the data, then modifications to this original file could be done through sql files/makefiles in conjunction with data science. these change files would be relatively small and could be checked into git/approved and run on each person's machine. we could then upload some benchmarked versions of the db to s3, which could be downloaded by anyone on the team through a direct link. this would allow for some review process along with some ability to re-sync.

as you said the question is cost/credentials combined with a question of the best way to collaborate around data.

znmeb [8:31 PM] Let me take another look at dat - that was billed as "Git for data". For now, just declaring the two .backup files we have as raw data and putting them in S3 for all to download should be enough. I've never used Cookiecutter before so I don't have a lot invested in it like I do PostGIS and Docker. https://datproject.org/ - we had somebody from the dat project working with us last year but nothing got built.

Michael Lange [8:35 PM] dat is pretty awesome, but I know that not every government-associated entity is cool with p2p tech yet. I used dat to transfer 80 GB+ from my machine to GCP and it went without a hitch.

znmeb [8:51 PM] I was thinking just within Hack Oregon for collaboration, not between Hack Oregon and the agencies. But somebody would need to manage the server - isn't there a "tracker" like for BitTorrent? One way or another you've got a box on AWS holding the "ground truth".

DingoEatingFuzz commented 6 years ago

I extracted @iant01's extended archival plan to a new issue #49. With that, I believe this issue is done.