hackoregon / transportation-systems

Hack Oregon Repository to develop code for the Transportation Theme of CIVIC.
MIT License

Create Data Science Environment #18

Closed · bhgrant8 closed this issue 4 years ago

bhgrant8 commented 6 years ago
  1. ODOT
  2. Trimet Ridership
  3. Trimet Congestion
znmeb commented 6 years ago

Here's what I'm proposing:

Two services:

  1. odot_crash_data - will contain the ODOT crash data.
  2. passenger_census - will contain the ridership data; the name passenger_census comes from the CSV file we received.

Container port numbers, their host mappings, and Postgres user passwords will be set from a local .env file.
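
For example, the .env file could look something like this (a minimal sketch; the variable names and values are placeholders, not settled choices):

# .env - kept local, never committed
ODOT_CRASH_DATA_PORT=5433
PASSENGER_CENSUS_PORT=5434
POSTGRES_USER=transportation
POSTGRES_PASSWORD=changeme

Whatever runs the containers (docker-compose, presumably) would then map each service's internal 5432 to the host port given above.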

We need to define a mechanism for the Dockerfiles to acquire the input database dump files without the user having to download them. In other words, I want to be able to do a wget or curl in the Dockerfile that runs at image build time, rather than doing it with a Dockerfile COPY. This is something we have to get nailed down for DevOps / deployment anyway, so we might as well solve it this week. ;-) See https://github.com/hackoregon/civic-devops/issues/3.
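
As a rough sketch of what I mean (the URL and filename are placeholders, and curl may need to be installed in the base image first):

# Dockerfile: fetch the dump at image build time instead of COPYing it in;
# the official postgres image loads anything dropped into /docker-entrypoint-initdb.d/
RUN curl -fsSL -o /docker-entrypoint-initdb.d/odot_crash_data.sql.gz \
    https://example.com/dumps/odot_crash_data.sql.gz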

BrianHGrant commented 6 years ago

I'll get some data on my personal dev S3 account and set up a billing alert, and we can play around a little.

If we can get a proof of concept and a cost estimate, I would imagine adoption would be pretty quick. This should be a priority in my mind, because then we ensure we are all working from the same data and save the manual hours spent updating it.

znmeb commented 6 years ago

OK ... how does S3 authentication work? Is it like everything else (a PEM key, SSH keys, that sort of thing)?

bhgrant8 commented 6 years ago

An access key and a secret key.

You will need to add the AWS CLI client to your Dockerfile:

RUN pip install --upgrade --user awscli

We did something similar to pull our secrets last year:

https://github.com/hackoregon/backend-service-pattern/blob/master/bin/getconfig.sh

Which was called in the entrypoint file:

https://github.com/hackoregon/backend-service-pattern/blob/master/bin/docker-entrypoint.sh
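
In rough outline, the same approach here would be a one-liner in the entrypoint (a sketch only; the bucket name, object key, and target path are made up, and the keys are assumed to arrive as the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables, which the CLI picks up automatically):

# docker-entrypoint.sh (excerpt): pull the dump from S3 before starting the service
aws s3 cp s3://example-bucket/odot_crash_data.sql.gz /tmp/odot_crash_data.sql.gz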

znmeb commented 6 years ago

Yeah - syncing with S3 is built into cookiecutter's data science template

bhgrant8 commented 6 years ago

OK, so I went ahead and set up the following access policy (actual bucket name is redacted):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "<ACTUAL ARN>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "<ACTUAL ARN>"
            ]
        }
    ]
}

I then attached this policy to an IAM group and created a user within it. I will provide the creds through Slack.
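
For reference, the same setup with the AWS CLI would be roughly the following (the group, user, and policy names are made up):

aws iam create-group --group-name transportation-data-readers
aws iam put-group-policy --group-name transportation-data-readers \
    --policy-name transportation-read-only --policy-document file://policy.json
aws iam create-user --user-name transportation-data-bot
aws iam add-user-to-group --group-name transportation-data-readers --user-name transportation-data-bot
aws iam create-access-key --user-name transportation-data-bot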

The creds will work for either a Docker or a cookiecutter setup, as you wish. It looks like cookiecutter is using the sync command from the CLI:

https://github.com/drivendata/cookiecutter-data-science/blob/master/%7B%7B%20cookiecutter.repo_name%20%7D%7D/Makefile#L47

It looks like we may need to name the folder within the bucket "data"?
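
If I'm reading that Makefile right, the sync targets boil down to something like this (BUCKET would be our bucket name), which is why both the local folder and the folder in the bucket end up being called "data":

# roughly what the cookiecutter sync targets run
aws s3 sync data/ s3://$BUCKET/data/    # push local data up
aws s3 sync s3://$BUCKET/data/ data/    # pull data down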

znmeb commented 6 years ago

I'm hacking away on this in https://github.com/hackoregon/data-science-pet-containers. It's just about where I want it, so I'm planning a "formal release" later this week.

I'm testing a utility called rclone (https://rclone.org/) for the cloud syncing. It's available in all the Linux distros, including Debian. It seems to be well maintained and will sync just about anywhere, not just S3. But IMHO it is not suitable for deployment, just for desktops. It's interactive and its secrets management scheme would probably rule out its use even in self-managed servers.
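
The workflow I'm testing looks roughly like this (the remote name "s3remote" is just whatever you pick during the interactive config):

# one-time interactive setup; stores the keys in rclone's config file
rclone config
# then pull the bucket's data folder down into the local data directory
rclone sync s3remote:example-bucket/data ./data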

znmeb commented 6 years ago

I put this on the back burner for the Tech Challenge but I'm back on it. I just have one major documentation task and another example scenario to do.