Here's what I'm proposing:

Two services:

- `odot_crash_data` - will contain the ODOT crash data.
- `passenger_census` - will contain the ridership data; the name `passenger_census` comes from the CSV file we received.

Container port numbers, their host mappings, and Postgres user passwords will be set from a local `.env` file.
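As a rough sketch of what that could look like in `docker-compose.yml` (the build paths and variable names here are placeholders, not a final layout; Compose reads `.env` from the project directory automatically):

```yaml
# Sketch only - service layout and variable names are assumptions.
version: '3'

services:
  odot_crash_data:
    build: ./odot_crash_data          # Dockerfile for the crash-data Postgres image
    ports:
      - "${ODOT_CRASH_DATA_PORT}:5432"
    environment:
      POSTGRES_PASSWORD: "${ODOT_CRASH_DATA_PASSWORD}"

  passenger_census:
    build: ./passenger_census         # Dockerfile for the ridership Postgres image
    ports:
      - "${PASSENGER_CENSUS_PORT}:5432"
    environment:
      POSTGRES_PASSWORD: "${PASSENGER_CENSUS_PASSWORD}"
```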
We need to define a mechanism for the Dockerfiles to acquire the input database dump files without the user having to download them. In other words, I want to be able to do a `wget` or `curl` in the Dockerfile that runs at image build time, rather than doing it with a Dockerfile COPY. This is something we have to get nailed down for DevOps / deployment anyway, so we might as well solve it this week. ;-) See https://github.com/hackoregon/civic-devops/issues/3.
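For example, something along these lines, where the base image tag and the download URL are placeholders until we settle where the dumps actually live:

```dockerfile
# Sketch: fetch the dump at image build time instead of COPYing it in.
FROM postgres:9.6

RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Anything dropped into /docker-entrypoint-initdb.d/ is loaded by the official
# postgres image the first time the container starts.
RUN curl -fSL -o /docker-entrypoint-initdb.d/odot_crash_data.sql \
    https://example.com/path/to/odot_crash_data.sql
```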
I'll get some data on my personal dev S3 account and set up a billing alert, and we can play around a little.
If we can get a proof of concept and a cost estimate, I'd imagine adoption would be pretty quick. This should be a priority in my mind, because then we ensure we're all working from the same data and save the manual hours spent updating it.
OK ... how does S3 authentication work? Is it like everything else (a PEM key, SSH stuff)?
An access key ID and a secret access key.
You will need to add the AWS CLI client to your Dockerfile:
RUN pip install --upgrade --user awscli
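For a Docker setup, the simplest way to hand those keys to the CLI is through the standard AWS environment variables, which fits the `.env` approach proposed above. Something like this (values are placeholders; never commit the real ones):

```
# Additions to the local .env file - the AWS CLI reads these automatically
# when they are passed into the container as environment variables.
AWS_ACCESS_KEY_ID=<your access key id>
AWS_SECRET_ACCESS_KEY=<your secret access key>
AWS_DEFAULT_REGION=<bucket region, e.g. us-west-2>
```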
We did something similar to pull our secrets last year:
https://github.com/hackoregon/backend-service-pattern/blob/master/bin/getconfig.sh
Which was called in the entrypoint file:
https://github.com/hackoregon/backend-service-pattern/blob/master/bin/docker-entrypoint.sh
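The same pattern could pull the input dumps instead of secrets. A rough sketch of such an entrypoint wrapper (bucket and file names are placeholders, and I'm not claiming this is what getconfig.sh does verbatim):

```sh
#!/bin/bash
# Sketch: pull the input dump from S3 at container start using the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY provided as environment variables,
# then hand off to the official postgres entrypoint.
set -e

aws s3 cp "s3://${S3_BUCKET}/data/odot_crash_data.dump" /tmp/odot_crash_data.dump

exec docker-entrypoint.sh "$@"
```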
Yeah - syncing with S3 is built into cookiecutter's data science template
OK, so I went ahead and set up the following access policy (the actual bucket name is redacted):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "<ACTUAL ARN>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "<ACTUAL ARN>"
            ]
        }
    ]
}
I then attached this policy to an IAM group and created a user within it. Will provide creds through Slack.
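Once you have the creds, a quick sanity check that the policy behaves as intended could look like this (the profile name is arbitrary and the bucket name is redacted, as above):

```sh
aws configure --profile hackoregon-transport   # paste the access key ID and secret key when prompted
aws s3 ls s3://<bucket-name>/ --profile hackoregon-transport   # exercises s3:ListBucket
```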
The creds will work for either a Docker or cookiecutter setup, as you wish. It looks like cookiecutter is using the sync command from the CLI, and it looks like we may need to name the folder within the bucket "data"?
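For reference, my read of the cookiecutter data science template is that its data target boils down to an `aws s3 sync` between the bucket's `data/` prefix and the local `data/` directory, roughly like this (paraphrased, not the verbatim Makefile target):

```sh
# Pull the shared data sets down; reverse the arguments to push local changes up.
aws s3 sync "s3://<bucket-name>/data/" data/ --profile hackoregon-transport
```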
I'm hacking away on this in https://github.com/hackoregon/data-science-pet-containers. It's just about where I want it, so I'm planning a "formal release" later this week.
I'm testing a utility called rclone (https://rclone.org/) for the cloud syncing. It's available in all the Linux distros, including Debian. It seems to be well maintained and will sync just about anywhere, not just S3. But IMHO it is not suitable for deployment, just for desktops: it's interactive, and its secrets management scheme would probably rule out its use even on self-managed servers.
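For anyone who wants to try it on a desktop, the workflow is roughly as follows; the remote name and bucket path are placeholders:

```sh
rclone config                                   # interactive one-time setup of an S3 remote (access/secret key)
rclone sync s3remote:<bucket-name>/data ./data  # mirror the bucket's data/ prefix locally
```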
I put this on the back burner for the Tech Challenge but I'm back on it. I just have one major documentation task and another example scenario to do.
[ ] Create a Postgres Docker Container
[ ] AWS S3 integration
[ ] Document and Define Data Science Stack/Repos - There are 5 different data science related repos for our team right now. I am not sure what they do or what their intent is.
[ ] Serve our data sets: