ncvo / charity-commission-extract

Python utilities for handling the import of data from the Charity Commission data extract
43 stars 21 forks source link

Linked Docker containers for this data? #3

Open psychemedia opened 7 years ago

psychemedia commented 7 years ago

Thanks for publishing these scripts. I was wondering if you also haveDockerfiles and docker-compose scripts that can build eg a MySQL database containing the data in one container, and then demonstrating R or python linked containers for analysing the data in the database?

drkane commented 7 years ago

That's a very good idea. I'm not entirely sure how to accomplish it, but I think the Dockerfile would probably need something like the following commands to run it:

RUN python /import.py /data/RegPlusExtract_January_2017.zip
RUN mysql  -e "CREATE DATABASE ccextract;"
RUN mysql  ccextract < table-definition.sql
RUN mysqlimport --local --ignore-lines=1 --fields-terminated-by=, --fields-optionally-enclosed-by="\"" --fields-escaped-by=\ --verbose --delete ccextract data/extract_acct_submit.csv data/extract_aoo_ref.csv data/extract_ar_submit.csv data/extract_charity.csv data/extract_charity_aoo.csv data/extract_class.csv data/extract_class_ref.csv data/extract_financial.csv data/extract_main_charity.csv data/extract_name.csv data/extract_objects.csv data/extract_partb.csv data/extract_registration.csv data/extract_remove_ref.csv data/extract_trustee.csv

(Would need proper login details added to the mysql commands).

Have you got an example of a similar container?

drkane commented 7 years ago

You might find this useful/interesting too: https://github.com/tithebarn/charity-base. Same data, but in MongoDB and with an API.

psychemedia commented 7 years ago

Closest I've got is PostgreSQL and that's muddled up in https://github.com/psychemedia/ou-tm351/tree/master/TM351_docker_Testing

I really need to do a simple worked teaching example for my own reference/reuse!

psychemedia commented 7 years ago

There are examples of getting node running in containers at eg https://nodejs.org/en/docs/guides/nodejs-docker-webapp/

drkane commented 7 years ago

That's useful, thanks. I'd be happy to collaborate on a worked example - tbh I find docker and particularly docker-compose confusing and struggle to get my head around it sometimes!

psychemedia commented 7 years ago

I made a start but hit issues I don't have time to debug right now:-( (Most of my own reusable scripts are for postgres or mongo).

Init script for use by the Docker file - this breaks when the build process tries to run it (can't connect):

The chinit.sh file:

#!/bin/bash

echo "Starting up..."

#I'm not sure if we need to start and stop the db daemon?
#Couldn't offhand find a Hello World demo that just worked..!
#http://stackoverflow.com/a/29150538/454773
/usr/sbin/mysqld &
#until mysqladmin ping >/dev/null 2>&1; do
#  echo -n "."; sleep 0.2
#done
sleep 5

echo "Creating database..."
mysql  -e "CREATE DATABASE ccextract;"
echo "...done"

echo "Extracting..."
mysql  ccextract < /tmp/table-definition.sql
echo "...done"

echo "Importing..."
mysqlimport --local --ignore-lines=1 --fields-terminated-by=, --fields-optionally-enclosed-by="\"" --fields-escaped-by=\ --verbose --delete ccextract data/extract_acct_submit.csv data/extract_aoo_ref.csv data/extract_ar_submit.csv data/extract_charity.csv data/extract_charity_aoo.csv data/extract_class.csv data/extract_class_ref.csv data/extract_financial.csv data/extract_main_charity.csv data/extract_name.csv data/extract_objects.csv data/extract_partb.csv data/extract_registration.csv data/extract_remove_ref.csv data/extract_trustee.csv
echo "...done"

#Do we need to shut down any db daemon we started?
#mysqladmin shutdown
echo "...and done"

Dockerfile to build the MySQL container - this can go in the same directory as other files, along with the data zip file ...

FROM mysql/mysql-server

#We need to copy (ADD) the files we're using from the local directory into the container image
ADD import.py ./import.py
ADD bcp.py ./bcp.py

ADD RegPlusExtract_March_2017.zip data/RegPlusExtract_March_2017.zip
ADD table-definition.sql /tmp/table-definition.sql

RUN python ./import.py data/RegPlusExtract_March_2017.zip

#Anything in /docker-entrypoint-initdb.d/ should run as an init script?
ADD chinit.sh /docker-entrypoint-initdb.d/chinit.sh

#Not sure if we should just explicitly run things directly, at least in testing!
#i.e. comment out the /docker-entrypoint-initdb.d/ ADD and replace with:
#ADD chinit.sh /chinit.sh
#RUN /chinit.sh

The following docker-compose.yml can also go in the same folder - run it with docker-compose up --build -d and it will build the MySQL container from the local Dockerfile, pull down the Jupyter notebook image/container from Dockerhub, start the two containers and link them together:

#docker-compose up --build -d
mysqlcc:
  build: .
  environment:
    MYSQL_ROOT_PASSWORD: charitable
  expose:
    - "3306"
jupytercc:
  image: jupyter/scipy-notebook
  links:
    - mysqlcc
  ports:
    - "8888"
  volumes:
    - .:/home/jovyan/work

Running a python2 notebook, we should then be able to connect using something like the following:

##Use a py2 notebook (most straightforward mysql py package doesn't work in py 3?)
!conda install -y mysql-python
import sqlalchemy
engine = sqlalchemy.create_engine('mysql://root:charitable@mysqlcc:3306')
engine.execute("USE ccextract")

However, on one occasion I did get the db running (not sure how...) I got a disallowed connection message (connection incoming from a disliked IP address (the IP address of the Jupyter notebook). So no idea what's happening there either... (I'm a bit out of practice with this docker stuff; maybe my memories of it "just working" are all false memories!)