Isolate the data acquisition from the data publication

jufemaiz commented 6 years ago

At the moment, nemweb mashes together data acquisition (e.g. download zip, extract, process) and data publication (persistence via nemweb_sqlite).

I am proposing that we at a minimum separate the two different processes and have a processor. This way, there is the potential to use other storage solutions (including publication to a queue for writing).
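A rough sketch of the kind of split proposed here, not how nemweb is currently structured. All names (`Publisher`, `acquire`, `SqlitePublisher`, `QueuePublisher`, `run`) are made up for illustration, and it assumes the zip holds plain CSV files with a header row, which is simpler than what the real files need:

```python
import csv
import io
import queue
import sqlite3
import zipfile
from typing import Dict, Iterable, Iterator, Protocol


class Publisher(Protocol):
    """Hypothetical interface: anything that can accept processed rows."""

    def publish(self, table: str, rows: Iterable[Dict[str, str]]) -> int: ...


def acquire(zip_bytes: bytes) -> Iterator[Dict[str, str]]:
    """Acquisition only: unzip in memory and yield each CSV row as a dict."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


class SqlitePublisher:
    """Publication to SQLite; knows nothing about zips or downloads."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection

    def publish(self, table: str, rows: Iterable[Dict[str, str]]) -> int:
        rows = list(rows)
        if not rows:
            return 0
        columns = list(rows[0])
        column_sql = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        # Untyped columns keep the sketch short; a real schema would be stricter.
        self.connection.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})'
        )
        self.connection.executemany(
            f'INSERT INTO "{table}" ({column_sql}) VALUES ({marks})',
            [tuple(row[c] for c in columns) for row in rows],
        )
        self.connection.commit()
        return len(rows)


class QueuePublisher:
    """Publication to an in-process queue, standing in for a real message queue."""

    def __init__(self, target: "queue.Queue") -> None:
        self.target = target

    def publish(self, table: str, rows: Iterable[Dict[str, str]]) -> int:
        count = 0
        for row in rows:
            self.target.put((table, row))
            count += 1
        return count


def run(zip_bytes: bytes, publisher: Publisher, table: str) -> int:
    """Wire acquisition to whichever publisher the caller passes in."""
    return publisher.publish(table, acquire(zip_bytes))
```

With this shape, swapping storage is just a matter of passing a different publisher to `run`; the acquisition code does not change.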

Thoughts?

dylanjmcconnell commented 6 years ago

Hey,

At the moment there is a data server at unimelb running the backend. The python scripts actually interact with a mysql database (rather than a sqlite db).

I made a very simple sqlite interface, basically because I thought it would be more useful (or user friendly) than requiring someone to set up a mysql server. The mysql server is pretty strict (i.e. normalised, foreign key constraints etc) - and quite large... Some series go back to the start of the NEM.

There is some degree of abstraction between the mysql interface (...which uses sqlalchemy) and the downloading / processing - but there is also a fair bit of interaction between the download/processing and the data persistence (since there is a degree of mapping between primary key tables in the mysql db and the downloaded files, if that makes sense)... I think I did try separating it out completely once before (but I gave up).
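To illustrate the kind of coupling described here: a normalised schema usually needs each natural key found in a downloaded file turned into a primary key from a lookup table before a row can be stored. The sketch below is only a guess at the general pattern, not the actual mysql code; the `unit` table, its `code` column and the `KeyResolver` class are hypothetical, and it uses sqlite so it runs without a server (`INSERT OR IGNORE` is sqlite syntax).

```python
import sqlite3
from typing import Dict


class KeyResolver:
    """Maps a natural key from a downloaded file to a surrogate primary key."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection
        self.cache: Dict[str, int] = {}
        # Hypothetical lookup table; a real schema would have several of these.
        connection.execute(
            "CREATE TABLE IF NOT EXISTS unit ("
            "id INTEGER PRIMARY KEY, code TEXT UNIQUE NOT NULL)"
        )

    def resolve(self, code: str) -> int:
        """Return the id for `code`, inserting a lookup row on first sight."""
        if code in self.cache:
            return self.cache[code]
        self.connection.execute(
            "INSERT OR IGNORE INTO unit (code) VALUES (?)", (code,)
        )
        (key,) = self.connection.execute(
            "SELECT id FROM unit WHERE code = ?", (code,)
        ).fetchone()
        self.cache[code] = key
        return key
```

If a lookup like this lives entirely on the persistence side, the download/processing code can keep passing natural keys around and never needs to know about the primary key tables.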

Longer term - I was thinking of running the python scripts on an EC2 server, and using Amazon RDS (rather than the unimelb data server). The web front end is on S3 btw. Even better / longer term would be a docker container - but that's a looong way down the track I think.

Am open to suggestions on all of this - but that's where my thinking is at the moment. Have some local branches for interfacing with mysql etc (which I'll eventually push to the repo when I am not entirely embarrassed by them). But yeah, in the meantime, have only got the lightweight sqlite interface in the master repo.

p.s. looking to add you to slack (already have one) but am not the workspace 'owner'

Cheers, Dylan

jufemaiz commented 6 years ago

Sweet! Ok thanks for a bit more information on this. I've got some ideas that I'll try and put down to throw at you.

I was just going through the repo to try and bring in some pytest & pylint coverage, and the above was my first impression. Funny you mention docker because I've already got the container part working. As for the database & data interactions, I've got some ideas there too to try and make this scalable + ultra cheap to run!
