namibmap / IPPR


Create daily Cron Job to detect flexicadastre changes #42

Open ruta-goomba opened 8 years ago

ruta-goomba commented 8 years ago

Raquel: 'The cron job should update the data in CartoDB. Then, in turn, response.json can be populated, but always from CartoDB, not directly in the cron job. There is a way to insert data in CartoDB from a script using UPDATE SQL queries. For example: http://gis.stackexchange.com/questions/94982/what-is-the-simplest-way-to-write-data-to-cartodb-using-python

It's also possible to do it with NodeJS, and I'm sure with other technologies. We can decide later which to choose.

Does this sound fine?

The data flow would be:

Flexicadastre changes
A cron job, run daily (for example), detects that the flexicadastre changed
The cron job calls a script to update the tables in CartoDB
response.json gets reloaded. This depends on where the queries are done from. If it's from the front end, then response.json is not really necessary at all: as soon as the data is updated, the next user that loads the map will get the latest data. If we want to do it from the back end, we just rewrite response.json.'
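The update step in the flow above could be sketched in Python against the CartoDB SQL API. This is only a sketch: the account name, table, column names, and API key below are placeholders, not the project's real values.

```python
from urllib.parse import urlencode

# Placeholder account name -- not the project's real CartoDB account.
CARTO_ACCOUNT = "example-account"
SQL_API = "https://{account}.carto.com/api/v2/sql"

def build_sql_request(account, query, api_key):
    """Build the URL and form-encoded body for a CartoDB SQL API call."""
    url = SQL_API.format(account=account)
    body = urlencode({"q": query, "api_key": api_key})
    return url, body

# Hypothetical table and columns, for illustration only.
url, body = build_sql_request(
    CARTO_ACCOUNT,
    "UPDATE concessions SET holder = 'New Holder Ltd' WHERE code = 'ML-123'",
    "secret-key",
)
# The request itself would then be sent with e.g.
# urllib.request.urlopen(url, body.encode())
```

The actual cron job would wrap this in error handling and read the API key from the server's configuration rather than hard-coding it.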
lbewlay commented 8 years ago

Looking at the repository for the cadastre data, it looks like the updates have become more frequent. They are now done on a weekly basis instead of monthly.

Another question: would it be possible to highlight what has changed when a dataset is updated within the database, for example with some sort of diff? An investigative journalist would be interested to see which concessions have changed ownership since the last update; that's probably the only thing that would need to be highlighted.

raquelalegre commented 8 years ago

It doesn't really matter how often things change. We can have a cron job that gets executed every night in the server and checks if there's a new database dump in flexicadastre that is different to the copy we store internally in the server. If it is not different, then that's it. Otherwise, we'll need to extract the data, convert it to the new DB schema (for this we need to know how you've created the new DB schema, was it from the old one or in a manual way?), and then send it to CartoDB. This will imply having a text file of some kind (CSV, SQL, etc), which can be version controlled and hence you can visualise diffs.
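The nightly "is the dump different from our copy?" check described above could be as simple as comparing file hashes. A minimal sketch (the real script would first download the new dump to a temporary location):

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def has_changed(new_dump, stored_copy):
    """True if the freshly downloaded dump differs from the stored copy."""
    if not Path(stored_copy).exists():
        return True  # nothing stored yet, so treat it as changed
    return file_digest(new_dump) != file_digest(stored_copy)
```

If `has_changed` returns False the job exits; otherwise it goes on to the extract/convert/upload steps.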

Would this serve your purpose? Perhaps useful to discuss in our CartoDB walkthrough call? See #10

lbewlay commented 8 years ago

@raquel-ucl : the DB schema was imported from the original one that was shared sometime back. The schema has not changed since then.

raquelalegre commented 8 years ago

But how did you create the DB schema? Did you write it from scratch (i.e. inserted all the values manually in the new schema), or did you write a script or something that gets the data from the flexicadastre (I assume NA.json) and converts it to an SQL schema or similar?

lbewlay commented 8 years ago

Yes, the schema was created from scratch, and the values were inserted manually or imported via the CSV file.

raquelalegre commented 8 years ago

Oh, you have a CSV file with the data? That's awesome. How did you build that CSV, or where did you get it from?

lbewlay commented 8 years ago

The CSV file is here https://docs.google.com/spreadsheets/d/1Mt1ileAXaFpQfEJ368-3nQRPhg8HI7r-j-O0RjzKCUs/edit#gid=312265893 . This is the file that was populated by the data collectors.

raquelalegre commented 8 years ago

Got it, thanks!

I see it says "This data is no longer needed as it can be drawn from the cadastre scrape - there may be some historical blocks that we want to keep data on". Is this outdated, then? There are also some redundant fields, which makes me think this data was generated programmatically. I might be wrong, but otherwise do you have the code that did so?

raquelalegre commented 8 years ago

Just making a note for myself: I've created a bash script that syncs a folder in the server with the data in the online Flexicadastre in a new branch (feature/dataSync). I'll make it into a cronjob later. Also it'll need something like this to detect when a new file is downloaded:

# requires the inotify-tools package
inotifywait -m /path -e create -e moved_to |
    while read path action file; do
        echo "The file '$file' appeared in directory '$path' via '$action'"
        # do something with the file
    done
raquelalegre commented 8 years ago

I've added the job to the crontab; it'll be executed every midnight. Right now it's in /home/dev/git/IPPR/utils/SyncFlexicadastre.sh

For now it only synchronises the data in /home/dev/data/flexicadastre with the contents of http://data.pudo.org/flexicadastre. There are a few other TODOs in the script (keeping a log, etc.)

To get in the server:

ssh dev@45.55.35.212 (the password is in the "Accounts" email).

To modify the crontab:

crontab -e

To see the contents of the crontab:

crontab -l
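For reference, the midnight schedule described above would look something like this as a crontab entry (the exact line on the server may differ):

```shell
# min hour day-of-month month day-of-week  command
0 0 * * * /home/dev/git/IPPR/utils/SyncFlexicadastre.sh
```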

lbewlay commented 8 years ago

Just got this feedback from the person who put together the database schema.

He used SQL and interactive Ruby to modify the data to fit the new schema in an ad-hoc way. So it wasn’t entirely manual, but also don’t have any useful scripts. He assumed that in the production environment, much of the live data would be coming from the monthly cadastre scrapes, so we’d need to write import scripts specifically to deal with this. There also wasn’t so much data in the sample spreadsheet that he couldn’t normalize most of it by hand and with a few SQL queries where necessary.

raquelalegre commented 8 years ago

OK, then we'll need to set up a call later on to make clear how to map the flexicadastre data to the new DB. I suggest sticking to the 4 tables needed by the License Holders page for now; once we do that, it'll be easier to make progress.

raquelalegre commented 8 years ago

Note to self: @ruta-goomba mentioned that my plans for the cron job and the Python script can all be done in NodeJS. Need to explore this option as well.

raquelalegre commented 8 years ago

Not working at the moment; needs investigation, and possibly replacing wget with something more robust. It might be worth forgetting about shell and just writing a Python script.
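If the shell script is abandoned, the wget step could be replaced with a small Python downloader along these lines. A sketch only: the base URL and destination directory are taken from the comments above, and retry/logging logic would still need to be added.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen

BASE_URL = "http://data.pudo.org/flexicadastre"   # mirror source, per the comments above
DEST_DIR = "/home/dev/data/flexicadastre"         # local sync folder, per the comments above

def local_path(url, dest_dir):
    """Where a downloaded file should land locally, keyed by its URL basename."""
    name = os.path.basename(urlparse(url).path)
    return os.path.join(dest_dir, name)

def fetch(url, dest_dir):
    """Download one file; unlike plain wget, a failure raises a Python
    exception we can catch and log from the cron job."""
    path = local_path(url, dest_dir)
    with urlopen(url) as resp, open(path, "wb") as out:
        out.write(resp.read())
    return path
```

Listing the files to fetch would depend on how the remote directory is exposed, which is part of what needs investigating.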

missfunmi commented 8 years ago

@raquel-ucl Does this need to move to the next milestone or will it be completed in the current one?

raquelalegre commented 8 years ago

I'd move it; it's not really needed yet, since we are focusing on developing features for the current sample of data in CartoDB. It's still unclear how all this data insertion/update will work in the future. There are no clear requirements on it, and the mapping between the data in the flexicadastre and the SQL schema we have has not been given to us yet. I was planning on mapping the columns that look the same, but some others have been added just for the new schema, with data from elsewhere. I was trying to build a system that notifies us when there's a new batch of data, extracts it, compares it with the previous one, and highlights the changes that need to be applied to the CartoDB database. This work is to be done in parallel and doesn't really impact anywhere else in the code. I'd say it's low priority for now.
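The "compare with the previous one and highlight changes" step could be prototyped with the standard library. The column names and values below are made up for illustration; the real input would be the version-controlled CSV/SQL snapshots mentioned earlier in this thread.

```python
import difflib

def highlight_changes(old_rows, new_rows):
    """Return unified-diff lines between two snapshots of the cadastre data,
    e.g. two versions of the exported CSV read as lists of lines."""
    return list(difflib.unified_diff(
        old_rows, new_rows,
        fromfile="previous", tofile="latest", lineterm=""))

# Hypothetical rows, for illustration only.
old = ["code,holder", "ML-1,Alpha Ltd", "ML-2,Beta Ltd"]
new = ["code,holder", "ML-1,Alpha Ltd", "ML-2,Gamma Ltd"]
for line in highlight_changes(old, new):
    print(line)
```

Lines prefixed with `-`/`+` are the removed/added rows, which is exactly the "which concessions changed ownership" view requested above.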

lbewlay commented 8 years ago

@raquel-ucl : I did some mapping in #74 some time back; would that help with this process?

missfunmi commented 8 years ago

Going to move this to the next milestone and lower the priority per the comments above. We can re-prioritize the tickets scoped in to Etosha before we begin work on that.