ruta-goomba opened 9 years ago
Looking at the repository for the cadastre data, the updates seem to have become more frequent: they are now done on a weekly basis instead of monthly.
Another question: would it be possible to highlight what has changed when a dataset is updated within the database, i.e. some sort of diff? For example, an investigative journalist would be interested to see which concessions have changed ownership since the last update; that's probably the only thing that would need to be highlighted.
It doesn't really matter how often things change. We can have a cron job that runs every night on the server and checks whether there's a new database dump in flexicadastre that differs from the copy we store internally on the server. If it's not different, that's it. Otherwise, we'll need to extract the data, convert it to the new DB schema (for this we need to know how you created the new DB schema: was it derived from the old one or built manually?), and then send it to CartoDB. This implies having a text file of some kind (CSV, SQL, etc.), which can be version-controlled so you can visualise diffs.
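As a sketch of the comparison step, assuming a POSIX shell and using cmp to detect changes (the actual wget download is left out so the example is self-contained; file names are placeholders):

```shell
#!/bin/sh
# Sketch of the nightly change check: compare the freshly downloaded dump
# with the stored copy and only proceed when they differ.

# dump_changed NEW OLD -> exit status 0 (true) when the files differ
dump_changed() {
    ! cmp -s "$1" "$2"
}

# Demo with temporary files standing in for the downloaded and stored dumps:
old=$(mktemp); new=$(mktemp)
printf 'v1\n' > "$old"
printf 'v1\n' > "$new"
dump_changed "$new" "$old" && echo "differ" || echo "same"
printf 'v2\n' > "$new"
dump_changed "$new" "$old" && echo "differ" || echo "same"
rm -f "$old" "$new"
```

When the files differ, the job would go on to the extract/convert/push steps described above.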
Would this serve your purpose? Perhaps useful to discuss in our CartoDB walkthrough call? See #10
@raquel-ucl : the DB schema was imported from the original one that was shared sometime back. The schema has not changed since then.
But how did you create the DB schema? Did you write it from scratch (i.e. insert all the values manually into the new schema), or did you write a script or something that takes the data from the flexicadastre (I assume NA.json) and converts it to an SQL schema or similar?
Yes, the schema was created from scratch, and the values were inserted manually or imported via the CSV file.
Oh, you have CSV file with the data? That's awesome. How did you build that CSV or where did you get it from?
The CSV file is here https://docs.google.com/spreadsheets/d/1Mt1ileAXaFpQfEJ368-3nQRPhg8HI7r-j-O0RjzKCUs/edit#gid=312265893 . This is the file that was populated by the data collectors.
Got it, thanks!
I see it says "This data is no longer needed as it can be drawn from the cadastre scrape - there may be some historical blocks that we want to keep data on". Is this outdated then? There are also some redundant fields, which make me think this data was generated programmatically. I might be wrong, but if not, do you have the code that did so?
Just making a note for myself: I've created a bash script that syncs a folder on the server with the data in the online Flexicadastre, in a new branch (feature/dataSync). I'll turn it into a cron job later. It'll also need something like this to detect when a new file is downloaded:
inotifywait -m /path -e create -e moved_to |
    while read path action file; do
        echo "The file '$file' appeared in directory '$path' via '$action'"
        # do something with the file
    done
I've added the job to the crontab. It'll run at midnight every night. Right now it's in /home/dev/git/IPPR/utils/SyncFlexicadastre.sh
For now it only synchronises the data in /home/dev/data/flexicadastre with the contents of http://data.pudo.org/flexicadastre. There are a few other TODOs in the script (keeping a log, etc.)
To get into the server:
ssh dev@45.55.35.212
The password is in the "Accounts" email.
To modify the crontab:
crontab -e
To see the contents of the crontab:
crontab -l
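For reference, the "every midnight" schedule corresponds to a crontab entry along these lines (the log path is an assumption; the script path is the one quoted above):

```
# m h dom mon dow  command
0 0 * * * /home/dev/git/IPPR/utils/SyncFlexicadastre.sh >> /home/dev/logs/flexisync.log 2>&1
```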
Just got this feedback from the person who put together the database schema.
He used SQL and interactive Ruby to modify the data to fit the new schema in an ad-hoc way. So it wasn't entirely manual, but he also doesn't have any useful scripts. He assumed that in the production environment much of the live data would come from the monthly cadastre scrapes, so we'd need to write import scripts specifically to deal with this. There also wasn't so much data in the sample spreadsheet that he couldn't normalize most of it by hand, with a few SQL queries where necessary.
OK, then we'll need to set up a call later on to clarify how to map the flexicadastre data to the new DB. I suggest sticking to the four tables needed by the License Holders page for now; once we do that, it'll be easier to make progress.
Note to self: @ruta-goomba mentioned that my plans for the cron job and the Python script could all be done in NodeJS. Need to explore this option as well.
Not working at the moment; needs investigation, and possibly replacing wget with something more sophisticated. It might be worth forgetting about shell and just writing a Python script.
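Before rewriting everything in Python, one incremental option is a small retry wrapper around the wget call; this is just a sketch of the idea, not the actual SyncFlexicadastre.sh script:

```shell
#!/bin/sh
# retry CMD [ARGS...]: run a command, retrying up to 3 attempts in total
# before giving up. A flaky download then fails loudly instead of silently.
retry() {
    n=0
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge 3 ] && return 1
        sleep 1   # brief pause between attempts
    done
}

# Example use inside the sync script (URL is the one already used there):
# retry wget -q -O NA.json http://data.pudo.org/flexicadastre/NA.json \
#     || echo "flexicadastre sync failed" >&2
```

The cron job can then check the script's exit status and log failures rather than leaving the data silently stale.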
@raquel-ucl Does this need to move to the next milestone or will it be completed in the current one?
I'd move it; it's not really needed yet, since we are focusing on developing against the current sample of data in CartoDB. It's still unclear how all this data insertion/updating will work in the future: there are no clear requirements for it, and the mapping between the data in the flexicadastre and the SQL schema we have has not been given to us yet. I was planning on mapping the columns that look the same, but some others have been added just for the new schema, with data from elsewhere. I was just trying to build a system that notifies us when there's a new batch of data, extracts it, compares it with the previous one, and highlights the changes that need to be applied to the CartoDB database. This work can be done in parallel and doesn't really impact anywhere else in the code. I'd say it's low priority for now.
@raquel-ucl: I did some mapping in #74 some time back. Would that help with this process?
Going to move this to the next milestone and lower the priority per the comments above. We can re-prioritize the tickets scoped in to Etosha before we begin work on that.
Raquel: 'The cron job should update the data in CartoDB. Then, in turn, response.json can be populated, but always from CartoDB, not directly in the cron job. There is a way to insert data into CartoDB from a script using UPDATE SQL queries. For example: http://gis.stackexchange.com/questions/94982/what-is-the-simplest-way-to-write-data-to-cartodb-using-python
It's also possible to do it with NodeJS, and I'm sure with other technologies. We can decide later which to choose.
Does this sound fine?'
The data flow would be:
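As an aside on the CartoDB update step mentioned above, the SQL-API approach from the link could also be driven from a shell script with curl; the account name, table, column names, and API key below are placeholders, not real values:

```shell
#!/bin/sh
# Sketch: push an UPDATE to CartoDB from the cron job via its SQL API (v2).
# CARTODB_ACCOUNT and CARTODB_API_KEY are assumed to be set in the environment.

# usage: cartodb_sql "SQL statement"
cartodb_sql() {
    curl -s "https://${CARTODB_ACCOUNT}.cartodb.com/api/v2/sql" \
        --data-urlencode "q=$1" \
        --data-urlencode "api_key=${CARTODB_API_KEY}"
}

# Example call (not executed here; table and column names are hypothetical):
# cartodb_sql "UPDATE concessions SET owner = 'New Co' WHERE license_no = 'ML-123'"
```

The same function works for INSERT and DELETE statements, so the nightly job can apply whatever diff it detects.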