Open chrismclarke opened 5 years ago
Hey @dave-mills , can I give you a poke on this one for feedback as now I remember this is where I became unstuck
I often use on delete cascade when building dbs before data collection/entry. It's neat and useful for avoiding those sort of traps, but you're right - not so useful for the workflow we're suggesting (exporting to Open Refine, re-importing after deleting outside the database).
Does Open Refine let you look at both tables, so you look at hh_info and see the related plot_data records? (I assume not). If not, then you probably need to keep track of the deletions you do in the hh_info. Then, when you look at plot_data, check that:
Sounds like a hassle, but I think that's price to pay for using open refine (which doesn't see links between tables) and for being thorough and not accidentally deleting things that aren't exact duplicates!
Do you think that make sense?
(To be honest, it's good practice to keep a record of the changes you make / things you delete during data cleaning anyway, so your colleague can peer-review everything and in the future someone can take your raw data and recreate your analysis.)
So the real problem is the way the data does/doesn't fit together.
There's a nice many->one mapping from plot_data to hh_info (assume same odk form) so we can build that link initially fine. The first challenge is that the hh_info has duplicate farmers (different data though), so I'm guessing the best option is to simply rename the farmernames (e.g. AEP088-old) and keep the full set of data in case it needs to be referenced or could be manually resolved.
The second challenge is the preference_data. As well as having duplicate values it also has references to lots of farmers that don't exist in the hh_info table, and lacks a number that do. I'm guessing we don't want to butcher the smallest possible subset, and so instead might want a separate 'farmers' table containing links to the hh_info form_id and preference_data generated primary key?
That is a good way of putting it. Commenting here mainly for record-keeping, as I think we had a good resolution on Skype:
Long term, I would strongly argue in favour of a seperate farmers
table for this project. This could involve:
Create a new farmers
table.
county
, village
, gender
, wealth
, possibly gpshomestead
. (Probably not ngo_name
or officer
, as they aren't properties of the farmer - rather properties of how the farmer interacts with the network / researchers). farmers
table. You might want to add more columns later, but these would be a good starting point. Get the unique set of farmers from your hh_info and preference_data tables and populate your farmers
table. For farmers in preference_data but not in hh_info, you'll just have the farmername
field and nothing else - that's fine to start, and then you just have an ongoing task to try and complete this information from other sources.
Next, you can add your relationships:
This process could be hinted at in the case study, but not fully explained in this version (as it's a bit involved)
With the current setup data is imported once as raw (excel auto populates column types), cleaned and reimported to empty table copy, then used in querying.
The problem comes when defining table relationships as some of the constraints may no longer exist (e.g. removed form with duplicate farmer id but trial data still references the form).
Not sure whether it is best to keep track of deletions and manually remove from both places, try to link direct to db and use cascade rules on deletion, find a way to allow for errors or something else.
What do you think @dave-mills