stats4sd / Installation-Guides

A set of installation and basic use guides for software and tools used / recommended by the Research Methods Support / Stats4SD team
GNU General Public License v3.0

Advanced guidance for using linked tables #66

Open chrismclarke opened 5 years ago

chrismclarke commented 5 years ago

With the current setup, data is imported once as raw (Excel auto-populates the column types), cleaned, and re-imported into an empty copy of the table, then used for querying.

The problem comes when defining table relationships, as some of the constraints may no longer hold (e.g. a form with a duplicate farmer ID has been removed, but trial data still references that form).

I'm not sure whether it's best to keep track of deletions and manually remove them from both places, to link directly to the db and use cascade rules on deletion, to find a way to allow for errors, or something else.

What do you think, @dave-mills?

chrismclarke commented 5 years ago

Hey @dave-mills, can I give you a poke on this one for feedback? I now remember this is where I became unstuck.

dave-mills commented 5 years ago

I often use on delete cascade when building dbs before data collection/entry. It's neat and useful for avoiding that sort of trap, but you're right - not so useful for the workflow we're suggesting (exporting to Open Refine, re-importing after deleting outside the database).
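For reference, a minimal sketch of what `on delete cascade` does, using SQLite and hypothetical table/column names (`hh_info`, `plot_data`, `form_id`) - not the project's actual schema:

```python
# Sketch of ON DELETE CASCADE in SQLite.
# Table and column names here are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per-connection

con.execute("CREATE TABLE hh_info (form_id TEXT PRIMARY KEY)")
con.execute("""
    CREATE TABLE plot_data (
        id INTEGER PRIMARY KEY,
        form_id TEXT REFERENCES hh_info(form_id) ON DELETE CASCADE
    )
""")

con.execute("INSERT INTO hh_info VALUES ('F001')")
con.execute("INSERT INTO plot_data (form_id) VALUES ('F001')")

# Deleting the parent row automatically removes its plot_data children
con.execute("DELETE FROM hh_info WHERE form_id = 'F001'")
remaining = con.execute("SELECT COUNT(*) FROM plot_data").fetchone()[0]
print(remaining)  # 0
```

The catch, as above: this only helps when the deletion happens inside the database, not when rows are dropped in Open Refine and re-imported.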

Does Open Refine let you look at both tables, so you can look at hh_info and see the related plot_data records? (I assume not.) If not, then you probably need to keep track of the deletions you make in hh_info. Then, when you look at plot_data, check that:

  1. The records in plot_data that are linked to the deleted hh_info records are actually duplicates, and not things you need to keep.
  2. If they are duplicates, delete them in Open Refine.
  3. If they are not duplicates, then change their form_id so they get linked to the correct record.

Sounds like a hassle, but I think that's the price to pay for using Open Refine (which doesn't see links between tables) and for being thorough, rather than accidentally deleting things that aren't exact duplicates!

Do you think that makes sense?
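The check in step 1 could be scripted once the tables are back in the database - a sketch, again with assumed table/column names, finding plot_data rows whose form_id no longer exists in hh_info:

```python
# Hypothetical orphan check: plot_data rows whose parent hh_info
# record was deleted (e.g. during cleaning in Open Refine).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hh_info (form_id TEXT PRIMARY KEY)")
con.execute("CREATE TABLE plot_data (id INTEGER PRIMARY KEY, form_id TEXT)")
con.executemany("INSERT INTO hh_info VALUES (?)", [("F001",), ("F002",)])
con.executemany("INSERT INTO plot_data (form_id) VALUES (?)",
                [("F001",), ("F003",)])  # F003's parent was deleted

# LEFT JOIN + IS NULL keeps only the rows with no matching parent
orphans = con.execute("""
    SELECT p.id, p.form_id
    FROM plot_data p
    LEFT JOIN hh_info h ON p.form_id = h.form_id
    WHERE h.form_id IS NULL
""").fetchall()
print(orphans)  # [(2, 'F003')]
```

Each orphan then needs the manual decision in steps 2-3: delete it if it's a true duplicate, or repoint its form_id to the correct record.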

(To be honest, it's good practice to keep a record of the changes you make / things you delete during data cleaning anyway, so your colleague can peer-review everything and in the future someone can take your raw data and recreate your analysis.)

chrismclarke commented 5 years ago

So the real problem is the way the data does/doesn't fit together.

There's a nice many->one mapping from plot_data to hh_info (assuming the same ODK form), so we can build that link fine initially. The first challenge is that hh_info has duplicate farmers (with different data, though), so I'm guessing the best option is simply to rename the farmernames (e.g. AEP088-old) and keep the full set of data in case it needs to be referenced or can be resolved manually.

The second challenge is the preference_data. As well as having duplicate values, it also references lots of farmers that don't exist in the hh_info table, and lacks a number that do. I'm guessing we don't want to butcher the data down to the smallest possible subset, so instead we might want a separate 'farmers' table containing links to the hh_info form_id and a preference_data generated primary key?

dave-mills commented 5 years ago

That is a good way of putting it. Commenting here mainly for record-keeping, as I think we had a good resolution on Skype:

Long term, I would strongly argue in favour of a separate farmers table for this project. This could involve:

  1. Create a new farmers table.

    • Look through hh_info and identify any columns that are properties of the 'farmer', and not properties of the 'study observation' - e.g. county, village, gender, wealth, possibly gpshomestead. (Probably not ngo_name or officer, as they aren't properties of the farmer - rather properties of how the farmer interacts with the network / researchers).
    • Add these columns to your farmers table. You might want to add more columns later, but these would be a good starting point.
  2. Get the unique set of farmers from your hh_info and preference_data tables and populate your farmers table. For farmers in preference_data but not in hh_info, you'll just have the farmername field and nothing else - that's fine to start, and then you just have an ongoing task to try and complete this information from other sources.

  3. Next, you can add your relationships:

    • hh_info (many) -> farmers (one)
    • preference_data (many) -> farmers (one)
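The steps above could be sketched roughly as follows (SQLite, with made-up column names and placeholder village values - only farmername and village are shown, but the same pattern extends to the other farmer-level columns):

```python
# Sketch of steps 1-3: build a farmers table from the union of
# farmernames in hh_info and preference_data. All data and most
# column names here are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hh_info (form_id TEXT, farmername TEXT, village TEXT)")
con.execute("CREATE TABLE preference_data (id INTEGER PRIMARY KEY, farmername TEXT)")
con.executemany("INSERT INTO hh_info VALUES (?, ?, ?)",
                [("F001", "AEP088", "VillageA"), ("F002", "AEP101", "VillageB")])
con.executemany("INSERT INTO preference_data (farmername) VALUES (?)",
                [("AEP088",), ("AEP200",)])  # AEP200 has no hh_info record

# Step 1: new farmers table holding farmer-level properties
con.execute("CREATE TABLE farmers (farmername TEXT PRIMARY KEY, village TEXT)")

# Step 2: populate from the unique set across both tables; farmers
# only in preference_data get NULL for the other columns for now
# (MAX ignores NULLs, so the hh_info value wins when one exists)
con.execute("""
    INSERT INTO farmers (farmername, village)
    SELECT farmername, MAX(village) FROM (
        SELECT farmername, village FROM hh_info
        UNION
        SELECT farmername, NULL FROM preference_data
    ) GROUP BY farmername
""")

# Step 3: both hh_info and preference_data can now reference
# farmers(farmername) with a many->one foreign key.
farmers = con.execute(
    "SELECT farmername, village FROM farmers ORDER BY farmername").fetchall()
print(farmers)  # [('AEP088', 'VillageA'), ('AEP101', 'VillageB'), ('AEP200', None)]
```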

This process could be hinted at in the case study, but not fully explained in this version (as it's a bit involved).