openwashdata / watercostaccra

Data of household survey on water costs and coping strategies in Accra associated with a project report completed by Elizabeth Vicario for the “data science for openwashdata” course
https://openwashdata.github.io/watercostaccra/
Creative Commons Attribution 4.0 International

Review data dictionary and update variable description #2

Open mianzg opened 2 months ago

mianzg commented 2 months ago

Hello @efvicario, the current data dictionary is based on your capstone project information.

Please review and revise the file: https://github.com/openwashdata/watercostaccra/blob/main/data-raw/dictionary.csv

If you want to propose any revisions, please follow the steps in the next comment. That way you will also appear as a contributor on the homepage of this GitHub repository!

mianzg commented 2 months ago

I have prepared the following steps:

  1. Go to https://github.com/openwashdata/watercostaccra/blob/main/data-raw/dictionary.csv and download the file.

  2. Open the downloaded dictionary.csv in Excel.

  3. Edit the description column for every row you want to revise, and save the file as CSV.

  4. Go to https://github.com/openwashdata/watercostaccra/tree/main/data-raw and select Add file / Upload files.

  5. Upload:
     a. Upload the revised dictionary.csv file.
     b. Write the commit message.
     c. Select the option "Create a new branch" and name the branch "dictionary".
     d. Click the green button to propose the change.

efvicario commented 2 months ago

Hi Mian, I proposed edits for the dictionary and the datasets. Both datasets had extra cells at the bottom that I forgot to delete. In the readme file, two questions-

  1. did you mean to include the watercost2 descriptions/example?
  2. the description for housing type looks weird - the [ is showing up as into [ so it looks like this:

"housing type ([1] block unit: unit in a row of apartments made of cement blocks, [2] wood unit: unit in a row of apartments made of wood, house, [3] compound house: single-story L- or C-shaped house with a multiple units around a shared courtyard, [4] multi-story apartment building, [5] wooden shack, [6] no structure, [7] other)"

thanks, Elizabeth
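
The "extra cells at the bottom" mentioned above can also be caught programmatically. The following is a minimal sketch in Python (chosen only for illustration; the package's own processing is done in R), assuming the extra cells show up as rows whose cells are all empty. The sample data and the helper name `drop_trailing_empty_rows` are hypothetical.

```python
import csv
import io


def drop_trailing_empty_rows(rows):
    """Return the rows with blank rows at the end removed.

    A row counts as blank when every cell is empty or whitespace only.
    """
    rows = list(rows)
    while rows and all(cell.strip() == "" for cell in rows[-1]):
        rows.pop()
    return rows


# Made-up sample standing in for an exported survey file (not the real data).
sample = "id,community\n1,A\n2,B\n,\n,\n"
cleaned = drop_trailing_empty_rows(csv.reader(io.StringIO(sample)))
print(cleaned)
```

Blank rows in the middle of the file are left untouched, since those are more likely to need a manual look.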

mianzg commented 2 months ago

@efvicario Thanks for your revision!

On question 1: it would be nice to include one! Do you have any ideas? On question 2: I changed the format. We still need to work on documenting the categorical variables. That work is in progress, so please see the comments below.

Also, I would suggest renaming the two datasets to something like "survey" and "waterpoints" instead of "watercostaccra1" and "watercostaccra2". The current names are not very intuitive. What do you think? If we rename them, which names would you like?

efvicario commented 2 months ago

@mianzg Sure, renaming would definitely be more intuitive. How about "households" and "waterpoints"?

For the waterpoint data example, maybe a simple bar chart comparison of price per liter ("avg_price_per_liter_cedis") between communities?

Elizabeth
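
The comparison suggested above boils down to a mean of "avg_price_per_liter_cedis" per community. Here is a rough sketch in Python (for illustration only; the package README would use R), assuming the waterpoint data has a column identifying the community. The column name `community` and all values below are made up, not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_price_by_community(records):
    """Average avg_price_per_liter_cedis for each community."""
    prices = defaultdict(list)
    for rec in records:
        # "community" is an assumed column name.
        prices[rec["community"]].append(float(rec["avg_price_per_liter_cedis"]))
    return {name: mean(vals) for name, vals in sorted(prices.items())}


def text_bar_chart(means, width=40):
    """Render one text bar per community, scaled to the largest mean."""
    top = max(means.values()) or 1  # avoid dividing by zero if all means are 0
    lines = []
    for name, value in means.items():
        bar = "#" * round(width * value / top)
        lines.append(f"{name:<12} {bar} {value:.3f}")
    return lines


# Made-up records for illustration only.
records = [
    {"community": "community_a", "avg_price_per_liter_cedis": "0.05"},
    {"community": "community_a", "avg_price_per_liter_cedis": "0.07"},
    {"community": "community_b", "avg_price_per_liter_cedis": "0.03"},
]
means = mean_price_by_community(records)
print("\n".join(text_bar_chart(means)))
```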

mianzg commented 2 months ago

Revise categorical variables

  1. Incomplete: Some categorical variables have more answer options in the dictionary than values that actually occur in the responses. Should we make these factor variables with the full set of levels?
  2. Reorder: Some categorical variables are ordinal. Should we give their levels an explicit order?

I will use "incomplete" and "reorder" to refer to these issues below.

Households

Waterpoints

efvicario commented 1 month ago

Revise categorical variables

  1. Incomplete: Some categorical variables have more available options than the values in response, should we make these factor variables with full levels?
     Reply: I think it's ok to delete the unused categories from the descriptions. I will publish something linking to this data and I can cross ref that paper which has the survey instrument if people want to see all the categories that were used.
  2. Reorder: Some are ordinal categorical variables, should we make the levels with the order?

I will use "incomplete" and "reorder" to describe the issues in the below.

Households

  • business_ownership: why there is a category "no", what does that mean?
    Reply: I guess this should be n/a, it means neither the household nor respondent has a business
  • business_water_source: incomplete, to give an example on this issue, we only have "commercial tap" "packaged" "piped to compound" in collected values, but the dictionary indicates more.
    Reply: These were the options I had in the survey so I included them all in the dictionary, but I can delete the unused categories for all these "incomplete" variables below.
  • primary_dw_source: incomplete
    Reply: I can delete extra categories in the dictionary
  • dw_treatment: reorder?
    Reply: I suppose there are levels to this, in order of effectiveness (i.e., none --> settle --> boil) but I'm not sure it matters much since most people did no treatment. Also one person did multiple treatments and it's listed as "boil;settle" but maybe there should just be a category "multiple_methods"?
  • primary_water_source: incomplete
    Reply: I can delete extra categories in the dictionary
  • time_of_last_struggle_to_find_water: reorder
    Reply: Agreed, should be n/a --> over a year ago --> last year --> last 30 days --> last 7 days --> last 3 days
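
The two household-level ideas above (an explicit order for time_of_last_struggle_to_find_water, and a single "multiple_methods" category for answers like "boil;settle") could be sketched as follows. This is Python for illustration only; in the package this would be done with R factors. The exact level strings are assumed to match the wording in this thread and may differ in the data, and both helper names are hypothetical.

```python
# Level order as proposed in this thread; the exact strings in the data are an assumption.
STRUGGLE_LEVELS = [
    "n/a",
    "over a year ago",
    "last year",
    "last 30 days",
    "last 7 days",
    "last 3 days",
]


def collapse_multiple(value, sep=";"):
    """Map a multi-treatment answer such as "boil;settle" to one category."""
    parts = [p.strip() for p in value.split(sep) if p.strip()]
    return "multiple_methods" if len(parts) > 1 else value.strip()


def sort_by_level(values, levels):
    """Sort values by their position in levels; unknown values raise KeyError."""
    rank = {level: i for i, level in enumerate(levels)}
    return sorted(values, key=rank.__getitem__)


print(collapse_multiple("boil;settle"))
print(sort_by_level(["last 7 days", "n/a", "last year"], STRUGGLE_LEVELS))
```

Raising on unknown values is deliberate here: a typo in a level would otherwise be sorted silently into the wrong place.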

Waterpoints

  • available_services: it looks like a multi-selection question, need to expand to multiple columns maybe
  • managers: same as above
    Reply: Agreed on both counts, sorry that was my oversight!
  • perception: reorder
    Reply: Agreed, low --> acceptable --> high
  • tap_closure_changes: incomplete
    Reply: I think this is complete, there are only three options, but this should be multiple selections. I will edit the dictionary and answer choices to be more clear, also there is a mistake in one answer choice (should be 20 liter, not 2 liter bucket)
  • CBT_sample_source: Can we remove all content in parentheses? "indirect_fromtap(traveled_through_hose)" "otherstorage(traveled_through_hose_or_poured_through_container)"
    Reply: Sorry, another oversight, I'll correct it
  • coli_mpn_health_risk:

    • reorder
    • why there is "probably_unsafe" and "possibly_unsafe"
  • tc_mpn_health_risk: same as above
    Reply: Agreed to reorder these. The classifications were created by the company that made the test, I can include a link to that company's instructions for use (https://assets.ctfassets.net/vcps67yikf8u/5IbwfssqfSWqCo0U88GCAw/4ef1a9606f22cba7d79705ba3d096956/CBT_Instructions_EN.pdf)
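
Expanding a multi-selection variable such as available_services or managers into multiple columns could look roughly like this. It is a Python sketch for illustration only, assuming the selections are separated by ";" (as in "boil;settle"; the actual delimiter in these columns is not confirmed in this thread). The example values and the helper name `expand_multiselect` are hypothetical.

```python
def expand_multiselect(values, sep=";"):
    """Turn multi-select answers into one 0/1 indicator per option.

    Returns one dict per input value, keyed by every option seen in the column.
    """
    split = [[p.strip() for p in v.split(sep) if p.strip()] for v in values]
    options = sorted({opt for row in split for opt in row})
    return [{opt: int(opt in row) for opt in options} for row in split]


# Made-up answers, not taken from the dataset.
managers = ["private", "community;private", ""]
for row in expand_multiselect(managers):
    print(row)
```

An empty answer ends up as a row of zeros, which keeps it distinguishable from "not asked" only if missing values are handled before this step.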

I will make some edits in the dictionary and data files and reupload soon. Thanks!

Elizabeth
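
For the CBT_sample_source cleanup discussed above, removing the content in parentheses can be done with a single regular expression. This is a minimal Python sketch for illustration (the helper name `strip_parenthetical` is hypothetical), using the two values quoted in this thread:

```python
import re


def strip_parenthetical(label):
    """Remove any "(...)" part, plus whitespace right before it, from a label."""
    return re.sub(r"\s*\([^)]*\)", "", label).strip()


print(strip_parenthetical("indirect_fromtap(traveled_through_hose)"))
# -> indirect_fromtap
print(strip_parenthetical("otherstorage(traveled_through_hose_or_poured_through_container)"))
# -> otherstorage
```

The removed explanations would then belong in the dictionary description rather than in the category values.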

mianzg commented 1 month ago

@bonschorno @larnsce I just want to tag you here again, following our nice exchange about categorical variables in surveys. Two points came up while documenting the variables.

  1. The survey was designed with more possible choices than appear in the final responses. Luckily, we noticed this because the author provided the first version of the data dictionary. When I reviewed the data itself, I could not find those choices and got stuck editing the data dictionary.

  2. Multi-select survey questions are hard to identify and, once identified, hard to clean. For instance, the variable managers records the typical managers of water points. It looks like this: [image]

mianzg commented 1 month ago

@efvicario Sorry for my late response. I am working on the "reorder" variables. For the "incomplete" ones, I would keep all the possible options and add them to data-processing.R too. That way, if we draw a bar plot, these options will appear with a zero count.

However, I'm less experienced in analysis of these qualitative data and would like to know why you want to simply remove them.
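
To make the "zero count" behaviour concrete, here is a small Python sketch (for illustration only; in data-processing.R this would presumably be a factor with the full set of levels). The three observed values are the ones quoted earlier for business_water_source; "unused_option" is a placeholder for a dictionary option that never occurs, not a real category, and the responses are made up.

```python
from collections import Counter


def count_with_all_levels(values, levels):
    """Count each level, keeping levels that never occur with a count of 0."""
    counts = Counter(values)
    return {level: counts.get(level, 0) for level in levels}


# "unused_option" is a placeholder, not a real survey category.
levels = ["commercial tap", "packaged", "piped to compound", "unused_option"]
responses = ["packaged", "commercial tap", "packaged"]  # made-up responses
print(count_with_all_levels(responses, levels))
```

Dropping the unused options from the dictionary instead would simply remove those zero-count entries from any plot or table.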

efvicario commented 1 month ago

Hi @mianzg,

Either way makes sense to me. The reason for including the questions in the dictionary is to show that we had other options besides what was given by respondents. It could be valuable to include them for some questions, for example, the types of water sources, since those answer choices are based on JMP classifications.

However, I'm writing a paper that explains the survey methodology and the survey will be available (open access) as part of the supplementary information. So, we can simply link to that paper for people to see the entire survey and all the potential answer choices. That would make the dataset simpler.

Elizabeth