mianzg commented 2 months ago

Hello @efvicario, the current data dictionary is based on your capstone project information.

Please review and revise the file: https://github.com/openwashdata/watercostaccra/blob/main/data-raw/dictionary.csv

If you want to make any revisions, please follow the steps in the next comment. That way you will also appear as a Contributor on the homepage of this GitHub repository!

mianzg commented 2 months ago

I have prepared the following steps:

1. Go to https://github.com/openwashdata/watercostaccra/blob/main/data-raw/dictionary.csv and download the file.
2. Open the downloaded dictionary.csv with Excel.
3. Edit the description column for all the rows you want to revise, and save the file as CSV.
4. Go to https://github.com/openwashdata/watercostaccra/tree/main/data-raw and select Add File/Upload Files.
5. Upload:
   a. Upload the revised dictionary.csv file.
   b. Write the commit message.
   c. Select the option "Create a new branch" and name the branch "dictionary".
   d. Click the green button to propose the change.

efvicario commented 2 months ago

Hi Mian, I proposed edits for the dictionary and the datasets. Both datasets had extra cells at the bottom that I forgot to delete. I have two questions about the readme file:

did you mean to include the watercost2 descriptions/example?
the description for housing type looks weird: the [ is not displaying correctly, so it looks like this:

"housing type ([1] block unit: unit in a row of apartments made of cement blocks, [2] wood unit: unit in a row of apartments made of wood, house, [3] compound house: single-story L- or C-shaped house with a multiple units around a shared courtyard, [4] multi-story apartment building, [5] wooden shack, [6] no structure, [7] other)"

thanks, Elizabeth

mianzg commented 2 months ago

@efvicario Thanks for your revision!

On question 1, it would be nice to include one! Do you have any ideas? On question 2, I changed the format. We need to do more work on documenting the categorical variables. This is a work in progress; please refer to the comments below.

Also, I would suggest renaming the two datasets to something like "survey" and "waterpoints" instead of "watercostaccra1" and "watercostaccra2". The current names are not very intuitive. What do you think? If we rename them, what names would you like?

efvicario commented 2 months ago

@mianzg Sure, renaming would definitely be more intuitive. How about "households" and "waterpoints"?

For the waterpoint data example, maybe a simple bar chart comparison of price per liter ("avg_price_per_liter_cedis") between communities?
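A minimal sketch of that comparison, written in Python with only the standard library for illustration (the repository's own processing is in R). It assumes the waterpoints data has a `community` column next to `avg_price_per_liter_cedis`; the column name `community`, the community labels, and all prices below are hypothetical placeholders, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the waterpoints dataset.
# Community names and prices are invented for illustration only.
rows = [
    {"community": "community_a", "avg_price_per_liter_cedis": 0.02},
    {"community": "community_a", "avg_price_per_liter_cedis": 0.04},
    {"community": "community_b", "avg_price_per_liter_cedis": 0.05},
]


def mean_price_by_community(rows):
    """Group rows by community and average the price per liter."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["community"]].append(row["avg_price_per_liter_cedis"])
    return {community: mean(values) for community, values in prices.items()}


def text_bar_chart(means, width=40):
    """Render one text bar per community, scaled to the largest mean."""
    largest = max(means.values())
    lines = []
    for community, value in sorted(means.items()):
        bar = "#" * round(width * value / largest)
        lines.append(f"{community:<12} {bar} {value:.3f}")
    return "\n".join(lines)


means = mean_price_by_community(rows)
print(text_bar_chart(means))
```

The text bars only stand in for a real plot; the per-community means are what a bar chart would display.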

Elizabeth

mianzg commented 2 months ago

Revise categorical variables

Incomplete: Some categorical variables have more available options than appear in the responses. Should we make these factor variables with the full set of levels?
Reorder: Some are ordinal categorical variables. Should we give the levels an explicit order?

I will use "incomplete" and "reorder" to describe the issues below.

Households

business_ownership: why is there a category "no", and what does it mean?
business_water_source: incomplete. As an example of this issue, the collected values only include "commercial tap", "packaged", and "piped to compound", but the dictionary lists more options.
primary_dw_source: incomplete
dw_treatment: reorder?
primary_water_source: incomplete
time_of_last_struggle_to_find_water: reorder

Waterpoints

available_services: it looks like a multi-selection question, need to expand to multiple columns maybe
managers: same as above
perception: reorder
tap_closure_changes: incomplete
CBT_sample_source: Can we remove all content in parentheses? "indirect_fromtap(traveled_through_hose)" "otherstorage(traveled_through_hose_or_poured_through_container)"
coli_mpn_health_risk:
- reorder
- why are there both "probably_unsafe" and "possibly_unsafe"?
tc_mpn_health_risk: same as above
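Two of the cleaning steps raised in this list, dropping the parenthesised notes from CBT_sample_source and expanding a multi-selection answer into one column per option, could look roughly like the sketch below. It is written in Python with only the standard library for illustration (the repository's own processing is in R). The helper names are hypothetical, and the `;` separator and the `option_a`/`option_b`/`option_c` values are assumptions, not the actual contents of available_services or managers.

```python
import re


def strip_parentheses(value):
    """Drop any parenthesised note, e.g. 'a(b)' becomes 'a'."""
    return re.sub(r"\s*\([^)]*\)", "", value).strip()


def expand_multi_select(values, separator=";"):
    """Turn multi-select strings into one True/False column per option.

    The separator is an assumption; adjust it to match the raw data.
    """
    split_values = [
        [part.strip() for part in value.split(separator) if part.strip()]
        for value in values
    ]
    # Every option seen anywhere becomes a column, in alphabetical order.
    options = sorted({part for parts in split_values for part in parts})
    return [
        {option: option in parts for option in options}
        for parts in split_values
    ]


cleaned = strip_parentheses("indirect_fromtap(traveled_through_hose)")
# Hypothetical multi-select answers, one string per waterpoint.
expanded = expand_multi_select(["option_a;option_b", "option_c"])
print(cleaned)
print(expanded)
```

Each expanded row carries the same set of keys, so the result maps directly onto additional indicator columns.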

efvicario commented 1 month ago

Revise categorical variables

Incomplete: Some categorical variables have more available options than the values in response, should we make these factor variables with full levels?
I think it's OK to delete the unused categories from the descriptions. I will publish something linking to this data, and I can cross-reference that paper, which has the survey instrument, if people want to see all the categories that were used.

Reorder: Some are ordinal categorical variables, should we make the levels with the order?

I will use "incomplete" and "reorder" to describe the issues in the below.

Households

business_ownership: why there is a category "no", what does that mean?
I guess this should be n/a. It means neither the household nor the respondent has a business.

business_water_source: incomplete, to give an example on this issue, we only have "commercial tap" "packaged" "piped to compound" in collected values, but the dictionary indicates more.
These were the options I had in the survey, so I included them all in the dictionary, but I can delete the unused categories for all these "incomplete" variables below.

primary_dw_source: incomplete
I can delete the extra categories in the dictionary.

dw_treatment: reorder?
I suppose there are levels to this, in order of effectiveness (i.e., none --> settle --> boil), but I'm not sure it matters much since most people did no treatment. Also, one person did multiple treatments and it's listed as "boil;settle", but maybe there should just be a category "multiple_methods"?

primary_water_source: incomplete
I can delete the extra categories in the dictionary.

time_of_last_struggle_to_find_water: reorder
Agreed, it should be n/a --> over a year ago --> last year --> last 30 days --> last 7 days --> last 3 days.

Waterpoints

available_services: it looks like a multi-selection question, need to expand to multiple columns maybe

managers: same as above
Agreed on both counts, sorry, that was my oversight!

perception: reorder
Agreed, low --> acceptable --> high.

tap_closure_changes: incomplete
I think this is complete, there are only three options, but this should allow multiple selections. I will edit the dictionary and answer choices to be clearer. There is also a mistake in one answer choice (it should be a 20 liter bucket, not a 2 liter bucket).

CBT_sample_source: Can we remove all content in parentheses? "indirect_fromtap(traveled_through_hose)" "otherstorage(traveled_through_hose_or_poured_through_container)"
Sorry, another oversight, I'll correct it.

coli_mpn_health_risk:
- reorder
- why there is "probably_unsafe" and "possibly_unsafe"

tc_mpn_health_risk: same as above
Agreed to reorder these. The classifications were created by the company that made the test. I can include a link to that company's instructions for use (https://assets.ctfassets.net/vcps67yikf8u/5IbwfssqfSWqCo0U88GCAw/4ef1a9606f22cba7d79705ba3d096956/CBT_Instructions_EN.pdf).
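The level orders agreed in this reply could be recorded once and reused for sorting, roughly as in the sketch below. It is written in Python with only the standard library for illustration (the repository's own processing is in R). The orders are copied from the replies above; the exact spelling of each level in the data files and the names `LEVEL_ORDER` and `sort_by_level` are assumptions.

```python
# Level orders taken from the replies above, lowest to highest.
# The exact strings used in the data files may differ (assumption).
LEVEL_ORDER = {
    "time_of_last_struggle_to_find_water": [
        "n/a",
        "over a year ago",
        "last year",
        "last 30 days",
        "last 7 days",
        "last 3 days",
    ],
    "perception": ["low", "acceptable", "high"],
    "dw_treatment": ["none", "settle", "boil"],
}


def sort_by_level(values, variable):
    """Sort responses by the agreed level order instead of alphabetically."""
    rank = {level: index for index, level in enumerate(LEVEL_ORDER[variable])}
    # A KeyError here means a response is missing from the agreed levels.
    return sorted(values, key=lambda value: rank[value])


print(sort_by_level(["high", "low", "acceptable"], "perception"))
```

A response that is not in the agreed list, such as "boil;settle", raises a KeyError, which surfaces the open question about a "multiple_methods" category rather than hiding it.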

I will make some edits in the dictionary and data files and reupload soon. Thanks!

Elizabeth

mianzg commented 1 month ago

@bonschorno @larnsce I just want to tag you here again about our nice exchange on categorical variables in surveys. Two points came up while documenting the variables.

The survey was designed with more possible choices than appear in the final responses. Luckily, this was noticed because the author provided the first version of the data dictionary. When I reviewed the data itself, I could not find those choices and got stuck editing the data dictionary.
Multi-select survey questions are hard to identify and, once identified, hard to clean. For instance, the variable managers records the typical managers of water points. It looks like this

mianzg commented 1 month ago

@efvicario Sorry for my late response. I am working on the "reorder" variables. For the "incomplete" ones, I would keep all the possible options and add them to data-processing.R too. That way, if we draw a bar plot, these options will appear with a zero count.

However, I'm less experienced in analysis of these qualitative data and would like to know why you want to simply remove them.
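The effect of keeping every possible option can be sketched as follows, in Python with only the standard library for illustration (data-processing.R itself is R, and this is not its actual code). The three observed values come from the business_water_source example earlier in the thread; the fourth level and the function name `count_with_full_levels` are hypothetical placeholders.

```python
from collections import Counter


def count_with_full_levels(responses, levels):
    """Count responses per level, keeping unused levels with a zero count."""
    counts = Counter(responses)
    unexpected = set(counts) - set(levels)
    if unexpected:
        # Responses outside the dictionary point to a documentation gap.
        raise ValueError(f"responses not in dictionary: {sorted(unexpected)}")
    return {level: counts.get(level, 0) for level in levels}


# Hypothetical responses using the three values observed in the data.
responses = ["commercial tap", "packaged", "piped to compound", "packaged"]
levels = [
    "commercial tap",
    "packaged",
    "piped to compound",
    "unused_option",  # hypothetical stand-in for an option nobody selected
]

print(count_with_full_levels(responses, levels))
```

Dropping the unused levels from the dictionary would remove the zero-count entries, which is the trade-off under discussion here.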

efvicario commented 1 month ago

Hi @mianzg,

Either way makes sense to me. The reason for including the questions in the dictionary is to show that we had other options besides what was given by respondents. It could be valuable to include them for some questions, for example, the types of water sources, since those answer choices are based on JMP classifications.

However, I'm writing a paper that explains the survey methodology and the survey will be available (open access) as part of the supplementary information. So, we can simply link to that paper for people to see the entire survey and all the potential answer choices. That would make the dataset simpler.

Elizabeth

openwashdata / watercostaccra

Review data dictionary and update variable description #2
