Thanks for this @vhurtadol, @paninee is on a short holiday but will be back next week and can have a look at this.
Thanks! Here are the two other issues I found:
First, there are some cases where negative and missing values might mean the same thing but are coded separately. For scope_of_influence, we have NA and also “no_geo”. I am assuming they mean the same, but they are coded differently.
For public_spectrum, we have both not and NA. Since this refers to the “Goal from the IAP2 Spectrum of Public Participation that the case best represents”, I think no is the same as NA. However, it may be that this decision was made so we can tell how many people skip the question and how many actually say that this does not apply to the case they are writing about.
This is the same issue for: • recruitment_method • decision_methods_1 : 4 • if_voting_1 : 4 • implementers_of_change_1 : 5
Second, we have assigned both NA and n/a for missing values in the same column. For example, for facilitators:
The same issue applies to these columns: • insights_outcomes_1 • organizer_types_1 • funder_types_1
Finally, we should change the coding for learning_resources_1 to learning_resources_5. Here, we have all three options: no_info, not, and NA.
The list of columns where this is an issue: • general_issues_4 • general_issues_5 • specific_topics_4 • targeted_participants_4 • targeted_participants_5 • implementers_of_change_4
Thanks Vero! @ascott if you agree, Pan can work on this issue before starting on #1006
@jesicarson yes sounds good!
@vhurtadol Question - If we included a row at the top of the CSV, would that mess up using the CSV in R? We'd like to include the search query (if it's a CSV of search results) + download date + codebook link. Please let me know if this makes sense, and we can get Pan to implement. (This suggestion is an alternative to adding a tab with the full codebook, as CSVs don't support tabs.)
Hi @jesicarson, it wouldn't mess the CSV in R. Probably the researcher would just delete the top row before opening the file in R or Stata. I think it's a fine alternative if we want it all on the same file. However, having it as a separate link (as you last suggested) also works very well.
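For anyone who would rather not delete the top row by hand, here is a minimal Python sketch of reading a download that carries one metadata line above the header. The metadata layout, the placeholder values, and the function name `read_with_metadata` are assumptions for illustration only; the actual export format had not been decided at this point in the thread.

```python
import csv
import io

# Hypothetical download: one metadata line, then the usual header and rows.
# The metadata layout and placeholder values are made up for illustration.
raw = (
    "# query=<search query>; downloaded=<date>; codebook=<link>\n"
    "id,title,facilitators\n"
    "1,Example case,yes\n"
    "2,Another case,\n"
)


def read_with_metadata(text):
    """Return (metadata_line, rows); rows are dicts keyed by the CSV header."""
    stream = io.StringIO(text)
    # Consume the first line so the CSV reader starts at the real header.
    metadata = stream.readline().rstrip("\n")
    rows = list(csv.DictReader(stream))
    return metadata, rows


metadata, rows = read_with_metadata(raw)
print(metadata)
print(rows)
```

In R, the same effect should be achievable with the `skip` argument of `read.csv`, so the extra row would not need to be removed manually.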
@vhurtadol Are you using https://participedia.net/search?selectedCategory=case&returns=csv to download the CSV? I don't see the issues with the wrong column. Here is what I downloaded. participedia-data-cases.csv.zip
@vhurtadol @ascott @jesicarson For Q1, the value <N/A> was probably generated by R. They are empty values. None of these fields are required. So if the user didn't select any value it would be blank. For example, <N/A> and no_geo are not the same. <N/A> just means the user didn't select any scope_of_influence.
@vhurtadol @ascott for #2, we have some old data that has more than the limitation of 3 values. I remember we want to standardize the multi-select fields to 5 fields (https://github.com/participedia/api/issues/983). Should we change the frontend accordingly?
Hi @zphingphong, yes. I downloaded the CSV from https://participedia.net/search?selectedCategory=case&returns=csv. Here is what I downloaded: participedia-data-cases-11.csv.zip
Yes, R assigned the NA when there is a missing value (no response to that question). I'm glad it was clarified that no_geo means something different.
Then, I think we can just focus on the second group of cases, where there is manual coding of n/a and the automatic coding of NA by R.
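As a rough illustration of that second group, the sketch below blanks out hand-entered missing tokens so they match the empty cells that R reads as NA, while leaving substantive values such as no_geo alone. The token set and the sample rows are assumptions, not the actual export code.

```python
import csv
import io

# Assumed set of hand-entered tokens that should count as missing.
MISSING_TOKENS = {"n/a", "na"}


def blank_missing(value):
    """Return "" for hand-entered missing tokens, else the value unchanged."""
    return "" if value.strip().lower() in MISSING_TOKENS else value


# Made-up rows: a real answer, a manual n/a, a literal NA, and a true blank.
raw = "id,facilitators\n1,yes\n2,n/a\n3,NA\n4,\n"

rows = [
    {key: blank_missing(val) for key, val in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]
print(rows)
```

After this step, every missing answer is an empty cell, so there is only one representation of "no response" left in the column.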
remaining todos:
• n/a, not_applicable, not, no, not relevant and change to null or recode to clearer values @ascott @jesicarson
• a value from 1 entry
@paninee @ascott Just got confirmation from @plscully that we can delete any old data for general issues beyond the 3rd response to clean up the CSV, and keep the limit to 3 on the frontend as is.
I truncated the multi-select fields data to the same limitation as the UI form. Here is the list of original records I updated for future reference.
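For reference, the truncation described above could look roughly like the following on a single exported row. This is only a sketch: the helper name `truncate_multiselect`, the five-column width, and the sample values are hypothetical, and the real change may well have been made directly in the database.

```python
MAX_VALUES = 3  # limit of the UI form, per the thread


def truncate_multiselect(row, field, max_values=MAX_VALUES, width=5):
    """Blank out <field>_N columns for N > max_values; returns a new dict."""
    cleaned = dict(row)
    for n in range(max_values + 1, width + 1):
        key = f"{field}_{n}"
        if key in cleaned:
            cleaned[key] = ""
    return cleaned


# Made-up row with more values than the form allows.
row = {
    "id": "7",
    "general_issues_1": "a",
    "general_issues_2": "b",
    "general_issues_3": "c",
    "general_issues_4": "d",
    "general_issues_5": "e",
}
cleaned = truncate_multiselect(row, "general_issues")
print(cleaned)
```

The original row is left untouched, which makes it easy to keep a record of what was changed.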
hi @vhurtadol, pan and i both reviewed the CSV files and are not seeing the issues with the 49 columns that you mention above: https://github.com/participedia/api/issues/1007#issue-746314103. we have opened the csv that you attached here, as well as one downloaded directly from participedia, using Excel, Numbers and google sheets and are not able to reproduce the issue you see. What application are you using to view the csv? Have you done any data cleaning on it before you see this issue?
Hi @ascott, I was opening the csv with excel and R, and I don't do any data cleaning before noticing the issue with the rows. I'll try downloading it in another computer this evening and check... it might just be my laptop!
Hi @ascott and @paninee, I opened the csv elsewhere and it works! So, it was just my Excel. Thanks for checking that it worked fine.
@vhurtadol @ascott @jesicarson I downloaded the OECD case collection & converted the CSV to XLSX (attached). At first glance it looks good, but there are problems with rows 9, 10, 200 & 201. (See next note for file.) There are also rows (e.g. 163, 210, 259) where the "brief description" text (column E) appears to be mis-formatted.
@paninee when you have time please review the issue pat has identified (related to #1006) . i'm not sure if the problem is happening because the OECD collection (https://participedia.net/collection/6786) is hidden? i opened pat's sheet in excel and see that the rows he identified (9,10, 200 & 201) do look incorrect. Also, when i downloaded the results for cases for that same collection I see different rows with the same problems (12, 194). here's the csv file i downloaded. here are screenshots of the file:
@ascott @jesicarson I hope we can work on the download issues soon after the holiday break. In addition to other things we've noticed, today I did a CSV download of all case data. I am working in Windows, so the file opened in Excel. I highlighted 49 rows where I saw problems, but when I saved the file and then reopened it, none of my changes were saved. @vhurtadol thinks this problem will go away if users save data downloads as XLSX. Is this something we can/should apply as our default setting? Or would it be better to give users a clear warning as to how they should save the file?
This issue is on the agenda for the week of Jan 4-11 https://github.com/orgs/participedia/projects/2
@plscully please have a look through this document and let us know which, if any, of the "not applicable" or "don't know" values are necessary in the form. Ideally we'd remove all NA values so that the CSV works better in R (per @vhurtadol 's suggestions).
@vhurtadol I agree that we need to delete unnecessary values for NA or DK, but I'm embarrassed to say that my knowledge of how to code survey values is so out of date that I don't know why including an NA value creates a problem. I had thought that every response translated into a 0 or 1. I am so out of date that I don't understand how "factor" values work, but I'm guessing that this is why we can't simply code an NA response as 0. ... I only now realized that this is a different approach than the one we used for our last codebook that @Mattygryan developed in 2013. (See here.) ... I also don't understand what the effect is of not including specific labels for each value. For ex, in the 2013 codebook one of the options under the "Types of interaction among participants" was labelled as "PrticpntIntrctnDscssn," but the new codebook labels each option as "Type of Interaction 1" and so on. ... Please understand, I am not saying that we need to do this differently, it's just that I don't think I know enough about how someone would use this new type of codebook to be able to provide useful suggestions.
Hi @plscully! I think the main issue with having NA as an answer in the form is that it might mean something different than when users don't fill out the form. So we can either decide that all questions should have an option of "not applicable" or "NA" and that they count the same as a missing value (somebody who didn't answer the form, a blank space in the dataset). If this is what you think works best for the users, then I can specify this in the codebook or (and this is not knowing how hard this option is) when the data is collected in the CSV, all the "Not applicable" and "NA" are transformed into missing values (blank spaces).
I'm not sure what you mean by the NA being coded as 0, but my immediate answer is that in some cases we can have meaningful 0 values (for example, when 0 means no, which is not the same as a missing value) and the stats software will not read the 0 as a missing value. In my experience, these programs prefer that NA is just missing (blank).
About the new codebook, I am following the labels used in the CSV. I can include the value labels for all the variables if you think this is more helpful. V-Dem does that but includes a very long document for their codebook (https://www.v-dem.net/en/data/data/v-dem-dataset/). My main concern was that it might be too bulky (not very user-friendly), especially because most of the values are very intuitive. Let me know what you think!
@vhurtadol @plscully I haven't read up the whole thread - just skimmed - but haven't checked in on github in a while til i got the notification and delighted by the progress you have all made. The community should be very grateful.
I'd be very keen that we can differentiate missing values from a positive choice made by respondents (in fact it is essential). I might be completely mansplaining now as i haven't read the whole thread and it could be the blind leading the blind, but in case it helps @plscully, I think the issue is that the default value that the R programming language (and some other languages) has for a missing value is NA [when i did the codebook in 2013 i wasn't an R user and probably didn't even know what it was, i was still congratulating myself when i could write a line of code in SPSS VBA language]. Anyway, if we have values for NA that mean something other than 'missing value', we have a problem when we import to R. So we need to change our NAs to something else, to distinguish something someone positively said is 'not applicable' vis-à-vis a non-response. So a not applicable or don't know can take a value of 0 (or whatever). A decision needs to be made as to whether such values can be combined. So what we want is something like: 'not applicable/dk' has a value of 0, missing values have a value of NA, and other responses can be 1 (e.g. for Yes), 2 (e.g. for No) etc. Ideally we can have that as consistent as possible throughout.
So I don't know if that is helpful or is the blind leading the blind, or in the best case scenario dummies trying to explain to dummies (which is the niche i have carved out for my career) ;). Vero or others can correct me if i have said something silly.
Hi @plscully @vhurtadol @Mattygryan - good to be back at it after the hols!
We can change all the existing Not Applicable / Don't Know coding from "NA" to "not_applicable" and "dont_know" to differentiate from the NA/null in R and solve the problem. We (the dev team) just wanted to make sure you wanted to keep those, since not all fields have them so it's a bit inconsistent. We assumed that if the answer was not applicable or you don't know it, you'd leave the field blank. But obviously we defer to the data analysts on this one! Thanks for your help. Again, if you want to skim visually, I've taken screenshots of the fields in question here.
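To make the proposed recoding concrete, here is a small Python sketch that maps explicit "Not Applicable" / "Don't Know" answers to the codes not_applicable and dont_know and leaves skipped questions blank. The form labels in the mapping and the function name `export_code` are assumptions; the real option keys in the form may differ.

```python
# Assumed form labels; the real option keys may differ.
EXPLICIT_CODES = {
    "not applicable": "not_applicable",
    "don't know": "dont_know",
}


def export_code(form_answer):
    """Map a form answer to its CSV code; unanswered questions stay blank."""
    if form_answer is None or form_answer.strip() == "":
        return ""  # blank cell, which R reads as NA (missing)
    key = form_answer.strip().lower()
    # Any other answer is passed through unchanged.
    return EXPLICIT_CODES.get(key, form_answer)


print(export_code("Not Applicable"))
print(export_code("Don't Know"))
print(repr(export_code(None)))
```

With this split, an analyst can still choose to merge not_applicable and dont_know with the missing values later, but the export no longer forces that decision.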
@Mattygryan @vhurtadol @jesicarson I think I am beginning to understand the problem and what can be done to fix it, but wonder if the quickest & easiest way to figure this out would be a brief Zoom call. Matt -- If you've not yet seen it, please have a look through this document Jesi created that shows the part of the case data form where we include Not Applicable and Don't Know options. It also looks as though even including the word "no" in a value's label causes problems. ... And if you think it would be useful, I made this googledoc copy of our detailed outline of the case data entry forms if we want to use the "suggest" and comment features to insert new value labels.
This week I can chat after 10am Weds, after 9am Thurs, or after 11am Fri (PST). If we can get on the same page I can update Pan and Alanna next week.
@plscully sure! I can chat on Thursday and Friday (after 11am) as well.
I can jump on the phone briefly Friday evening. We are back in a full lockdown so I won't be in the pub on a Friday :)
Generally it is best if we can differentiate non-response as a separate category. Then it is easy for the user to decide if they want to group the no/n-a/dk all together or not.
@Mattygryan @vhurtadol @plscully I sent out a calendar invite with a zoom link for 11am PST tomorrow (friday, jan 8). Let me know if you didn't get it or need the link. talk soon!
@jesicarson it's in my diary :)
Fixed the Excel issue and sorted the CSV by IDs. Tested on stage and deployed to Prod.
The cases' CSV file shows 49 rows that have incorrect information. I think the backend code is not assigning the values to the correct columns for these cases.
The rows where this is a problem are:
16 28 51 61 63 77 81 157 241 351 416 432 641 652 687 708 924 931 934 937 938 949 963 990 1000 1038 1138 1214 1226 1240 1254 1305 1327 1375 1482 1483 1484 1507 1508 1509 1510 1511 1512 1513 1529 1530 1535 1551 1553 1557 1672