Closed: @henriquesposito closed this issue 2 years ago
Visualisations for
The visualisations should be a 2D "mosaic" of a consolidated database. Five categories should be included:
Question: how to do so in a computationally efficient way?
Possible answer: instead of retaining all data type information from the datasets and the consolidated database, work with arrays and the combinations/comparisons of rows/columns in these for a consolidated database.
For such an approach, a broadly consolidated database might be ideal (rows and columns present in all datasets)... For other types of consolidated databases, a similar mosaic with information about which datasets the data comes from, instead of the categories above, could also work.
A very rough first (working) draft of the function (I have also pushed these changes to manydata):
db_plot <- function(database, key) {
  if (!all(purrr::map_lgl(database, function(x) key %in% names(x)))) {
    stop("Please declare a key variable present in all datasets in the database.")
  }
  # reduce database into one wide data frame, joining datasets by the key
  out <- purrr::reduce(database, dplyr::full_join, by = key)
  # get all variable names, except the key
  all_variables <- unique(unname(unlist(purrr::map(database, names))))
  all_variables <- all_variables[!all_variables %in% key]
  # create a data frame holding only the key column
  db <- data.frame(out[key], stringsAsFactors = TRUE)
  # check if values are missing, confirmed, conflicting, asymmetric, or unique
  for (var in all_variables) {
    vvars <- paste0("^", var, "$|^", var, "\\.")
    vars_to_combine <- grepl(vvars, names(out))
    if (sum(vars_to_combine) > 1) {
      # paste variables together to work at the string level
      value <- apply(out[vars_to_combine], 1, paste, collapse = "!")
      # strip stray control characters (need to double-check messydates...)
      value <- stringr::str_remove_all(value, "\032")
      # remove string duplicates and collapse unique values
      value <- sapply(stringr::str_split(value, "!"),
                      function(x) paste(unique(trimws(x)), collapse = "!"))
      # use the number of remaining distinct values to determine the category
      value <- ifelse(stringr::str_detect(value, "^NA$|^na$|^NA_character_$"),
                      "missing", value)
      value <- ifelse(stringr::str_count(value, "\\!") ==
                        (sum(vars_to_combine) - 1), "conflict", value)
      value <- ifelse(stringr::str_count(value, "\\!") == 0 &
                        !grepl("^missing$|^conflict$", value),
                      "confirmed", value)
      value <- ifelse(!grepl("^missing$|^conflict$|^confirmed$", value),
                      "asymmetric", value)
    } else {
      # variable present in only one dataset
      value <- out[[which(vars_to_combine)]]
      value <- ifelse(is.na(value), "missing", "unique")
    }
    # fill data frame
    db[, var] <- value
  }
  # gather data
  dbgather <- db %>%
    dplyr::select(-dplyr::all_of(key)) %>%
    tidyr::pivot_longer(cols = dplyr::everything(), names_to = "variable",
                        values_to = "category") %>%
    dplyr::group_by(variable, category) %>%
    dplyr::summarise(count = dplyr::n()) %>%
    dplyr::mutate(perc = count / sum(count)) %>%
    tidyr::pivot_wider(id_cols = variable, names_from = category,
                       values_from = perc) %>%
    dplyr::mutate(dplyr::across(-variable, ~ tidyr::replace_na(.x, 0))) %>%
    tidyr::pivot_longer(-variable) %>%
    dplyr::rename(Variables = variable, Category = name, Percentage = value)
  # plot
  ggplot2::ggplot(dbgather,
                  ggplot2::aes(fill = Category, y = Percentage, x = Variables)) +
    ggplot2::geom_bar(position = "fill", stat = "identity") +
    ggplot2::coord_flip()
}
# example: db_plot(database = manydata::emperors, key = "ID")
Of course the function needs to be made more concise, the visuals adapted, and the name changed. However, before doing so, there are a few things we might want to decide on.
1- The big question here is what to do with NAs when there are conflicts? Suppose, for example, we have 2 datasets that mark an observation as NA, but another one that has a non-NA value for it. Is this an asymmetric conflict or a unique value? Or, alternatively, suppose we have 2 datasets that have different values for an observation and another that marks it as NA: is this a conflict or an asymmetric conflict? Should we count or remove NAs in these cases? I am keeping NAs as meaningful observations for now, but I am not sure this is the best approach. As it stands, we can only have unique values in variables present in a single dataset. Most missing values are coded in these variables, since whenever there is a conflict NAs are taken as meaningful.
2- Another consideration is what to do with observations for variables that are present and confirmed in multiple datasets, but where those variables are not present in all datasets? For now these are treated as confirmed. Do we set a level of "confirmation" for these (e.g. 100% (all datasets in the database), 66% (2/3), 50% (1/2))?
3- Specifically for date variables, what do we do if dates have the same year but one observation has the year only while another observation is more precise (e.g. ymd)? Are these the same or not?
@jhollway and @jaeltan please let me know what you think about the issues raised above. Once we have ironed these out I can adapt the function accordingly and work on improving the output plot.
1- The big question here is what to do with NAs when there are conflicts? Suppose, for example, we have 2 datasets that mark an observation as NA, but another one that has a non-NA value for it. Is this an asymmetric conflict or a unique value?
This is unique.
Or, alternatively, suppose we have 2 datasets that have different values for an observation and another that marks it as NA, is this a conflict or an asymmetric conflict?
This is a conflict.
Should we count or remove NAs in these cases? I am keeping NAs as meaningful observations for now, but I am not sure this is the best approach. As it stands, we can only have unique values in variables present in a single dataset. Most missing values are coded in these variables, since whenever there is a conflict NAs are taken as meaningful.
Ignore NAs when there is at least one observation.
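To make the agreed rules concrete, here is a minimal sketch (a hypothetical helper, not part of manydata) of how one observation's values across datasets could be coded once NAs are ignored whenever at least one non-NA value exists:

```r
# Hypothetical helper: code one observation's values across datasets.
# NAs are ignored whenever at least one non-NA value exists.
code_category <- function(values) {
  obs <- values[!is.na(values)]
  if (length(obs) == 0) return("missing")    # NA in every dataset
  if (length(obs) == 1) return("unique")     # only one dataset has a value
  distinct <- unique(obs)
  if (length(distinct) == 1) return("confirmed")           # all non-NA values agree
  if (length(distinct) == length(obs)) return("conflict")  # all disagree
  "asymmetric"                               # some agree, but not all
}

code_category(c(NA, NA, "Augustus"))  # "unique": the NAs are ignored
code_category(c("14", "37", NA))      # "conflict": the NA is ignored
code_category(c("14", "14", "37"))    # "asymmetric"
```

This is just a per-row illustration of the decisions above, not the vectorised string-counting approach the function itself uses.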
2- Another consideration is what to do with observations for variables present and confirmed in multiple datasets but these variables are not present in all datasets? For now these are treated as confirmed. Do we set a level of "confirmation" for these (e.g. 100% (all datasets in database), 0.66% (2/3), 50% (1/2))?
No, just confirmed. If an observation is corroborated by two sources, this is enough. Data quality should not go down if more, smaller datasets are added.
3- Specifically for date variables, what do we do if dates have the same year but one observation has the year only while another observation is more precise (e.g. ymd)? Are these the same or not?
Good question. Is this confirmed, asymmetric conflict, or unique? Let's call it an asymmetric conflict for now, what do you think @jaeltan ?
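For illustration, one crude way to treat differently precise dates as compatible, assuming ISO-ordered date strings (a hypothetical sketch; messydates offers much richer tools for this):

```r
# Hypothetical sketch: two ISO-formatted date strings are "compatible" when
# they agree up to the coarser of the two precisions.
dates_compatible <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  substr(a, 1, n) == substr(b, 1, n)
}

dates_compatible("1999", "1999-06-21")     # TRUE: same year
dates_compatible("1999-06", "1999-07-01")  # FALSE: months differ
```

Whether a TRUE here should then count as confirmed, asymmetric, or something else is exactly the open question.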
I tried the plot and have a few comments:
How about theme_minimal()? What other aesthetic suggestions do you have, @jaeltan ?
Thanks @henriquesposito for the function! In response to some of the comments already made:
Another thought I had was whether it would be useful to compare what we have in the GNEVAR dataset (if there is one in the database) to the observations in the other datasets in the same database? Since we are adding corrections to the data in the GNEVAR datasets this should be a kind of benchmark to check what we have.
Thank you so much for the helpful feedback @jhollway and @jaeltan !
I updated the colors and theme, dropped ID columns, and reworked the function to resolve variables as discussed above. I am currently working on how to handle dates and how to code asymmetric/conflict when there are multiple NAs. When it comes to dates, I am not sure we should call similar dates with different levels of precision (e.g. YYYYMMDD and YYYY) asymmetric; it may confuse users, who would be inclined to think that the observation is confirmed in multiple datasets but not all ... I believe we should either choose to call them confirmed (treat them as equivalent) and add a note for date variables with different levels of precision, or call them a conflict (treat them as not equal). In that case, asymmetric would indeed mean confirmed in multiple datasets but not all, for consistency.
Besides this, what other features should we add? Should we, for example, allow users to pick columns they are interested in (or columns that appear in more than one dataset)? Could they pick specific datasets to compare within a database?
Great, thanks @henriquesposito . Would you mind appending the latest plot to your comment so that we can see what it looks like now?
I agree with you both that 'asymmetric conflict' sounds a bit negative and/or misleading. How about:
I invite your feedback on this @henriquesposito and @jaeltan
I would agree with this classification @jhollway but two comments:
Ok, issues with determining which observations are conflicting and which are asymmetrical have been fixed. The function now correctly identifies all categories (though it still needs to be made more concise and efficient).
@jhollway and @jaeltan thank you for the comments! I have not yet implemented the new categories/colors, but below is the latest plot:
manydata::db_plot(manyenviron::agreements, "manyID")
Now, when it comes to the new categories, I would perhaps argue that we should not determine/code based on the level of precision/ambiguity for values (unless these values are dates, for which we have methods to determine levels of precision). I would favor a simpler and less "opinionated" approach. This also might make the plot easier for users to understand. I do not mind "asymmetric" (without the word conflict perhaps) as a category. Maybe we can even add a TRUE/FALSE argument so users can choose how precise and imprecise (but similar) dates should be treated.
1- Confirmed values are the same in all datasets in the database.
2- Unique values appear in only one dataset in the database.
3- Missing values are missing in all datasets in the database.
4- Conflicting values are different in the same number of datasets in the database.
5- Asymmetric values have the same value in multiple, but not all, datasets in the database.
Perhaps more important would be the ability to show users which variables appear in multiple datasets in the database so that they can make better sense of the plot as a whole (e.g. are unique/missing values in a variable that appears in one dataset or a variable that appears in multiple datasets?). Indeed @jaeltan perhaps re-ordering the variables can help.
Thanks @henriquesposito. Some further suggestions, no particular order:
Latest version:
db_plot(manyenviron::agreements, "manyID")
As for adding sub-labels (or hover text) to variables, I agree we should, but the issue is that we often do not have easily available information on many of these variables (besides the variables we often code, such as Beg, End, and Title). What types of information should we add? Notice I also added the number of datasets each variable appears in to the latest version of the function.
One more thing: any suggestions for the name of the function?
Hi @henriquesposito , it's definitely getting there, isn't it? Here are some additional/remaining tweaks:
"1.00 \newline (8689)"? Actually, on second thoughts, the green in that spectrum is too lurid, but the rest, e.g. cyan and yellow, helps with a visual ordering, no? Perhaps the green could be toned down just a bit?
This looks really good @henriquesposito . How are the variables ordered? By number of datasets in which the variable is present? And within that? Maybe prioritise non-missing observations? confirmed observations? Non-conflict non-missing observations?
Variables, in theory, should be ordered by the percentage of non-missing observations... though this does not seem to be the case in the plot above. I am still trying to understand why this is happening; I will update the function and repost the plot here soon.
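For what it's worth, one way to enforce that ordering is to compute the non-missing share per variable and relevel the factor before plotting. A self-contained sketch on toy data (hypothetical names, not the actual dbgather output):

```r
# Toy summary in the same shape as the plotting data: one row per
# variable/category pair with its percentage.
summary_df <- data.frame(
  Variables  = c("Beg", "Beg", "End", "End"),
  Category   = c("missing", "confirmed", "missing", "confirmed"),
  Percentage = c(0.4, 0.6, 0.1, 0.9)
)
# share of non-missing observations per variable
nonmiss <- aggregate(Percentage ~ Variables,
                     data = subset(summary_df, Category != "missing"),
                     FUN = sum)
# relevel so ggplot orders the bars by completeness
lv <- nonmiss$Variables[order(-nonmiss$Percentage)]
summary_df$Variables <- factor(summary_df$Variables, levels = lv)
levels(summary_df$Variables)  # "End" first (90% non-missing), then "Beg"
```

ggplot then draws the bars in the factor-level order, so the most complete variables appear first after coord_flip().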
Super! I find this really informative:
What do you think @jaeltan? What else can/could we learn from such a graph?
Thank you @henriquesposito for the great work, I agree that the plot looks amazing! I don't think the unique observations are much of an issue but it would be helpful to know whether they came from the GNEVAR dataset. I'm also wondering why the number of conflicting observations for Title seems relatively high, do we need to check or improve how manyID/treatyID are generated from the Title and Signature variables?
Thank you both for the great feedback! Indeed, I think we need to provide more info about the dataset sources of the variables (especially considering we have GNEVAR data in many of the databases). For now I am going to try and play a bit with some interactive hovering so that we can bring this information up without it cluttering the plot itself. Please let me know if you have any other ideas about possible additions or revisions to the plot as well.
As for the conflicts in Title, I am not sure where these are coming from; we should definitely investigate them. Additionally, some of the conflicts in "Beg" are likely generated by different date precision levels, which I have not yet dealt with.
I don't think everything needs to be in the plot -- it just needs to provoke the right questions.
Perhaps an additional function that calls and juxtaposes observations from each dataset that meet some criteria could be useful? For example, now that we know that there are some Title conflicts, can we bring up a data frame, or a list of data frames, that shows the observations with Title conflicts? Then we could just export this to a CSV for a student assistant to tackle.
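A self-contained sketch of that idea (hypothetical names, independent of manydata): given a wide data frame with one column per dataset, pull out the rows where the non-NA values disagree and write them out for review.

```r
# Hypothetical helper: return rows of a wide data frame where the listed
# columns (one per dataset) hold more than one distinct non-NA value.
juxtapose_conflicts <- function(df, cols, key = "ID") {
  disagree <- apply(df[cols], 1, function(x) {
    x <- x[!is.na(x)]
    length(unique(trimws(x))) > 1
  })
  df[disagree, c(key, cols)]
}

df <- data.frame(
  ID      = 1:3,
  Title.A = c("Treaty X", "Pact Y", NA),
  Title.B = c("Treaty X", "Accord Y", "Pact Z")
)
conflicts <- juxtapose_conflicts(df, c("Title.A", "Title.B"))
# only row 2 remains: "Pact Y" vs "Accord Y" disagree
# write.csv(conflicts, "title_conflicts.csv", row.names = FALSE)
```

Row 1 is confirmed and row 3 has a single non-NA value, so only the genuine conflict survives for the CSV export.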
I am not sure this is what we are looking for, but I have made an interactive version of the plot that allows users to hover over the image and see the datasets that a certain variable belongs to, and the percentage for the category they are hovering over. I find it very informative, but this adds an extra package dependency...
I am not able to attach it as a web page here so that you can play around with the image below, but I ask that you please take a look at the develop branch of manydata (you can just run the examples there) and let me know what you think.
Todo:
{messydates}
I reverted the interactive portion and added a first attempt to deal with date precision and comparison. I am still working on the new function for specific variable(s) and observations, as well as making the current function more concise/efficient.
One more thing, and this may be the sunk costs speaking, but I do find the interactive version of the plot more informative for users, especially the option of hovering to see the datasets each variable comes from. This is pertinent since we have GNEVAR datasets in many databases. So maybe we could find ways to display the datasets certain variables come from (instead of just the number, for example) in the non-interactive plot?
Perhaps different shades of yellow for which dataset the observation uniquely appears in? But how would this work for all the potential combinations for the confirmed, majority, and conflict categories?
@jhollway and @jaeltan I have added a new dbcomp() function that compares variables within datasets, as requested. The function allows users to see how all the rows in a variable are coded, or to specify a category of interest. Multiple variables can also be specified as a vector. The function returns a tibble with the key, the original values in each dataset, and the coded category. Please let me know what you think and if you have any suggestions/comments.
If you load the development version of manydata you should be able to run the following, for example:
dbcomp(database = emperors, variable = "Beg", key = "ID")
# gets how all the rows in "Beg" were coded
dbcomp(database = manyenviron::agreements, variable = "Title", category = "conflict")
# gets all the titles that are different, but very likely refer to the same treaty, that we managed to condense into the same manyID
Notice that I have not yet implemented the messydates precision comparison here (as I did for the plot), the more I think about this issue the more I think going down in precision for matching dates might actually defeat the purpose of messydates (i.e. we are doing exactly the same thing we argue against). Should we instead set this up as an extra argument for users (rather than an automated process in the function)?
@henriquesposito thanks for the function, I think it works pretty well! I'm just wondering if it would be helpful to add other variables, like the Beg date, so that we know whether they are actually the same agreements or not?
@jaeltan thank you for the feedback! I see your point; however, the variable argument takes more than one variable as a vector. So, in this case, users could run dbcomp(database = manyenviron::agreements, variable = c("Title", "Beg"), category = "conflict") to get conflicts in both the Title and Beg variables and check whether they refer to the same agreement.
I'm not sure if this is what @jaeltan means, but it could be helpful to list also other variables that are not the conflict per se but could provide information that could help resolve the conflict. Similarly, it could be helpful to include the link or such to the treaty texts so that it accelerates work resolving the conflicts.
Visual functions should show what is already included in the data, to help us diagnose where further development is needed in the data (i.e. what is missing and where) and to communicate better why the package is important/impactful.