Add visuals for databases/datasets

henriquesposito commented 2 years ago

Visual functions should show what is already included in the data, to help us diagnose where further development is needed in the data (i.e. what is missing and where) and to communicate better why the package is important/impactful.

jaeltan commented 2 years ago

Visualisations for

how the datasets relate to one another
where there are missing data
lineages

henriquesposito commented 2 years ago

Some references:

henriquesposito commented 2 years ago

The visualisations should be a 2D "mosaic" of a consolidated database. Five categories should be included:

Confirmed (color blue) = all datasets confirm value for row/column combination.
Unique (color green) = value for row/column combination is present in only one dataset within database.
Conflicting (color red) = value for row/column combination is conflicting across datasets within database.
Asymetric = value for row/column combination is conflicting across some dataset(s) but equal for other datasets within database.
Missing (color white) = value for row/column combination is NA

Question: how to do so in a computationally efficient way?

Possible answer: instead of retaining all data type information from datasets and consolidated database, works with arrays and the combinations/comparisons of rows/cols in these for a consolidated database.

For such approach, a broadly consolidated database might be ideal (rows and columns present in all datasets)... For other types of consolidated databases, a similar mosaic with information about which datasets the data comes from, instead of the categories above, could also work.

henriquesposito commented 2 years ago

A very rough first (working) draft of the function (I have also pushed these changes to manydata):

db_plot <- function(database, key) {
  if(length(grepl(key, purrr::map(database, names))) != length(database)) {
    stop("Please declare a key variable present in all datasets in the database.")
  }
  # reduce database
  out <- purrr::reduce(database, dplyr::full_join, by = key)
  # get variable names, but ID
  all_variables <- unname(unlist(purrr::map(database, names)))
  all_variables <- all_variables[!all_variables %in% key]
  # create an empty data frame
  db <- data.frame(out[,1], stringsAsFactors = TRUE)
  # check if values are missing, confirmed, conflicting, asymmetrical, or unique
  for (var in all_variables) {
    vvars <- paste0("^", var, "$|^", var, "\\.")
    vars_to_combine <- grepl(vvars, names(out))
    if (length(out[vars_to_combine]) > 1) {
      # paste variables to work at the string value
      value <- apply(out[vars_to_combine], 1, paste, collapse = "!")
      # weird code added to some variables? need to double check messydates...
      value <- stringr::str_remove_all(value, "\032")
      # remove string duplicates and collapse unique values
      value <- sapply(stringr::str_split(value, "!"),
                      function(x) {paste(unique(trimws(x)), collapse = '!')})
      # work with fractions to determine category
      value <- ifelse(stringr::str_detect(value, "^NA$|^na$|^NA_character$"),
                      "missing", value)
      value <- ifelse(stringr::str_count(value, "\\!") ==
                        (length(out[vars_to_combine]) - 1), "conflict", value)
      value <- ifelse(stringr::str_count(value, "\\!") == 0 &
                        !grepl("^missing$|^conflict$", value),
                      "confirmed", value)
      value <- ifelse(!grepl("^missing$|^conflict$|^confirmed$", value),
                      "asymmetric", value)
      } else {
        value <- out[vars_to_combine]
        value <- ifelse(is.na(value), "missing", "unique")
    }
    # fill df
    db[, var] <- value
  }
  # gather data
  dbgather <- db %>%
    dplyr::select(-key) %>% 
    tidyr::pivot_longer(cols = everything(), names_to = "variable",
                        values_to = "category") %>% 
    dplyr::group_by(variable, category) %>%
    dplyr::summarise(count = n()) %>%
    dplyr::mutate(perc = count / sum(count)) %>% 
    tidyr::pivot_wider(id_cols = variable, names_from = category,
                       values_from = perc) %>% 
    dplyr::mutate(across(everything(), ~tidyr::replace_na(.x, 0))) %>% 
    tidyr::pivot_longer(asymmetric:confirmed) %>% 
    dplyr::rename(Variables = variable, Category = name, Percentage = value)
  # plot
    ggplot(dbgather, aes(fill = Category, y = Percentage, x = Variables )) + 
      geom_bar(position="fill", stat="identity") +
      coord_flip()
}
# example: db_plot(database = manydata::emperors, key = "ID")

Of course the function needs be made more concise, the visuals adapted, and the name changed. However, before doing so, there are a few things we might want to decide on.

1- The big question here is what to do with NAs when there are conflicts? Suppose, for example, we have 2 datasets that mark an observation as NA, but another one that has a non-NA value for it. Is this asymmetric conflict or a unique value? Or, alternatively, suppose we have 2 datasets that have different values for an observation and another that marks it as NA, is this a conflict or an asymmetric conflict? Should we count or remove NAs in these cases? I am keeping NAs as meaningful observations for now but I am not sure this is the best approach. As it stands we can only have unique values in variables present in one dataset. Most missing values are coded in these variables since whenever there is a conflict NAs are taken as meaningful.

2- Another consideration is what to do with observations for variables present and confirmed in multiple datasets but these variables are not present in all datasets? For now these are treated as confirmed. Do we set a level of "confirmation" for these (e.g. 100% (all datasets in database), 0.66% (2/3), 50% (1/2))?

3- Specifically to date variables, what do we do if dates have the same year but one observation has the year only while other observation is more precise (e.g. ymd)? Are these the same or not?

@jhollway and @jaeltan please let me know what you think about the issues raised above. Once we have ironed these out I can adapt the function accordingly and work on improving the output plot.

jhollway commented 2 years ago

1- The big question here is what to do with NAs when there are conflicts? Suppose, for example, we have 2 datasets that mark an observation as NA, but another one that has a non-NA value for it. Is this asymmetric conflict or a unique value?

This is unique.

Or, alternatively, suppose we have 2 datasets that have different values for an observation and another that marks it as NA, is this a conflict or an asymmetric conflict?

This is a conflict.

Should we count or remove NAs in these cases? I am keeping NAs as meaningful observations for now but I am not sure this is the best approach. As it stands we can only have unique values in variables present in one dataset. Most missing values are coded in these variables since whenever there is a conflict NAs are taken as meaningful.

Ignore NAs when there is at least one observation.

2- Another consideration is what to do with observations for variables present and confirmed in multiple datasets but these variables are not present in all datasets? For now these are treated as confirmed. Do we set a level of "confirmation" for these (e.g. 100% (all datasets in database), 0.66% (2/3), 50% (1/2))?

No, just confirmed. If an observation is corroborated by two sources, this is enough. Data quality should not go down if more, smaller datasets are added.

3- Specifically to date variables, what do we do if dates have the same year but one observation has the year only while other observation is more precise (e.g. ymd)? Are these the same or not?

Good question. Is this confirmed, asymmetric conflict, or unique? Let's call it an asymmetric conflict for now, what do you think @jaeltan ?

jhollway commented 2 years ago

I tried the plot and have a few comments:

what do you think about ignoring the ID columns? The actual amount of pairwise overlap between datasets could be a different diagnostic. The current plot is about the state of the (consolidated) database as a whole.
can we try theme_minimal()?

What other aesthetic suggestions do you have, @jaeltan ?

jaeltan commented 2 years ago

Thanks @henriquesposito for the function! In response to some of the comments already made:

I agree that we could drop the ID variables since the observations are already matched by manyID.
Could we change the colour for 'missing' to grey instead of 'white'? On a light background in R white is not very clear.
Not sure if this is helpful, but could we sort the variables like we do for the datasets?
If we are adding these plots to the readme files/websites maybe we need a label for the plot so that we know which database it is
I'm not very sure what asymmetric conflict is supposed to mean. Does it mean that one of the datasets has a more reliable observation that we 'prefer'? Henrique and I had a small discussion about the date variables last week, and I think while the YYYYMMDD is definitely preferable to YYYY for the date observation, I don't think it's really a conflict if the year is the same.

Another thought I had was whether it would be useful to compare what we have in the GNEVAR dataset (if there is one in the database) to the observations in the other datasets in the same database? Since we are adding corrections to the data in the GNEVAR datasets this should be a kind of benchmark to check what we have.

henriquesposito commented 2 years ago

Thank you so much for the helpful feedback @jhollway and @jaeltan !

I updated the colors and theme, dropped ID columns, and re-worked the function to solve variables as discussed above. I am currently working on how to handle dates and code asymmetric/conflict when there are multiple NAs. When it comes to dates, I am not sure we should call similar dates with different levels of precision (e.g. YYYYMMDD and YYYY) asymmetric, it may confuse users who would be inclined to think that observation is confirmed in multiple datasets but not all ... I believe we should either make a choice to call them confirmed (treat them as equivalent) and add a note for date variables with different levels of precision or call it a conflict (treat them as not equal). In this case, asymetric would mean indeed multiple confirm but not all for consistency.

Besides this, what other features should we add? Should we, for example, allow users to pick columns they are interested in (or columns that appear in more than one dataset)? Could they pick specific datasets to compare within a database?

jhollway commented 2 years ago

Great, thanks @henriquesposito . Would you mind appending the latest plot to your comment so that we can see what it looks like now?

I agree with you both that 'asymmetric conflict' sounds a bit negative and/or misleading. How about:

consensus [blue]: two or more datasets give same value and no other values given OR other values are less precise
majority [cyan]: datasets give different non-missing values, but majority in favour of certain value OR one value is more precise or compatible
unconfirmed [green]: only one dataset gives a non-missing value
imprecise [yellow]: only one dataset gives a non-missing value, but it is imprecise/ambiguous/etc
conflict [red]: even number of datasets give different non-missing values
missing [grey, or none?]: no non-missing value is observed in any of the constituent datasets

I invite your feedback on this @henriquesposito and @jaeltan

jaeltan commented 2 years ago

I would agree with this classification @jhollway but two comments:

Are we defining majority as in at least half of the number of datasets?
For consensus in a date variable, what if only one of the datasets gives a precise YYYYMMDD value and the other datasets only have YYYY or missing observations? Can we say there is still a consensus as long as the year is the same?

jhollway commented 2 years ago

Majority is the majority of datasets that have some non-missing value
Yes, I think that's right @jaeltan, but I'm open to arguments that it is instead cyan or green...

henriquesposito commented 2 years ago

Ok, issues with determining which observations are conflicting and which are asymmetrical have been fixed. The function now correctly identifies all categories (though it still needs to be made more concise and efficient).

@jhollway and @jaeltan thank you for the comments! I have not yet implemented the new categories/colors, but below is the latest plot:

manydata::db_plot(manyenviron::agreements, "manyID")

Now, when it comes to the new categories, I would perhaps argue that we should not determine/code based on the level of precision/ambiguity for values (unless these values are dates, for which we have methods to determine levels of precision). I would favor a simpler and less "opinionated" approach. This also might make the plot easier for users to understand. I do not mind "asymmetric" (without the word conflict perhaps) as a category. Maybe we can even add a TRUE/FALSE argument so users can choose how precise and imprecise (but similar) dates should be treated.

1- Confirmed values are the same in all datasets in database. 2- Unique values appear once in datasets in database. 3- Missing values are missing in all datasets in database. 4- Conflicting values are different in the same number of datasets in database. 5- Asymmetric values have the same value in multiple, but not all, datasets in database.

Perhaps more important would be the ability to show users which variables appear in multiple datasets in the database so that they can make better sense of the plot as a whole (e.g. are unique/missing values in a variable that appears in one dataset or a variable that appears in multiple datasets?). Indeed @jaeltan perhaps re-ordering the variables can help.

jhollway commented 2 years ago

Thanks @henriquesposito. Some further suggestions, no particular order:

[x] categories should be ordered so that missing comes last; it's the least useful (sic) observation. My preference would be the ordering I proposed in my last comment.
[x] maybe the variables should be on the x-axis so that there is some correspondence to the underlying form?
[ ] these variable names might not mean much without some explanation; is it possible to add more informative explanations on hover or as sub-labels?
[x] maybe there should be an indication somewhere of how many observations there are? It matters whether it's 100 or 100,000 observations...
[x] "manyID" should be the default key, so it doesn't need to be mentioned
[x] "asymmetric" is even more difficult to deparse without "conflict" than with. I prefer "majority" unless there's another suggestion?
[x] can we make the colors (especially the green) a little less lurid?

henriquesposito commented 2 years ago

Latest version:

db_plot(manyenviron::agreements, "manyID")

As for adding sub-labels (or hover) to variables, I agree we should, but the issue is that we often do not have easily available information on many of these variables (besides the variables we code often as Beg, End, Title). What types of information should we add? Notice I also added the number of datasets variable appears in the latest version of the function.

One more thing any suggestions for the name of the function?

jhollway commented 2 years ago

Hi @henriquesposito , it's definitely getting there, isn't it? Here are some additional/remaining tweaks:

[x] Shouldn't need to mention "manyID" as the key argument; it should be the default.
[x] Can we reverse the y-scale/axis so that it starts with 0 and goes down to 1? Missing observations should trail...
[x] The subtitle is good; I wondered whether we could add number of observations information to at least the 0.5 and 1.0 percentage scale, so that it reads e.g. "1.00 \newline (8689)"?
[x] I can see that the variables are ordered alphabetically, which might be helpful for some, but perhaps it could be useful to order them by most non-missing/confirmed observations?
[x] The color scale is an improvement, but it's still quite a jarring color scale, with fairly vibrant or at least rich red and blue in the centre, and more pastel-ly green and orange on the margins. Can we try picking up the design cues from such a spectrum:

jhollway commented 2 years ago

Actually, on second thoughts, the green in that spectrum is too lurid, but the rest, e.g. cyan and yellow, helps with a visual ordering, no? Perhaps the green could be toned down just a bit?

henriquesposito commented 2 years ago

jhollway commented 2 years ago

This looks really good @henriquesposito . How are the variables ordered? By number of datasets in which the variable is present? And within that? Maybe prioritise non-missing observations? confirmed observations? Non-conflict non-missing observations?

henriquesposito commented 2 years ago

Variables, in theory, should be ordered by the percentage of non-missing observations... though this does not seem to be the case in plot above. I am still trying to understand why this is happening, I will update the plot and repost the plot here soon.

henriquesposito commented 2 years ago

jhollway commented 2 years ago

Super! I find this really informative:

Title and Beg are confirmed in roughly half of observations, but then there are many unique observations and even some conflicts...
Many of the other other variables have a majority of non-missing observations as unique -- are these GNEVAR-added variables?
The conflicts are, arguably, quite manageable. That is, the proportion of observations in any variable that are conflicts are relatively few. If GNEVAR doesn't have a value for these cells, then these should be the ones that new project members should concentrate on to move conflict → majority.

What do you think @jaeltan? What else can/could we learn from such a graph?

jaeltan commented 2 years ago

Thank you @henriquesposito for the great work, I agree that the plot looks amazing! I don't think the unique observations are much of an issue but it would be helpful to know whether they came from the GNEVAR dataset. I'm also wondering why the number of conflicting observations for Title seems relatively high, do we need to check or improve how manyID/treatyID are generated from the Title and Signature variables?

henriquesposito commented 2 years ago

Thank you both for the great feedback, indeed I think we need to provide more info about the dataset sources of the variables (specially considering we have GNEVAR data in many of the databases). For now I am going to try and play a bit with some interactive hoovering so that we can bring this information up without being too much in the plot itself. Please let me know if you have any other ideas about possible additions or revisions to the plot as well.

As for the conflicts in title, I am not sure where these conflicts are coming from, we should definitely investigate them. Alternatively, some of the conflicts in "Beg" are likely generated by different date precision levels which I have not yet dealt with.

jhollway commented 2 years ago

I don't think everything needs to be in the plot -- it just needs to provoke the right questions.

Perhaps an additional function that calls and juxtaposes observations from each dataset that meet some criteria could be useful? For example, now that we now that there are some title conflicts, can we bring up a data frame or list of data frames that show the observations that have title conflicts? Then we just export this to a CSV for a student assistant to tackle.

henriquesposito commented 2 years ago

I am not sure this is what we are looking for, but I have made an interactive version of the plot that allows users to hoover over the image and see the datasets that certain variable belongs to, and the percentage for the category they are hoovering above. I find it very informative but this adds an extra package dependency...

I am not able to attach it as a web page here so that you can play around with the image below, but I ask that you please take a look at the develop branch of manydata (you can just run the examples there) and let me know what you think.

Todo:

[x] decide if function will be interactive or not
[x] add a new function that returns a dataframe of where conflicts are, for example, for users
[x] decide what to do with dates, for the time being, dates with different levels of precision are treated as different for consistent with {messydates}
[ ] make function more concise (too long/complex at the moment)

henriquesposito commented 2 years ago

I reverted the interactive portion and added a first attempt to deal with date precision and comparison. I am still working on the new function for specific variable(s) and observations, as well as making the current function more concise/efficient.

One more thing, this may be sunk costs speaking, but I do find the interactive version of the plot more informative to users, especially the option of hovering to see the datasets each variable comes from. This is pertinent since we have GNEVAR datasets in many databases. So maybe we could possibly find ways in which to display the datasets certain variables come from (instead of the number for example) in the non interactive plot?

jhollway commented 2 years ago

Perhaps different shades of yellow for which dataset the observation uniquely appears in? But how would this work for all the potential combinations for the confirmed, majority, and conflict categories?

henriquesposito commented 2 years ago

@jhollway and @jaeltan I have added a new dbcomp() function that compares variables within datasets, as requested. The function allows to see how all rows in variable are code or specify a category of interest. Multiple variables can also be specified as a list. The function returns a tibble with the key, the original values in each datasets, and the coded category. Please let me know what you think and if you have any suggestions/comments.

If you load the development version of data you should be able to run the following, for example:

dbcomp(database = emperors, variable = "Beg", key = "ID")
# gets how all the rows in "Beg" were coded
dbcomp(database = manyenviron::agreements, variable = "Title", category = "conflict")
# gets all the titles that are different, but very likely refer to the same treaty, that we managed to condense into the same manyID

Notice that I have not yet implemented the messydates precision comparison here (as I did for the plot), the more I think about this issue the more I think going down in precision for matching dates might actually defeat the purpose of messydates (i.e. we are doing exactly the same thing we argue against). Should we instead set this up as an extra argument for users (rather than an automated process in the function)?

jaeltan commented 2 years ago

@henriquesposito thanks for the function, I think it works pretty well! I'm just wondering if it would be helpful to add other variables like the Beg date so that we know whether they are actually the same agreements or not?

henriquesposito commented 2 years ago

@jaeltan thank you for the feedback! I see your point, however, the variable argument takes more than one variable as lists. So, in this case users could run dbcomp(database = manyenviron::agreements, variable = c("Title", "Beg"), category = "conflict") and get conflicts in both title and beg variables and check if they are the same agreement.

jhollway commented 2 years ago

I'm not sure if this is what @jaeltan means, but it could be helpful to list also other variables that are not the conflict per se but could provide information that could help resolve the conflict. Similarly, it could be helpful to include the link or such to the treaty texts so that it accelerates work resolving the conflicts.

globalgov / manydata

Add visuals for databases/datasets #214