globalgov / manydata

The portal for global governance data
https://manydata.ch
GNU Affero General Public License v3.0

Allow for key vector in consolidate() instead of single key variable #202

Closed BBieri closed 2 years ago

BBieri commented 2 years ago

dplyr::inner_join() and dplyr::full_join() accept a named vector for their by argument. We could implement the same feature for the key argument of our consolidate() function, which would make it easier to join panel data, for instance, where individual observations are identified by a "country-year" pair. Here is an example:

consolidate(emperors, "any", "every", resolve = "min", key = c("ID" = "ID", "Year" = "Year"))
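For reference, here is a minimal sketch of the dplyr behaviour this proposal mirrors. The `gdp` and `pop` tibbles are made-up toy data, not part of manydata, and the sketch only shows how dplyr itself treats a named `by` vector, not how consolidate() is implemented.

```r
library(dplyr)

# Hypothetical toy panel data: one row per country-year pair.
gdp <- tibble(
  ID   = c("CHE", "CHE", "FRA"),
  Year = c(2000, 2001, 2000),
  gdp  = c(1.0, 1.1, 2.0)
)
pop <- tibble(
  ID   = c("CHE", "FRA", "FRA"),
  Year = c(2000, 2000, 2001),
  pop  = c(7.2, 60.9, 61.4)
)

# In a named `by` vector, names refer to columns in the left table
# and values to columns in the right table.
keys <- c("ID" = "ID", "Year" = "Year")

# Keep only country-year pairs present in both tables.
both <- inner_join(gdp, pop, by = keys)

# Keep country-year pairs present in either table.
either <- full_join(gdp, pop, by = keys)
```

With these toy tables, a row is matched only when both ID and Year agree, which is the "country-year" identification described above.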

jhollway commented 2 years ago

Yes, this would be excellent.

henriquesposito commented 2 years ago

I have been playing with this idea for a bit. Multiple keys are not an issue for the first parts of the function, which build the purrr::reduce() call and the dplyr joins. With a few tweaks in the later parts of consolidate(), this now works in theory, as long as the keys are present in all datasets in a database, of course. For example:

consolidate(database = emperors, rows = "any", cols = "any", resolve = "coalesce", key = c("ID", "Beg"))
consolidate(database = emperors, rows = "any", cols = "every", resolve = "min", key = c("ID", "Beg"))
consolidate(database = emperors, rows = "every", cols = "any", resolve = "max", key = c("ID", "Beg"))

The number of observations returned generally increases with two keys, since fewer rows are matched when we use "any" for rows or columns. However, if you set both the rows and cols arguments to "every", the consolidated dataset can have no rows at all if none are matched across all datasets. For example:

consolidate(database = emperors, rows = "every", cols = "every", resolve = "max", key = c("ID", "Beg"))
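The row-count behaviour can be reproduced outside consolidate() with a rough sketch. It assumes, purely for illustration, that rows = "any" behaves like a full join and rows = "every" like an inner join reduced over the datasets. The `db` list is hypothetical toy data and not the emperors database.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for a database: a named list of datasets
# that all carry the key columns ID and Beg.
db <- list(
  a = tibble(ID = c("x", "y"), Beg = c("0001", "0010"), End = c("0005", "0020")),
  b = tibble(ID = c("x", "y"), Beg = c("0001", "0011"), Death = c("0005", "0021")),
  c = tibble(ID = c("x", "z"), Beg = c("0002", "0030"), Cause = c("illness", "battle"))
)

# Helper (hypothetical): reduce a list of datasets with one join function and key set.
join_all <- function(datasets, join, keys) {
  reduce(datasets, function(x, y) join(x, y, by = keys))
}

# Assumed analogue of rows = "any": full joins on one key versus two keys.
full_one <- join_all(db, full_join, "ID")
full_two <- join_all(db, full_join, c("ID", "Beg"))

# Assumed analogue of rows = "every": inner joins on two keys.
inner_two <- join_all(db, inner_join, c("ID", "Beg"))

c(nrow(full_one), nrow(full_two), nrow(inner_two))
```

In this toy case the second key splits rows that a single key would have matched, so the full join grows while the inner join comes back empty.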

The only issue we might have with multiple keys arises when they are used while resolving different variables in different ways. This is because the class of the keys is changed to character when you declare multiple keys. For example (not working):

consolidate(database = emperors, rows = "every", cols = "every", resolve = c(Beg = "min", End = "max"), key = c("ID", "Beg"))

While this enhancement works for the most part and is already implemented in consolidate(), I am not sure whether we should document it now or in a future release. Perhaps we should first understand better what we want to achieve with this, what the advantages are, and how to solve the class issue when resolving multiple variables differently. @jhollway and @BBieri, what do you think?

jhollway commented 2 years ago

Hi @henriquesposito, indeed lots to consider here.

  1. Do the same named keys need to be in every database, or can we allow matching as in dplyr, e.g. key = c("ID" = "id")?
  2. Yes, every/every would be a pretty demanding expectation
  3. When would one resolve a variable that you are using as a key?

henriquesposito commented 2 years ago

I apologise for the mistake: yes, one should not resolve a variable that is being used as a key. The function does work for resolving multiple variables differently when we have multiple keys, for example:

consolidate(database = emperors, rows = "any", cols = "any", resolve = c(Death = "max", Cause = "coalesce"), key = c("ID", "Beg"))

Ideally, the same named keys should be in every dataset of the database passed to consolidate(). However, matching is possible if equivalent key columns have different names across datasets. Say, for instance, that the id column in the wikipedia dataset of the emperors database is named "id" while the id columns in the other datasets are named "ID". In that case we can indeed use key = c("ID" = "id") to declare that these keys are the same. For example:

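library(manydata)
# Extract the three datasets in the emperors database
w <- emperors$wikipedia
u <- emperors$UNRV
b <- emperors$britannica
# Rename the key in one dataset to mimic differently named key columns
w <- dplyr::rename(w, id = "ID")
# Rebuild a named list of datasets and consolidate on the mismatched keys
e <- tibble::lst(w, u, b)
consolidate(e, "any", "any", "coalesce", key = c("id" = "ID"))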
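To show what the named key does at the join level, here is a small sketch using plain dplyr and purrr. The three tibbles are invented stand-ins for the renamed datasets, and reducing with full joins is an assumption about the "any" case, not a description of consolidate()'s internals.

```r
library(dplyr)
library(purrr)

# Hypothetical toy datasets: the first uses "id", the others use "ID".
w <- tibble(id = c("a", "b"), Beg = c("1", "2"))
u <- tibble(ID = c("a", "c"), End = c("3", "4"))
b <- tibble(ID = c("b", "c"), Cause = c("x", "y"))

# Names in `by` refer to the left (accumulated) table, values to the right one.
# Because `w` comes first, the accumulated result keeps the column name "id",
# so the same mapping works for every later join in the reduction.
joined <- reduce(
  list(w, u, b),
  function(x, y) full_join(x, y, by = c("id" = "ID"))
)

names(joined)
```

The direction of the mapping depends on which dataset comes first in the list: here the left-hand name has to be "id" because that is what the first dataset calls the key.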

I will update the function documentation to describe the use of multiple keys and the matching of differently named key variables.