markmfredrickson / optmatch

Functions for optimal matching in R
https://markmfredrickson.github.io/optmatch
Other
47 stars 14 forks source link

Special `match_on()` action when `dat=` a tibble? #191

Open benthestatistician opened 4 years ago

benthestatistician commented 4 years ago

Tibbles typically have non-informative row-names. Tidyverse processes that generate tibbles may silently replace informative rownames with non-informative ones. Since we place meaning in row names, problems may ensue.

190 addresses the worst of the tibble-trouble, but we can certainly move further toward accommodating tidyverse users. Perhaps it would be helpful if match_on() returned an ISM without row names when given a tibble in its dat= slot. Since they're no informative anyway. OTOH, maybe nameless ISMs/DenseMatrix-es would generate too many warnings and trouble with other functions to be useful.

Discuss.

benthestatistician commented 4 years ago

The use case I had in mind is the one described in comments to #190:

  1. User fits psmA using data dat; then distA = match_on(psmA, data=dat). At this point dat has informative row names.
  2. User then applies dplyr::mutate() to add columns to dat, overwriting dat with the result of this. (From the user's perspective, it's the same data set, just with some new columns added on.)
  3. User now fits psmB, again using data dat; distB = match_on(psmB, data=dat).
  4. When user does fullmatch(psmA + psmB), problem is declared to be infeasible.

What's happened here is that at step 2, mutate silently dropped the informative row names from step 1, replacing them w/ as.character(1L:nrow(dat)); then at step 4, the ISM psmA + psmB forbids all matches, due to incompatible dimnames.

190 addresses the scenario where the dat of step 1 is a data frame, whereas the dat of step 3 is a tibble: that user now gets an informative error. The user can fix this error by casting dat as a tibble before ever giving it to match_on(), so that both ISMs carry the same (non-informative) row and column names.

But that's still annoying. This user would have a much easier time if we simply let them proceed, with a warning. Given the rules for ISM arithmetic that #190 puts into place, that's what would happen if at stage 3 we produced an ISM without row or column names, as I propose in this issue.