Closed DavisVaughan closed 4 years ago
I spent some time with these changes and feel good about almost all of it.
Everything that I tried works as I believe we expect except for changing column names using mutate()
, related to "Column names must be exactly the same after sorting them" above. I see the same behavior with both old and new dplyr, which is... good? I think? Except that my feeling is it should fall back to a tibble if the columns are changed in this way.
library(tidymodels)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.6
#> ✓ dplyr 0.8.5 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.0 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.1.9000 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
trees_param <- parameters(trees(), min_n(), sample_size())
basic_af <- tibble(id = rep("trees", 3), hello = 1:3)
## look good
trees_param %>% mutate(hello = 1)
#> # A tibble: 3 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <dbl>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 min_n min_n list unknown unknown <nparam[+]> 1
#> 3 sample_size sample_size list unknown unknown <nparam[?]> 1
trees_param %>% arrange(id)
#> Collection of 3 parameters for tuning
#>
#> id parameter type object class
#> min_n min_n nparam[+]
#> sample_size sample_size nparam[?]
#> trees trees nparam[+]
#>
#> Parameters needing finalization:
#> # Observations Sampled ('sample_size')
#>
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
trees_param %>% select(id)
#> # A tibble: 3 x 1
#> id
#> <chr>
#> 1 trees
#> 2 min_n
#> 3 sample_size
anti_join(trees_param, basic_af, by = "id")
#> Collection of 2 parameters for tuning
#>
#> id parameter type object class
#> min_n min_n nparam[+]
#> sample_size sample_size nparam[?]
#>
#> Parameters needing finalization:
#> # Observations Sampled ('sample_size')
#>
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
right_join(trees_param, basic_af, by = "id")
#> # A tibble: 3 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <int>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 trees trees list unknown unknown <nparam[+]> 2
#> 3 trees trees list unknown unknown <nparam[+]> 3
full_join(trees_param, basic_af, by = "id")
#> # A tibble: 5 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <int>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 trees trees list unknown unknown <nparam[+]> 2
#> 3 trees trees list unknown unknown <nparam[+]> 3
#> 4 min_n min_n list unknown unknown <nparam[+]> NA
#> 5 sample_size sample_size list unknown unknown <nparam[?]> NA
left_join(trees_param, basic_af, by = "id")
#> # A tibble: 5 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <int>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 trees trees list unknown unknown <nparam[+]> 2
#> 3 trees trees list unknown unknown <nparam[+]> 3
#> 4 min_n min_n list unknown unknown <nparam[+]> NA
#> 5 sample_size sample_size list unknown unknown <nparam[?]> NA
## this one surprises and concerns me: STILL PARAMETERS
trees_param %>% mutate(id = paste0(id, "_better"),
name = paste0(name, "_best"))
#> Collection of 3 parameters for tuning
#>
#> id parameter type object class
#> trees_better trees_best nparam[+]
#> min_n_better min_n_best nparam[+]
#> sample_size_better sample_size_best nparam[?]
#>
#> Parameters needing finalization:
#> # Observations Sampled ('sample_size_better')
#>
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
Created on 2020-05-20 by the reprex package (v0.3.0)
library(tidymodels)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.6
#> ✓ dplyr 0.8.99.9003 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.0 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.1.9000 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
trees_param <- parameters(trees(), min_n(), sample_size())
basic_af <- tibble(id = rep("trees", 3), hello = 1:3)
## look good
trees_param %>% mutate(hello = 1)
#> # A tibble: 3 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <dbl>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 min_n min_n list unknown unknown <nparam[+]> 1
#> 3 sample_size sample_size list unknown unknown <nparam[?]> 1
trees_param %>% arrange(id)
#> Collection of 3 parameters for tuning
#>
#> id parameter type object class
#> min_n min_n nparam[+]
#> sample_size sample_size nparam[?]
#> trees trees nparam[+]
#>
#> Parameters needing finalization:
#> # Observations Sampled ('sample_size')
#>
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
trees_param %>% select(id)
#> # A tibble: 3 x 1
#> id
#> <chr>
#> 1 trees
#> 2 min_n
#> 3 sample_size
anti_join(trees_param, basic_af, by = "id")
#> Collection of 2 parameters for tuning
#>
#> id parameter type object class
#> min_n min_n nparam[+]
#> sample_size sample_size nparam[?]
#>
#> Parameters needing finalization:
#> # Observations Sampled ('sample_size')
#>
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
right_join(trees_param, basic_af, by = "id")
#> # A tibble: 3 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <int>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 trees trees list unknown unknown <nparam[+]> 2
#> 3 trees trees list unknown unknown <nparam[+]> 3
full_join(trees_param, basic_af, by = "id")
#> # A tibble: 5 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <int>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 trees trees list unknown unknown <nparam[+]> 2
#> 3 trees trees list unknown unknown <nparam[+]> 3
#> 4 min_n min_n list unknown unknown <nparam[+]> NA
#> 5 sample_size sample_size list unknown unknown <nparam[?]> NA
left_join(trees_param, basic_af, by = "id")
#> # A tibble: 5 x 7
#> name id source component component_id object hello
#> <chr> <chr> <chr> <chr> <chr> <list> <int>
#> 1 trees trees list unknown unknown <nparam[+]> 1
#> 2 trees trees list unknown unknown <nparam[+]> 2
#> 3 trees trees list unknown unknown <nparam[+]> 3
#> 4 min_n min_n list unknown unknown <nparam[+]> NA
#> 5 sample_size sample_size list unknown unknown <nparam[?]> NA
## this one also surprises and concerns me: STILL PARAMETERS
trees_param %>% mutate(id = paste0(id, "_better"),
name = paste0(name, "_best"))
#> Collection of 3 parameters for tuning
#>
#> id parameter type object class
#> trees_better trees_best nparam[+]
#> min_n_better min_n_best nparam[+]
#> sample_size_better sample_size_best nparam[?]
#>
#> Parameters needing finalization:
#> # Observations Sampled ('sample_size_better')
#>
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
Created on 2020-05-20 by the reprex package (v0.3.0)
Unfortunately I think that one is going to have to be allowed. The internal check makes sure that the types of the columns are still as expected (i.e. character), and ensures that id
isn't duplicated, but it doesn't check anything else about the data in the columns.
We can't check that the columns are exactly the same before and after a transformation, because we want things like df[1:2,]
to work and still return a parameters df.
There is currently no way to say: "The data in a column was X before the transform, now after the transform it is Y. Now check that they are still compatible with each other".
The no duplicate check in id
works because it doesn't rely on the data before the transformation in any way.
OK, makes sense, I am clearer now on what kind of changes in columns are being checked. ✅
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
This PR adds vctrs and dplyr 1.0.0 support for the dials
parameters
subclass. I will do the param-grid one in a separate PR.This currently relies on dev vctrs because of a bug in
vec_rbind()
that had to be fixed.It works very similarly to https://github.com/tidymodels/rsample/pull/150 and revolves around the following four helpers (which also exist in rsample):
parameters_reconstructable(x, to)
- Isx
reconstructable toto
?parameters_strip(x)
- Strip off the parameters related attributes ofx
parameters_reconstruct(x, to)
- Reconstructx
into a parameters object using the attributes ofto
parameters_maybe_reconstruct(x, to)
- Strips if not reconstructable, otherwise it reconstructsThe first three helpers seem to be the common pieces when creating a tibble subclass (the 4th is just a combination of the other 3). The dplyr and vctrs methods use them to implement casting, common types, restoration, etc.
x
is reconstructable to a parameters object,to
, if the following invariants hold when comparing the two objects:These rules are somewhat different from rsample.
The setup is generally the same as with rsample, except for the fact that rsets were special because they had a tibble ptype (taking a 0-row slice fell back to tibble). That isn't the case here, taking a 0-row slice of a parameters tibble is still a parameters tibble. This is more inline with how vctrs extensions usually work, and we actually have
vec_ptype2()
methods that get used this time.The common type of
<parameters>
with<parameters>
is another<parameters>
. This allows you tovec_c()
andvec_rbind()
multiple parameters objects and retain the parameters class as long as the result doesn't break any of the class invariants outlined above.The common type of
<parameters>
with<data.frame>
or<tibble>
is a<tibble>
. Combining a parameters object with a tibble should always result in a new bare tibble. This just makes the most sense since you can no longer restrict the columns to only the parameters related columns.This PR also removes the
info
attribute from param-grids