tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org
Other
101 stars 16 forks source link

Issue in `hardhat:::alter_formula_environment()` #180

Closed ddsjoberg closed 2 years ago

ddsjoberg commented 2 years ago

Hello!

I've encountered an issue using hardhat::mold() originating from hardhat:::alter_formula_environment(). I've created a reprex below illustrating the issue.

The hardhat:::alter_formula_environment() updates the formula's environment to the parent of the global environment. But if the formula contains any functions that are not available in the updated search path, we get an error.

hh_alter_formula_environment_error <- function() {

  silly_identity <- identity
  formula <- mpg ~ silly_identity(am)
  data <- mtcars

  # this works
  message("Testing `model.frame()`")
  model.frame(formula, data) |> head() |> print()

  # this does not work
  message("Testing `hardhat::mold()`")
  hardhat::mold(
    formula,
    data,
    blueprint = hardhat::default_formula_blueprint(indicators = "none")
  )
}

hh_alter_formula_environment_error()
#> Testing `model.frame()`
#>                    mpg silly_identity(am)
#> Mazda RX4         21.0                  1
#> Mazda RX4 Wag     21.0                  1
#> Datsun 710        22.8                  1
#> Hornet 4 Drive    21.4                  0
#> Hornet Sportabout 18.7                  0
#> Valiant           18.1                  0
#> Testing `hardhat::mold()`
#> Error in eval_bare(node_car(expr), env): object 'silly_identity' not found

Created on 2022-01-02 by the reprex package (v2.0.1)

In my particular case, I am using the survival::Surv() on the LHS of the formula in a package. That package is being used in another package and the Surv() function is being lost in the search path. This won't be an issue for me once we can use :: on the LHS of a formula (fixed in #173, THANK YOU!), because I can specify survival::Surv() on the LHS instead of Surv() removing any ambiguity where the function lives.

DavisVaughan commented 2 years ago

I am fairly certain that we want to keep this behavior. The purpose of it is to prevent users from putting global variables in their formula, as it is highly likely that these global variables will be available at mold() (model fit) time, but not at forge() (model prediction) time - especially if they save their models and reload them at a later time for prediction.

For inline functions, generally these come from a pre-existing package, so the best practice is to use the package namespace prefix like survival::Surv() to specify them in the formula, which ensures that Surv() is available at mold and forge time too.

I have a feeling you have probably done @importFrom survival Surv in your package, and expected Surv() to work. I think this is reasonable, but the environment chain that we currently look for inline functions in only contains attached packages, i.e. ones that have been library()d, which isn't the case here with survival (it is just loaded). While it may be possible to improve on this, I think that your case is somewhat rare VS what an end user would do. So I think my best advice to you is to prefix with survival:: now that that works properly.

I documented this a little more in 33d23ab

ddsjoberg commented 2 years ago

Thank you so much for the response! You're exactly right about the use...the package was attached, not loaded, leading to the error. Thanks to this (#173) update, it won't be an issue! 🕺🏼

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.