rstudio / tfestimators

R interface to TensorFlow Estimators
https://tensorflow.rstudio.com/tfestimators
57 stars 21 forks source link

'predict()' should marshal return values into nicer data structures when appropriate #49

Closed kevinushey closed 6 years ago

kevinushey commented 7 years ago

For example, given this code:

# define feature columns
  .feature_columns <- feature_columns(
    cyl = column_numeric("cyl"),
    drat = column_numeric("drat")
  )

  .input_fn <- input_fn(
    mtcars,
    response = "mpg",
    features = names(columns)
  )

  # try fitting a linear regression model with tfestimators
  tf_model <-
    linear_regressor(feature_columns = unname(.feature_columns)) %>%
    train(input_fn = .input_fn)

  predict(tf_model, input_fn = .input_fn)

The output is of the form:

> predict(tf_model, input_fn = .input_fn) %>% str()
List of 32
 $ :List of 1
  ..$ predictions: num [1(1d)] 4.99
 $ :List of 1
  ..$ predictions: num [1(1d)] 4.23
< ... >

Compare this to base R, where you receive a named numeric vector:

> model <- lm(mpg ~ cyl + drat, data = mtcars)
>   predict(model) %>% str()
 Named num [1:32] 21.1 21.1 26 19.6 14.8 ...
 - attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
terrytangyuan commented 7 years ago

Yeah but the problem is that if it's custom estimator this would be really arbitrary since users can customize what's in prediction, usually a nested list with items like raw predicted classes and probabilities. This maybe related to the pending PR in this repo for print S3 method.

kevinushey commented 7 years ago

But we're not talking about custom estimators here -- the canned estimators we provide in this package should produce outputs that are easy to interact with.

For users who want to develop custom estimators, we should provide tooling that can help users ensure that numeric vectors are returned (when that is what is desired) rather than lists of lists.

terrytangyuan commented 7 years ago

I felt like it's better to keep them consistent but I am open to any proposal you come up with for this!

jjallaire commented 7 years ago

I agree w/ Kevin here. For canned estimators we should always produce/print something useful. For custom estimators I think users should provide an S3 class for their estimator (so the classes would be e.g. c("tf_custom_model", "tf_my_model") and then write a str method for it (and we can provide helper functions to make this easier).

On Tue, Jul 11, 2017 at 11:56 PM, Yuan (Terry) Tang < notifications@github.com> wrote:

I felt like it's better to keep them consistent but I am open to any proposal you come up with for this!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/rstudio/tfestimators/issues/49#issuecomment-314637723, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGXx--dPdGJ6FvgfPyNuCmRvh-zpFKtks5sNEP6gaJpZM4OU5EN .

terrytangyuan commented 7 years ago

I agree it's helpful but even for canned estimators, the prediction will look different when you pass different prediction_keys()to predict(), e.g. predict(..., predict_keys = prediction_keys()$PROBABILITIES) (all available keys can be found here or just print prediction_keys()). They maybe an nested or non-nested list.

So we'd likely have to maintain a map from keys to the extracting method. Also there will be at least one other loop that we need to do to iterate through predictions, which may not be ideal (we can have give an additional argument with some default value for this though).

terrytangyuan commented 7 years ago

Also you can pass multiple keys to that predict_keys arguments. The original Python API was designed in a way so that the performance is the best - the values for those keys can be extracted in each single iteration.

kevinushey commented 7 years ago

I don't think we want to model the TensorFlow Estimator.predict interface in the top-level method we implement for R's predict() function. The methods we provide for these generics should return data in a format that is easily used within R. This generally implies that results will be returned either as a plain old vector (when only one column is desired, and which is the most common case), or perhaps as a matrix or data.frame of values (e.g. when multiple categories are predicted).

Power-users who do indeed want the list-of-list representation are free to manually call the estimator$predict() method themselves, but at least for these top-level generics we should do a bit of extra work to produce data in the format R users generally expect.

jjallaire commented 7 years ago

Or maybe we could return the list of lists as an attribute of what we return from predict() (we are really trying to avoid having any $ calls in the API).

On Wed, Jul 12, 2017 at 11:52 AM, Kevin Ushey notifications@github.com wrote:

I don't think we want to model the TensorFlow Estimator.predict interface in the top-level method we implement for R's predict() function. The methods we provide for these generics should return data in a format that is easily used within R. This generally implies that results will be returned either as a plain old vector (when only one column is desired, and which is the most common case), or perhaps as a matrix or data.frame of values (e.g. when multiple categories are predicted).

Power-users who do indeed want the list-of-list representation are free to manually call the estimator$predict() method themselves, but at least for these top-level generics we should do a bit of extra work to produce data in the format R users generally expect.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rstudio/tfestimators/issues/49#issuecomment-314813503, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGXx7UPfHYwn9evXTtKM4ZnGb60--MAks5sNOvGgaJpZM4OU5EN .

kevinushey commented 7 years ago

It looks like the return type here has explicitly changed between tf.estimator.Estimator and tf.contrib.learn.Estimator.

The 'contrib.learn' version is documented thusly:

Returns:

A numpy array of predicted classes or regression values if the constructor's model_fn returns a Tensor for predictions or a dict of numpy arrays if model_fn returns a dict. Returns an iterable of predictions if as_iterable is True.

So the intention was (at least some time previously) for this to return a numpy array, or a generator which could be used to populate such an array. Yet the newer Estimator class eschews the as_iterable argument entirely, and seems to provide a generator that produces data in an awkward format:

Browse[2]> builtins$print(output) [{'predictions': array([ 1.37737417], dtype=float32)}, {'predictions': array([ 1.63807189], dtype=float32)}, {'predictions': array([ 1.71594262], dtype=float32)}, {'predictions': array([ 1.66177177], dtype=float32)}, {'predictions': array([ 1.37737417], dtype=float32)}, {'predictions': array([ 1.42138815], dtype=float32)}, {'predictions': array([ 1.37398851], dtype=float32)}, {'predictions': array([ 1.56020117], dtype=float32)}, {'predictions': array([ 1.76334214], dtype=float32)}, {'predictions': array([ 1.65500033], dtype=float32)}, {'predictions': array([ 1.76334214], dtype=float32)}, {'predictions': array([ 1.37398851], dtype=float32)}, {'predictions': array([ 1.32658899], dtype=float32)}, {'predictions': array([ 1.53311574], dtype=float32)}, {'predictions': array([ 2.00372577], dtype=float32)}, {'predictions': array([ 1.72609973], dtype=float32)}, {'predictions': array([ 1.40107405], dtype=float32)}, {'predictions': array([ 1.37398851], dtype=float32)}, {'predictions': array([ 1.26903236], dtype=float32)}, {'predictions': array([ 1.61098647], dtype=float32)}, {'predictions': array([ 1.5974437], dtype=float32)}, {'predictions': array([ 1.58390105], dtype=float32)}, {'predictions': array([ 1.66177177], dtype=float32)}, {'predictions': array([ 1.35028875], dtype=float32)}, {'predictions': array([ 1.42815948], dtype=float32)}, {'predictions': array([ 1.40107405], dtype=float32)}, {'predictions': array([ 1.66177177], dtype=float32)}, {'predictions': array([ 1.65500033], dtype=float32)}, {'predictions': array([ 1.58728671], dtype=float32)}, {'predictions': array([ 1.71594262], dtype=float32)}, {'predictions': array([ 1.83444154], dtype=float32)}, {'predictions': array([ 1.26903236], dtype=float32)}]

That is, the generator literally returns a whole bunch of dictionaries of length-one arrays. This seems very inefficient to me -- we'd much rather have a dictionary with a single array of the required length.

@terrytangyuan, can you provide some context here? Is there something I'm missing that would allow us to just get the predicted values back as a single array?

kevinushey commented 7 years ago

Is it possible that the shape of the predicted values here is a consequence of the input function we're defining + using?

terrytangyuan commented 7 years ago

How did you get that? If you just run predict() you'll get list of lists. Those seem like raw Python prediction output .

terrytangyuan commented 7 years ago

Are you asking why it's designed this way in Python? In this simple case, dictionaries of length-one arrays do seem non-intuitive. However, it's the best we can do to capture the needs from Google internally so that users can construct different types of models.

To address your questions:

This might take a while to digest but please let me know if you have any questions!

terrytangyuan commented 7 years ago

This as_iterable is something that was in contrib.learn but I deliberately kept it here and made it work with tf.estimator so that users don't have to deal with dictionaries of arrays in Python but list of named lists in R.

kevinushey commented 7 years ago

Here's the full set of code I was running (I was explicitly trying to see what exactly we were receiving in Python, to see if we were just marshaling things back to R in a weird way):

devtools::load_all()

specs <- mtcars_regression_specs()

tf_model <- linear_regressor(feature_columns = specs$linear_feature_columns) %>%
  train(input_fn = specs$input_fn, steps = 2)

tf_coef <- coef(tf_model)

predictions <- tf_model$estimator$predict(
  input_fn = normalize_input_fn(tf_model, specs$input_fn)
)

builtins <- reticulate::import_builtins(convert = FALSE)
resolved <- builtins$list(predictions)
builtins$print(resolved)

I was surprised to see the returned value being a list of dictionaries of length-one arrays -- this seems both cumbersome to work with + a potential waste of memory, but obviously I say this as someone without any real experience in the TensorFlow universe. (Do you by any chance have any example TensorFlow applications that do something interesting with the generated predicted values? I'm curious to see what such things look like in the Python world)

I can see why it's nice to keep things uniform in general, but I think it's relatively well understood at least in the R universe that, when one fits a regression model (where the response is a continuous variable), then the predicted values should also just be a plain old array of values. I think, at least for the canned estimators we expose in this package that have well-understood R counterparts, that we should endeavor to have these models produce outputs familiar to R users when accessed through the R 'model' APIs (predict(), residuals(), and so on)

terrytangyuan commented 7 years ago

Yeah I agree it might be a waste of memory but it's the best we can do to support all those types of models. Well, if people have problem store just the predictions, then they would have more trouble with other parts of their analytics workflow. Predictions are usually the smallest already. Memory/storage is way cheaper than other computational resources nowadays.

People in the R universe have to always keep an open mind to new solutions and challenges (R people in the industries are much more open-minded actually). Otherwise they would really lose their competitive advantages. There are all sorts of things/tricks people do to make models more predictive as well as more interpretable, which involves to also output different meta data generated during prediction phase, not just the raw predictions (e.g. single continuous response variable).

jjallaire commented 7 years ago

I wonder if we do some patching of the results to preserve the original list but at the same time provide a field which has the expected plain old array of values.

I am sympathetic to both viewpoints here. In general, I don't think it's likely we are going to be able to "sell" TensorFlow to R users that are already perfectly happy using standard R statistical models on in-memory data. To some extent users need to know why they might want to use TensorFlow before we will get the time of day from them. If a user is happy with lm on a local data frame then as soon as you start talking to them about input functions, feature columns, etc. they'll be lost anyway!

So we shouldn't over-cater by making the more scalable TensorFlow environment appear to be as simple as base R.

All of that said, R users more than Python users expect a lot of of the handling of data to "just work" in a sensible way without extra effort so I think every time we see results or output that users are likely to scratch their head at (this is one of those instances) I think we need to ask ourselves how we can make it better.

On Thu, Jul 13, 2017 at 9:36 PM, Kevin Ushey notifications@github.com wrote:

Here's the full set of code I was running (I was explicitly trying to see what exactly we were receiving in Python, to see if we were just marshaling things back to R in a weird way):

devtools::load_all()

specs <- mtcars_regression_specs()

tf_model <- linear_regressor(feature_columns = specs$linear_feature_columns) %>% train(input_fn = specs$input_fn, steps = 2)

tf_coef <- coef(tf_model)

predictions <- tf_model$estimator$predict( input_fn = normalize_input_fn(tf_model, specs$input_fn) )

builtins <- reticulate::import_builtins(convert = FALSE) resolved <- builtins$list(predictions) builtins$print(resolved)

I was surprised to see the returned value being a list of dictionaries of length-one arrays -- this seems both cumbersome to work with + a potential waste of memory, but obviously I say this as someone without any real experience in the TensorFlow universe. (Do you by any chance have any example TensorFlow applications that do something interesting with the generated predicted values? I'm curious to see what such things look like in the Python world)

I can see why it's nice to keep things uniform in general, but I think it's relatively well understood at least in the R universe that, when one fits a regression model (where the response is a continuous variable), then the predicted values should also just be a plain old array of values.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rstudio/tfestimators/issues/49#issuecomment-315245593, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGXx-G8aB2Xcp1i3PoKio-wDdsdXa_Pks5sNsYZgaJpZM4OU5EN .

kevinushey commented 7 years ago

I've come around to the idea that we should avoid doing anything special here with the return value of predict() -- instead, we should just provide ample documentation / examples as to how one could convert the returned predicted values from lists to vectors as desired.

terrytangyuan commented 7 years ago

Okay but is there anything that makes you change your mind that we should be aware of?

kevinushey commented 7 years ago

Here's the main ideas I had in my head (some of which you already emphasized):

  1. Other APIs that accept predicted values from TensorFlow estimators will likely be expecting the values in the current format; deviating from that would be frustrating for users.

  2. Doing this for one canned estimator implicitly signs us up for doing this for the rest of canned estimators, which is a bit of an extra maintenance burden.

  3. It's quite simple to extract the predicted values into a single vector if desired; usually with something like vapply(predicted,[[, "key", numeric(1)) or purrr::map_dbl(predicted,[[, "key") so this is something we could just solve with good examples / documentation.

  4. It's easier to experiment with different canned estimators if the format of outputs is the same across the estimators.

jjallaire commented 7 years ago

If we don't want to change the return value of predict (which given the discussion in this thread is certainly a reasonable position) I think we need a suite of functions that can clean up predict output from canned estimators for easy inspection and visualization. Every internal user that has tried to use tfestimators runs into the issue of predict output immediately and ends up hand rolling a solution.

Alternatively, we could have predict(..., as_list = FALSE) and in the case of as_list = FALSE the default have a per-estimator-type transform/cleaner function that runs. That way the "out of the box" experience yields the kind of data structures R users expect and can easily visualization but at the same time we document that passing as_list = TRUE gives you the unfiltered output from predict.

@kevinushey @terrytangyuan @edgararuiz Could you imagine easily what as_list = FALSE might yield for some of the canned estimators?

terrytangyuan commented 7 years ago

Yeah I totally agree.

For example, the following (just fixed one thing at HEAD so please pull):

specs <- mtcars_classification_specs() # from testthat/helper-utils.R
  clf <-
    dnn_linear_combined_classifier(
      linear_feature_columns = specs$linear_feature_columns,
      dnn_feature_columns = specs$dnn_feature_columns,
      dnn_hidden_units = c(3L, 3L),
      dnn_optimizer = "Adagrad"
    ) %>% train(input_fn = specs$input_fn)
predictions <- predict(clf, input_fn = specs$input_fn, predict_keys = list("classes", "probabilities"))

would give you a list of items where each item looks like this:

[[32]]
[[32]]$probabilities
[1] 0.5061268 0.4938732

[[32]]$classes
[1] "0"

So it looks pretty easy to wrap the results from the existing canned estimators - linear and dnn.

What I am uncertain about is prediction output of the time series estimators (in contrib) that I am helping to move to core. But it's going to be a while until those are available. We'll worry about those later.

I also think that we should expose the wrapper function that cleans the predictions so users of custom estimator may want to use this sometimes as well if the format of their predictions is simple (guess 70% of the time since in most of the cases users will be only playing with different layers instead of the prediction format).

jjallaire commented 7 years ago

How about we add a simplify argument (sort of like the apply family of R functions) and have it default to "auto". We could always simplify things that are clearly simple vector data and do some other auto-simplification for canned estimators. If in the process of doing this we develop some functions of more general applicability we can export them.

jjallaire commented 7 years ago

Here is a more formal proposal:

@terrytangyuan Could you take a crack at this as a PR?

terrytangyuan commented 7 years ago

Thanks for the detailed proposal! Yes, I'll tackle this soon!

On Mon, Oct 9, 2017 at 8:58 AM JJ Allaire notifications@github.com wrote:

Here is a more formal proposal:

-

predict(..., simplify = TRUE) by default. simplify can be either TRUE (simplify if possible), FALSE (never simplify) or a custom function that takes a list and performs a simplification.

We implement simplifiers for all of the canned estimators for which this is practical

We provide a hook for custom estimators to provide a simplifier (perhaps this is attached to model_fn, perhaps via an S3 method)

We export any relevant helper functions used in simplification (we may or may not have good candidates for this, just something to consider).

@terrytangyuan https://github.com/terrytangyuan Could you take a crack at this as a PR?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/rstudio/tfestimators/issues/49#issuecomment-335166216, or mute the thread https://github.com/notifications/unsubscribe-auth/AEEnSq9FV7GNEEdignSX1W6O72f2F1Gpks5sqiZkgaJpZM4OU5EN .