rstudio / tfestimators

R interface to TensorFlow Estimators
https://tensorflow.rstudio.com/tfestimators

Best practices for tfestimators #136

Open dfalbel opened 6 years ago

dfalbel commented 6 years ago

I'm working on an end-to-end example for tfestimators, but I'm still having trouble understanding the best practices.

Consider the Wide & Deep Example available here

We start by defining all categorical variables and their possible values, e.g.:

gender <- column_categorical_with_vocabulary_list(
  "gender", vocabulary_list = c("Female", "Male"))
education <- column_categorical_with_vocabulary_list(
  "education",
  vocabulary_list = c(
    "Bachelors", "HS-grad", "11th", "Masters", "9th",
    "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
    "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
    "Preschool", "12th"))
marital_status <- column_categorical_with_vocabulary_list(
  "marital_status",
  vocabulary_list = c(
    "Married-civ-spouse", "Divorced", "Married-spouse-absent",
    "Never-married", "Separated", "Married-AF-spouse", "Widowed"))
relationship <- column_categorical_with_vocabulary_list(
  "relationship",
  vocabulary_list = c(
    "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
    "Other-relative"))
workclass <- column_categorical_with_vocabulary_list(
  "workclass",
  vocabulary_list = c(
    "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
    "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"))

This is OK for small datasets, but for datasets with more variables it would take a lot of work. Of course, we can write some code to automate this, for example:

library(dplyr)
library(purrr)

cat_vocab <- train_data %>% 
  select(education, marital_status, relationship, workclass) %>%
  map2(names(.), ., ~column_categorical_with_vocabulary_list(.x, vocabulary_list = unique(.y)))

I think this will be very common, so we could add a function that does exactly this. We can do something similar for the other variables:

cat_hash <- train_data %>%
  select(occupation, native_country) %>%
  names() %>%
  map(~column_categorical_with_hash_bucket(.x, hash_bucket_size = 1000))

num <- train_data %>%
  select(age, education_num, capital_gain, capital_loss, hours_per_week) %>%
  names() %>%
  map(column_numeric)

columns <- c(cat_vocab, cat_hash, num)
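A helper along the lines proposed above could look like the following sketch. Both `build_vocabularies()` and `vocab_columns()` are hypothetical names, not part of tfestimators. The sketch assumes that factors should be converted to character and that `NA` should not become a vocabulary entry. The result is returned as an unnamed list because, as far as I understand, reticulate converts named R lists to Python dicts rather than lists.

```r
# Hypothetical helper (not part of tfestimators): derive one vocabulary per
# categorical column from the training data.
build_vocabularies <- function(data, columns) {
  vocab <- lapply(data[columns], function(x) {
    values <- unique(as.character(x))  # as.character() also handles factors
    sort(values[!is.na(values)])       # drop NA and fix the order
  })
  names(vocab) <- columns
  vocab
}

# Hypothetical wrapper: turn the vocabularies into an unnamed list of
# feature columns. Defined only; calling it requires a working tfestimators
# and TensorFlow installation.
vocab_columns <- function(vocab) {
  unname(Map(
    function(name, values) {
      tfestimators::column_categorical_with_vocabulary_list(
        name, vocabulary_list = values
      )
    },
    names(vocab), vocab
  ))
}

# Toy data standing in for train_data
toy <- data.frame(
  education = c("Bachelors", "HS-grad", "Bachelors", NA),
  workclass = factor(c("Private", "State-gov", "Private", "?")),
  stringsAsFactors = FALSE
)

vocab <- build_vocabularies(toy, c("education", "workclass"))
str(vocab)
```

With the real data, `vocab_columns(build_vocabularies(train_data, c("education", "marital_status", "relationship", "workclass")))` would stand in for the `map2()` call above.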

Now, suppose I want to train a logistic regression model. I would run:

model <- linear_classifier(columns)

# Build labels according to income bracket
train_data$income_bracket <- as.character(train_data$income_bracket)
test_data$income_bracket <- as.character(test_data$income_bracket)
train_data$label <- ifelse(train_data$income_bracket == ">50K", 1, 0)
test_data$label <- ifelse(test_data$income_bracket == ">50K", 1, 0)

constructed_input_fn <- function(dataset, epochs = 1) {
  input_fn(dataset, features = -label, response = label, num_epochs = epochs)
}

train_input_fn <- constructed_input_fn(train_data, 10)
eval_input_fn <- constructed_input_fn(test_data, 1)

train(model, train_input_fn)

This will return: [/] Training -- loss: 24.30, step: 2544

Here, I don't understand the loss: is this the final loss, or the loss of the last processed batch?
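The step count in that line can be reproduced with a little arithmetic, which at least pins down what a "step" is. The sketch below assumes 32,561 training rows (the size of the UCI adult.data file, not stated in this thread) and the default `batch_size` of 128 in `input_fn()`. Since the progress line is refreshed as training runs, the loss printed next to the step is most likely that of the most recent batch rather than an aggregate over the run, but that reading is an assumption.

```r
# Assumptions (not stated in the thread): 32,561 training rows and
# input_fn()'s default batch size of 128.
n_train    <- 32561
epochs     <- 10
batch_size <- 128

# One step consumes one batch, so the number of steps is the number of
# batches needed to see every row `epochs` times.
steps <- ceiling(n_train * epochs / batch_size)
steps  # 2544
```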

Then I use the evaluate() function to get model results on the test data.

res <- evaluate(model, input_fn = eval_input_fn)
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
[/] Evaluating -- loss: 24.10, step: 128

1) I couldn't find out why TensorFlow emits those warnings.
2) The printed loss doesn't seem to be the final loss, since running evaluate() again returns different results.
3) The results table below looks wrong. It seems that TensorFlow is misinterpreting the labels, since accuracy_baseline is 1 and label/mean is 0. I tried passing the labels as booleans, but got the same result.

# A tibble: 1 x 9
      loss accuracy_baseline global_step   auc `prediction/mean` `label/mean` average_loss auc_precision_recall  accuracy
     <dbl>             <dbl>       <dbl> <dbl>             <dbl>        <dbl>        <dbl>                <dbl>     <dbl>
1 185.0804                 1        2544     1         0.2481793            0     1.455088                    0 0.7742768
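One way to read `loss` against `average_loss` in this table: the numbers are consistent with `loss` being the total loss divided by the number of evaluation batches, and `average_loss` being the total loss divided by the number of examples. The check below assumes 16,281 test rows (the size of the UCI adult.test file, not stated in this thread); the other two inputs come from the output above.

```r
loss_per_batch <- 185.0804  # "loss" column in the table
eval_steps     <- 128       # step count from the evaluation progress line
n_test         <- 16281     # assumed: number of rows in the test set

# With a batch size of 128, this many rows need 128 batches
ceiling(n_test / 128)

# Recover the per-example loss from the per-batch loss
total_loss   <- loss_per_batch * eval_steps
average_loss <- total_loss / n_test
round(average_loss, 4)
```

The result agrees with the `average_loss` column to the precision shown, so the two columns appear to report the same quantity on different scales.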

That's it! Let me know if I'm doing something wrong!!
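One possible explanation for `label/mean` being 0, offered as a guess that nothing in this thread confirms: in the raw UCI adult.test file the income values carry a trailing period (`>50K.`), so an exact comparison with `">50K"` is never true on the test set. The sketch below uses toy vectors to show the effect and a hypothetical `to_label()` helper that normalizes both spellings.

```r
# Toy labels mimicking the two spellings; the trailing "." on the test
# labels is an assumption about the raw test file.
train_income <- c(" >50K", " <=50K", " <=50K")
test_income  <- c(" >50K.", " <=50K.", " >50K.")

# Exact comparison silently yields all zeros for the test labels
naive <- ifelse(test_income == ">50K", 1, 0)
naive

# Hypothetical helper: strip whitespace and a trailing period first
to_label <- function(x) {
  cleaned <- sub("\\.$", "", trimws(as.character(x)))
  as.integer(cleaned == ">50K")
}

to_label(train_income)
to_label(test_income)
```

A quick `table(test_data$income_bracket)` before building the labels would show whether this applies to the data used here.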

terrytangyuan commented 6 years ago

@dfalbel Have you tried pattern matching for feature columns?

dfalbel commented 6 years ago

Yes, but I couldn't figure out how to use it with categorical variables, since we need to specify a different vocabulary for each variable.

On Thu, Jan 11, 2018 at 17:01, Yuan (Terry) Tang notifications@github.com wrote:

@dfalbel https://github.com/dfalbel Have you tried pattern matching for feature columns https://tensorflow.rstudio.com/tfestimators/articles/feature_columns.html#pattern-matching ?


terrytangyuan commented 6 years ago

@dfalbel I am trying to understand what you cannot achieve using pattern matching. Could you elaborate a bit? As for your other questions, there might be something wrong in the Python API. Could you try running their official wide and deep example in Python to see if the results are similarly problematic?

dfalbel commented 6 years ago

@terrytangyuan suppose I have this df:

library(tfestimators)
library(tibble)  # provides tibble(), the successor to data_frame()

df <- tibble(
  x1_cat = sample(letters, 100, replace = TRUE),
  x2_cat = sample(LETTERS, 100, replace = TRUE),
  x3_cat = sample(c(letters, LETTERS), 100, replace = TRUE),
  x4_num = runif(100),
  x5_num = rnorm(100)
)

I can use pattern matching to create feature columns for numeric vars, e.g.:

cols <- with_columns(df, {
  feature_columns(
    column_numeric(ends_with("num"))
  )
})

But, for categorical columns I can't because I need to pass a different vocabulary for each column. For example:

cols <- with_columns(df, {
  feature_columns(
    column_numeric(ends_with("num")),
    column_categorical_with_vocabulary_list(ends_with("cat"))
  )
})
Error in py_resolve_dots(list(...)) : 
  argument "vocabulary_list" is missing, with no default

I'll run the wide & deep example from python as soon as possible too.

terrytangyuan commented 6 years ago

Oh, I see. There isn't an established best practice for that yet, so feel free to submit a PR with an example, e.g. using map2() and unique().
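For the `df` from the earlier comment, such an example might start from the sketch below. It selects columns by suffix in base R and computes one vocabulary per categorical column. `suffix_columns()` is a hypothetical name, not a tfestimators function, and it is only defined here, since calling it needs a working TensorFlow installation.

```r
set.seed(1)
df <- data.frame(
  x1_cat = sample(letters, 100, replace = TRUE),
  x2_cat = sample(LETTERS, 100, replace = TRUE),
  x4_num = runif(100),
  stringsAsFactors = FALSE
)

# Select column names by suffix, mirroring ends_with("cat") / ends_with("num")
cat_names <- grep("_cat$", names(df), value = TRUE)
num_names <- grep("_num$", names(df), value = TRUE)

# One vocabulary per categorical column
vocabularies <- lapply(df[cat_names], function(x) sort(unique(x)))

# Hypothetical wrapper: build the combined, unnamed list of feature columns
suffix_columns <- function(vocabularies, num_names) {
  cat_cols <- unname(Map(
    function(name, values) {
      tfestimators::column_categorical_with_vocabulary_list(
        name, vocabulary_list = values
      )
    },
    names(vocabularies), vocabularies
  ))
  num_cols <- lapply(num_names, tfestimators::column_numeric)
  c(cat_cols, num_cols)
}
```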

jjallaire commented 6 years ago

I am going to submit a CRAN update to tfestimators soon (next few days) and just wanted to check in here to make sure there aren't any changes/fixes needed as a result of this thread before I do that.

cj-wilson commented 6 years ago

Did any updated examples get put into a vignette somewhere? This is very similar to a problem I have, where I'm trying to convert a factor variable to one-hot encoded values, but the dnn_classifier accepts neither a plain indicator column nor a categorical_with_vocabulary_list column.
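A possible direction, sketched under assumptions rather than taken from a vignette: DNN estimators expect dense feature columns, so a categorical column is usually wrapped in `column_indicator()` (one-hot) or `column_embedding()` before being passed to `dnn_classifier()`. The wrapper name `onehot_column()` is hypothetical, and the factor is assumed to be converted to character in the data frame before it reaches `input_fn()`.

```r
# Toy factor standing in for a real column
species <- factor(c("setosa", "virginica", "setosa", "versicolor"))

# The factor levels can serve as the vocabulary
vocabulary <- levels(species)
vocabulary

# Hypothetical wrapper, not part of tfestimators. Defined only; calling it
# requires a working tfestimators and TensorFlow installation.
onehot_column <- function(name, vocabulary) {
  tfestimators::column_indicator(
    tfestimators::column_categorical_with_vocabulary_list(
      name, vocabulary_list = vocabulary
    )
  )
}
```

The resulting column would then go into the `feature_columns` argument of `dnn_classifier()`, e.g. `list(onehot_column("species", vocabulary))`.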