drag05 closed this issue 4 years ago.
Hi, thanks for the comment
Q1: I think you are actually right, and the way I presented this is incorrect. The correct way would be to compute the median only for the available training data. I will correct this!
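A minimal sketch of that correction in plain `data.table` (the toy data and the `train_idx` split are made up for illustration): compute the per-zipcode median on the training rows only, then join it onto all rows so no test-set prices leak into the encoding.

```r
library(data.table)

# Toy data and a hypothetical train/test split, purely for illustration.
dt <- data.table(
  price   = c(200, 250, 300, 400, 500, 450),
  zipcode = c("98178", "98178", "98178", "98125", "98125", "98125")
)
train_idx <- c(1, 2, 4, 5)   # rows treated as training data

# Medians computed on training rows only, so no test prices leak in.
med <- dt[train_idx, .(med_price = median(price)), by = zipcode]

# Left join the lookup table onto every row (train and test alike).
dt <- med[dt, on = "zipcode"]
```

Test rows never contribute to the medians, yet still receive the encoding through the join.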
Q2:
```r
task_train$data()[, med_price := median(price), by = 'zipcode']
```
does not work because the `data()` active binding does not reference a `data.table` object in memory, but rather points to a `DataBackend`. Thus in-place manipulation using `:=` will not work (the active binding returns a `data.table`, but only as a copy; the task's underlying data is not a `data.table`).
@pfistfl
Hello Florian,
Thank you for your reply!
Regarding:
A1: I am a bit confused by your answer, because I thought that the zip-code median computed exclusively on the training data is what you have in the example.
Consider the zip-code median on the test data (which could be the price itself in the case of leave-one-out cross-validation, or the grouped median price otherwise). The prediction error would not be as small, but the bias would improve, in my opinion.
As for the impact coding, I doubt that adding the median price to the list of predictors for `price` is valid statistical practice for improving model prediction, as it adds no new information, being just a lookup variable for house price, as you have shown with your geographic map.
With your permission, I would mention a good discussion about this technique and the pitfalls of using high degrees-of-freedom regressors posing as simple variables here.
A2: As a `data.table` fan, I think it is really unfortunate to give up `data.table`'s multiple advantages; I think operations within `data.table` (including joins) should be encapsulated within the R6 objects.
Re A1: I somewhat agree with the link you posted about impact encoding and "high degrees-of-freedom" regressors, though I am not sure if it is as important for cross-validation as it is for stacking. Cross-validation is currently only implemented for stacked learners and not for categorical encoding, although the latter might be implemented at a later point. An often-used technique on Kaggle is to add noise (c.f. this reddit post), which would be faster and easier to implement.
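For what it's worth, a minimal sketch of the noise idea (toy data; the noise scale of 5% of the target's standard deviation is an arbitrary choice for illustration, not a recommendation):

```r
library(data.table)
set.seed(1)

dt <- data.table(
  price   = c(200, 250, 300, 400, 500),
  zipcode = c("A", "A", "A", "B", "B")
)

# y-aware encoding: per-group median of the target ...
dt[, med_price := median(price), by = zipcode]

# ... jittered with Gaussian noise so the encoding is no longer an
# exact lookup table for the target.
dt[, med_price_noisy := med_price + rnorm(.N, sd = 0.05 * sd(price))]
```

The jitter weakens the direct mapping between the encoded feature and `price`, which is the overfitting concern discussed above.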
Re A2:
This is true, but mlr3's current design does not allow this, as the data is abstracted away in the task. This abstraction allows us to include several data backends (databases, `data.table`, `matrix`, data generators).
For standard pre-processing/feature engineering, a good option would be to pre-process the data before putting it inside a task, or to use the `Task`'s `cbind`/`rbind` methods.
A benefit of the current abstraction is that I can `cbind` data from several data sources seamlessly; as a result, we can for example store engineered features in a `data.table` next to data that resides in a database.
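A minimal sketch of that `cbind` workflow, using the built-in `mtcars` toy task as a stand-in (the engineered column name `disp_x2` is made up for illustration):

```r
library(mlr3)
library(data.table)

tsk <- tsk("mtcars")   # built-in example regression task

# Engineer a feature outside the task, then attach it via $cbind().
# Without a row-id column, rows are assumed to align with $row_ids.
new_feat <- data.table(disp_x2 = tsk$data()$disp * 2)
tsk$cbind(new_feat)

"disp_x2" %in% tsk$feature_names
```

The backend is never modified in place; `$cbind()` layers the new columns on top of the existing backend.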
@pfistfl
Hello Florian,
Thank you for the reply!
Re: A2:
I wonder if `task_train` (and `task_test` as well) could be updated using `data.table` operations performed outside the task, on the train (test) subset, either automatically or by calling `TaskRegr$new()` again after the feature engineering is performed.
Once I find some time today I will test this and let you know.
UPDATE1: Now I see that apparently I am stealing your idea from your previous post. What I had in mind however (and didn't explicitly say!) referred to subsequent feature engineering operations that may occur after task instantiation.
Keeping the train/test datasets as data sources for the train_task/test_task backends would allow `data.table` operations to be carried out on the datasets, as you have mentioned. This may work well with automated feature engineering, for example through deep neural networks (`mlr3keras` with `TabNet`, which I have yet to try!).
UPDATE2: I have tested changes made through the initial task against changes resulting from feature engineering in the data source for the task backend, as follows:
```r
kc_data <- as.data.table(kc_housing)
tsk <- TaskRegr$new('sales', backend = kc_data, target = 'price')
tsk$data()[, med_price := median(price), by = zipcode]
```
As before, no `med_price` is present in the backend:
head(tsk$data())
price bathrooms bedrooms condition date floors grade lat long sqft_above sqft_basement
1: 221900 1.00 3 3 2014-10-13 1 7 47.5112 -122.257 1180 NA
2: 538000 2.25 3 3 2014-12-09 2 7 47.7210 -122.319 2170 400
3: 180000 1.00 2 3 2015-02-25 1 6 47.7379 -122.233 770 NA
4: 604000 3.00 4 5 2014-12-09 1 7 47.5208 -122.393 1050 910
5: 510000 2.00 3 3 2015-02-18 1 8 47.6168 -122.045 1680 NA
6: 1225000 4.50 4 3 2014-05-12 1 11 47.6561 -122.005 3890 1530
sqft_living sqft_living15 sqft_lot sqft_lot15 view waterfront yr_built yr_renovated zipcode
1: 1180 1340 5650 5650 0 FALSE 1955 NA 98178
2: 2570 1690 7242 7639 0 FALSE 1951 1991 98125
3: 770 2720 10000 8062 0 FALSE 1933 NA 98028
4: 1960 1360 5000 5000 0 FALSE 1965 NA 98136
5: 1680 1800 8080 7503 0 FALSE 1987 NA 98074
6: 5420 4760 101930 101930 0 FALSE 2001 NA 98053
Perform all the feature changes presented in the Example inside the data source itself, using `data.table` operations:
```r
kc_housing[, med_price := median(price), by = zipcode
           ][, c('renovated', 'has_basement', 'price') := .(as.numeric(is.na(yr_renovated)), as.numeric(is.na(sqft_basement)), price/1000)
           ][, c('yr_renovated', 'sqft_basement') := NULL]
```
As a result, `kc_housing` now includes `med_price`, `renovated` and `has_basement`, and excludes `yr_renovated` and `sqft_basement`:
head(kc_housing)
date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above
1: 2014-10-13 221.9 3 1.00 1180 5650 1 FALSE 0 3 7 1180
2: 2014-12-09 538.0 3 2.25 2570 7242 2 FALSE 0 3 7 2170
3: 2015-02-25 180.0 2 1.00 770 10000 1 FALSE 0 3 6 770
4: 2014-12-09 604.0 4 3.00 1960 5000 1 FALSE 0 5 7 1050
5: 2015-02-18 510.0 3 2.00 1680 8080 1 FALSE 0 3 8 1680
6: 2014-05-12 1225.0 4 4.50 5420 101930 1 FALSE 0 3 11 3890
yr_built zipcode lat long sqft_living15 sqft_lot15 med_price renovated has_basement
1: 1955 98178 47.5112 -122.257 1340 5650 278277 1 1
2: 1951 98125 47.7210 -122.319 1690 7639 425000 0 0
3: 1933 98028 47.7379 -122.233 2720 8062 445000 1 1
4: 1965 98136 47.5208 -122.393 1360 5000 489950 1 0
5: 1987 98074 47.6168 -122.045 1800 7503 642000 1 1
6: 2001 98053 47.6561 -122.005 4760 101930 635000 1 0
Re-creating `tsk` picks up the updated backend too:
```r
tsk <- TaskRegr$new('sales', kc_housing, 'price')
```
hence
head(tsk$data())
price bathrooms bedrooms condition date floors grade has_basement lat long med_price renovated
1: 221.9 1.00 3 3 2014-10-13 1 7 1 47.5112 -122.257 278277 1
2: 538.0 2.25 3 3 2014-12-09 2 7 0 47.7210 -122.319 425000 0
3: 180.0 1.00 2 3 2015-02-25 1 6 1 47.7379 -122.233 445000 1
4: 604.0 3.00 4 5 2014-12-09 1 7 0 47.5208 -122.393 489950 1
5: 510.0 2.00 3 3 2015-02-18 1 8 1 47.6168 -122.045 642000 1
6: 1225.0 4.50 4 3 2014-05-12 1 11 0 47.6561 -122.005 635000 1
sqft_above sqft_living sqft_living15 sqft_lot sqft_lot15 view waterfront yr_built zipcode
1: 1180 1180 1340 5650 5650 0 FALSE 1955 98178
2: 2170 2570 1690 7242 7639 0 FALSE 1951 98125
3: 770 770 2720 10000 8062 0 FALSE 1933 98028
4: 1050 1960 1360 5000 5000 0 FALSE 1965 98136
5: 1680 1680 1800 8080 7503 0 FALSE 1987 98074
6: 3890 5420 4760 101930 101930 0 FALSE 2001 98053
I wonder if there is a better, shorter update method for an `R6Class` than re-creating the task instance as I have shown above, or if the object should include a data-update method of the form
```r
update_task <- function(data) {self$data <- data}
```
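As a standalone illustration of that pattern (not part of the mlr3 API; the class and method names are made up), an R6 class could expose an explicit update method:

```r
library(R6)

# Hypothetical holder class with an explicit data-update method,
# instead of re-calling the constructor after feature engineering.
DataHolder <- R6Class("DataHolder",
  public = list(
    data = NULL,
    initialize = function(data) {
      self$data <- data
    },
    update_data = function(data) {
      self$data <- data
      invisible(self)   # allow method chaining
    }
  )
)

h <- DataHolder$new(data.frame(x = 1:3))
h$update_data(data.frame(x = 1:3, y = 4:6))
ncol(h$data)   # 2 columns after the update
```

mlr3 itself deliberately avoids such a setter and routes column updates through the backend abstraction (e.g. `$cbind`) instead.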
@pfistfl
Regarding the reddit link posted: all the observations seem valid to me as long as (a) there is a strong association between the high-cardinality feature to be encoded and the predicted variable (y), and (b) the encoding method is "y-aware".
When these two things hold, I would use the feature as a proxy only (c.f. the geographic map, which uses the zip code as a proxy for high/low house price, or, in other cases, the zip code as a proxy for income level, etc.).
In other words: if the encoding method is y-aware and the result (the encoded variable) is used as a regressor, there is a high chance of regressing y against itself, or a close image of itself, which indeed seems like cheating the model.
Actually, the creators of the `vtreat` package warn against using these variables as regressors, recommending them only for gaining modeling insight.
This situation could probably be ameliorated if the initial data exploration included some form of pattern (e.g. cluster) analysis, which would highlight this proxy property in advance of modeling and might influence the stratification strategy for the train/validate/test split.
@drag05 Thanks for the further comments. Note that the blog post is mainly there to illustrate the API and does not aim to showcase modeling strategies. As a result, I emphasize short, concise and simple solutions over better, more involved strategies.
Because we can simply cross-validate our predictions, we can check whether we actually overfit using this strategy, which does not seem to be the case.
I feel that the points you are raising have a lot of merit, and it would be quite interesting to look into them further (in a blog post?).
I will close the issue here, as the problem is solved; feel free to keep responding.
@pfistfl
Hello Florian,
Thank you for your time!
I was merely concerned with the mathematical validity of regressing the output against its own image, mainly from a theoretical point of view; cross-validation should probably take care of it. If not, then the production environment will.
What blog would you suggest?
I have two comments regarding this example:

1. `price` and its group median are correlated, while the `zipcode` remains present among the features (regressors). I would have thought that the median price becomes the new `target` in `task_train` and `task_test`, while the house price is removed from the feature list and the impact-coded zipcode replaces the high-cardinality `zipcode` factor.
Q: Would it be a huge mistake if the median price and/or the impact coding of the `zipcode` were carried out in the `kc_housing` dataset instead of at the task level? Information leak occurs anyway.

2. Regarding the `Task$new()` instance in general: the object `task_train$data()` is of the `data.table` class. Yet a short and perfectly legitimate operation, `task_train$data()[, med_price := median(price), by = 'zipcode']`, does not preserve the `med_price` variable, and instead the contorted workaround presented in the Example is needed.
Q: What invalidates the above operation in `mlr3`?

Please advise, thank you!