drag05 closed this issue 4 years ago.
Hi, thanks for the comment
Q1: I think you are actually right, and the way I presented this is incorrect. The correct way would be to compute the median only for the available training data. I will correct this!
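A minimal sketch of that correction in plain `data.table` (the toy data and the `train_idx` split are made up for illustration): compute the per-zipcode median on the training rows only, then join it onto all rows so no test-set prices leak into the encoding.

```r
library(data.table)

# Toy data and a hypothetical train/test split, purely for illustration.
dt <- data.table(
  price   = c(200, 250, 300, 400, 500, 450),
  zipcode = c("98178", "98178", "98178", "98125", "98125", "98125")
)
train_idx <- c(1, 2, 4, 5)   # rows treated as training data

# Medians computed on training rows only, so no test prices leak in.
med <- dt[train_idx, .(med_price = median(price)), by = zipcode]

# Left join the lookup table onto every row (train and test alike).
dt <- med[dt, on = "zipcode"]
```

Test rows never contribute to the medians, yet still receive the encoding through the join.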
Q2:
```r
task_train$data()[, med_price := median(price), by = 'zipcode']
```
does not work because the `data()` active binding does not reference a `data.table` object in memory, but rather points to a `DataBackend`. Thus in-place manipulation using `:=` will not work (the active binding returns a `data.table`, but only as a copy; the task's underlying data is not a `data.table`).
@pfistfl
Hello Florian,
Thank you for your reply!
Regarding:
A1: I am a bit confused by your answer, because I thought that the zip-code median computed exclusively on the training data is what you have in the example.
Consider the zip-code median on the test data (which could be the price itself in the case of leave-one-out cross-validation, or the grouped median price otherwise). The prediction error would not be as small, but the bias would improve, in my opinion.
As for the impact coding, I doubt that adding the median price to the list of predictors for `price` is valid statistical practice for improving model prediction, as it adds no new information, being just a lookup variable for house price, as you have shown with your geographic map.
With your permission, I would mention a good discussion about this technique and the pitfalls of using high degrees-of-freedom regressors posing as simple variables here.
A2: As a `data.table` fan, I think it is really unfortunate to give up `data.table`'s multiple advantages; I think operations within `data.table` (including joins) should be encapsulated within the R6 objects.
Re A1: I somewhat agree with the link you posted about impact encoding and "high degrees-of-freedom" regressors, though I am not sure if it is as important for cross-validation as it is for stacking. Cross-validation is currently only implemented for stacked learners and not for categorical encoding, although the latter might be implemented at a later point. An often-used technique on Kaggle is to add noise (c.f. this reddit post), which would be faster and easier to implement.
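For what it's worth, a minimal sketch of the noise idea (toy data; the noise scale of 5% of the target's standard deviation is an arbitrary choice for illustration, not a recommendation):

```r
library(data.table)
set.seed(1)

dt <- data.table(
  price   = c(200, 250, 300, 400, 500),
  zipcode = c("A", "A", "A", "B", "B")
)

# y-aware encoding: per-group median of the target ...
dt[, med_price := median(price), by = zipcode]

# ... jittered with Gaussian noise so the encoding is no longer an
# exact lookup table for the target.
dt[, med_price_noisy := med_price + rnorm(.N, sd = 0.05 * sd(price))]
```

The jitter weakens the direct mapping between the encoded feature and `price`, which is the overfitting concern discussed above.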
Re A2:
This is true, but mlr3's current design does not allow this, as the data is abstracted away in the task. This abstraction allows us to include several data backends (databases, `data.table`, `matrix`, data generators).
For standard pre-processing/feature engineering, a good option would be to pre-process the data before putting it inside a task, or to use the `Task`'s `cbind`/`rbind` methods.
A benefit of the current abstraction is that I can `cbind` data from several data sources seamlessly; as a result, we can for example store engineered features in a `data.table` next to data that resides in a database.
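A minimal sketch of that `cbind` workflow, using the built-in `mtcars` toy task as a stand-in (the engineered column name `disp_x2` is made up for illustration):

```r
library(mlr3)
library(data.table)

tsk <- tsk("mtcars")   # built-in example regression task

# Engineer a feature outside the task, then attach it via $cbind().
# Without a row-id column, rows are assumed to align with $row_ids.
new_feat <- data.table(disp_x2 = tsk$data()$disp * 2)
tsk$cbind(new_feat)

"disp_x2" %in% tsk$feature_names
```

The backend is never modified in place; `$cbind()` layers the new columns on top of the existing backend.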
@pfistfl
Hello Florian,
Thank you for the reply!
Re: A2:
I wonder if `task_train` (and `task_test` as well) could be updated using `data.table` operations performed outside the task, on the train (test) subset, either automatically or by calling `TaskRegr$new()` again after the feature engineering is performed.
Once I find some time today I will test this and let you know.
UPDATE1: Now I see that apparently I am stealing your idea from your previous post. What I had in mind however (and didn't explicitly say!) referred to subsequent feature engineering operations that may occur after task instantiation.
Keeping the train/test datasets as data sources for the train_task/test_task backends would allow `data.table` operations to be carried out on the datasets, as you have mentioned. This may work well with automated feature engineering, for example through deep neural networks (`mlr3keras` with `TabNet`, which I have yet to try!).
UPDATE2: I have tested changes made through the initial task against changes resulting from feature engineering in the data source for the task backend, as follows:
```r
kc_data <- as.data.table(kc_housing)
tsk <- TaskRegr$new('sales', backend = kc_data, target = 'price')
tsk$data()[, med_price := median(price), by = zipcode]
```
As before, no `med_price` is present in the backend:
head(tsk$data())
price bathrooms bedrooms condition date floors grade lat long sqft_above sqft_basement
1: 221900 1.00 3 3 2014-10-13 1 7 47.5112 -122.257 1180 NA
2: 538000 2.25 3 3 2014-12-09 2 7 47.7210 -122.319 2170 400
3: 180000 1.00 2 3 2015-02-25 1 6 47.7379 -122.233 770 NA
4: 604000 3.00 4 5 2014-12-09 1 7 47.5208 -122.393 1050 910
5: 510000 2.00 3 3 2015-02-18 1 8 47.6168 -122.045 1680 NA
6: 1225000 4.50 4 3 2014-05-12 1 11 47.6561 -122.005 3890 1530
sqft_living sqft_living15 sqft_lot sqft_lot15 view waterfront yr_built yr_renovated zipcode
1: 1180 1340 5650 5650 0 FALSE 1955 NA 98178
2: 2570 1690 7242 7639 0 FALSE 1951 1991 98125
3: 770 2720 10000 8062 0 FALSE 1933 NA 98028
4: 1960 1360 5000 5000 0 FALSE 1965 NA 98136
5: 1680 1800 8080 7503 0 FALSE 1987 NA 98074
6: 5420 4760 101930 101930 0 FALSE 2001 NA 98053
Perform all the feature changes presented in the Example inside the data source itself, using `data.table` operations:
```r
kc_housing[, med_price := median(price), by = zipcode
           ][, c('renovated', 'has_basement', 'price') := .(as.numeric(is.na(yr_renovated)), as.numeric(is.na(sqft_basement)), price/1000)
           ][, c('yr_renovated', 'sqft_basement') := NULL]
```
As a result, `kc_housing` now includes `med_price`, `renovated` and `has_basement`, and excludes `yr_renovated` and `sqft_basement`:
head(kc_housing)
date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above
1: 2014-10-13 221.9 3 1.00 1180 5650 1 FALSE 0 3 7 1180
2: 2014-12-09 538.0 3 2.25 2570 7242 2 FALSE 0 3 7 2170
3: 2015-02-25 180.0 2 1.00 770 10000 1 FALSE 0 3 6 770
4: 2014-12-09 604.0 4 3.00 1960 5000 1 FALSE 0 5 7 1050
5: 2015-02-18 510.0 3 2.00 1680 8080 1 FALSE 0 3 8 1680
6: 2014-05-12 1225.0 4 4.50 5420 101930 1 FALSE 0 3 11 3890
yr_built zipcode lat long sqft_living15 sqft_lot15 med_price renovated has_basement
1: 1955 98178 47.5112 -122.257 1340 5650 278277 1 1
2: 1951 98125 47.7210 -122.319 1690 7639 425000 0 0
3: 1933 98028 47.7379 -122.233 2720 8062 445000 1 1
4: 1965 98136 47.5208 -122.393 1360 5000 489950 1 0
5: 1987 98074 47.6168 -122.045 1800 7503 642000 1 1
6: 2001 98053 47.6561 -122.005 4760 101930 635000 1 0
Re-creating `tsk` picks up the updated backend too:
```r
tsk <- TaskRegr$new('sales', kc_housing, 'price')
```
hence
head(tsk$data())
price bathrooms bedrooms condition date floors grade has_basement lat long med_price renovated
1: 221.9 1.00 3 3 2014-10-13 1 7 1 47.5112 -122.257 278277 1
2: 538.0 2.25 3 3 2014-12-09 2 7 0 47.7210 -122.319 425000 0
3: 180.0 1.00 2 3 2015-02-25 1 6 1 47.7379 -122.233 445000 1
4: 604.0 3.00 4 5 2014-12-09 1 7 0 47.5208 -122.393 489950 1
5: 510.0 2.00 3 3 2015-02-18 1 8 1 47.6168 -122.045 642000 1
6: 1225.0 4.50 4 3 2014-05-12 1 11 0 47.6561 -122.005 635000 1
sqft_above sqft_living sqft_living15 sqft_lot sqft_lot15 view waterfront yr_built zipcode
1: 1180 1180 1340 5650 5650 0 FALSE 1955 98178
2: 2170 2570 1690 7242 7639 0 FALSE 1951 98125
3: 770 770 2720 10000 8062 0 FALSE 1933 98028
4: 1050 1960 1360 5000 5000 0 FALSE 1965 98136
5: 1680 1680 1800 8080 7503 0 FALSE 1987 98074
6: 3890 5420 4760 101930 101930 0 FALSE 2001 98053
I wonder if there is a better, shorter update method for an `R6Class` than re-creating the task instance as I have shown above, or if the object should include a data-update method of the form
```r
update_task <- function(data) {self$data <- data}
```
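As a standalone illustration of that pattern (not part of the mlr3 API; the class and method names are made up), an R6 class could expose an explicit update method:

```r
library(R6)

# Hypothetical holder class with an explicit data-update method,
# instead of re-calling the constructor after feature engineering.
DataHolder <- R6Class("DataHolder",
  public = list(
    data = NULL,
    initialize = function(data) {
      self$data <- data
    },
    update_data = function(data) {
      self$data <- data
      invisible(self)   # allow method chaining
    }
  )
)

h <- DataHolder$new(data.frame(x = 1:3))
h$update_data(data.frame(x = 1:3, y = 4:6))
ncol(h$data)   # 2 columns after the update
```

mlr3 itself deliberately avoids such a setter and routes column updates through the backend abstraction (e.g. `$cbind`) instead.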
@pfistfl
Regarding the reddit link posted: all the observations seem valid to me as long as (a) there is a strong association between the high-cardinality feature to be encoded and the predicted variable (y), and (b) the encoding method is "y-aware".
When these two things hold, I would use the feature as a proxy only (c.f. the geographic map, which uses the zip code as a proxy for high/low house price, or, in other cases, the zip code as a proxy for income level, etc.).
In other words: if the encoding method is y-aware and the result (the encoded variable) is used as a regressor, there is a high chance of regressing y against itself, or a close image of itself, which indeed seems like cheating the model.
Actually, the creators of the `vtreat` package warn against using these variables as regressors, recommending them only for gaining modeling insight.
This situation could probably be ameliorated if the initial data exploration included some form of pattern (e.g. cluster) analysis, which would highlight this proxy property in advance of modeling and might influence the stratification strategy for the train/validate/test split.
@drag05 Thanks for the further comments. Note that the blog post is mainly there to illustrate the API and does not aim to showcase modeling strategies. As a result, I emphasize short, concise and simple solutions over better, more involved strategies.
Because we can simply cross-validate our predictions, we can check whether we actually overfit using this strategy, which does not seem to be the case.
I feel that the points you are raising have a lot of merit, and it would be quite interesting to look into them further (in a blog post?).
I will close the issue here, as the problem is solved; feel free to keep responding.
@pfistfl
Hello Florian,
Thank you for your time!
I was merely concerned with the mathematical validity of regressing the output against its own image, mainly from a theoretical point of view; cross-validation should probably take care of it. If not, then the production environment will.
What blog would you suggest?
I have two comments regarding this example:

1. `price` and its group median are correlated, while the `zipcode` remains present among the features (regressors). I would have thought that the median price becomes the new `target` in `task_train` and `task_test`, while the house price is removed from the feature list and the impact-coded zipcode replaces the high-cardinality `zipcode` factor.
Q: Would it be a huge mistake if the median price and/or the impact coding of the `zipcode` were carried out in the `kc_housing` dataset instead of at the task level? Information leak occurs anyway.

2. Regarding the `Task$new()` instance in general: the object `task_train$data()` is of the `data.table` class. Yet a short and perfectly legitimate operation, `task_train$data()[, med_price := median(price), by = 'zipcode']`, does not preserve the `med_price` variable, and instead the contorted workaround presented in the Example is needed.
Q: What invalidates the above operation in `mlr3`?

Please advise, thank you!