shokru / mlfactor.github.io

Website dedicated to a book on machine learning for factor investing

Chapter 7.1.4 Categorical Labels #48

Closed: joelowj closed this 4 years ago

joelowj commented 4 years ago

Hi @shokru, thanks for this amazing piece of work. Just a question regarding the code. I am not an R person, so I could have missed a variable name change somewhere.

In Chapter 7.1.4 "We start with a simple tree and its interpretation. We use the package rpart and its plotting engine rpart.plot. The label is the future 1 month return and the features are all predictors available in the sample. The tree is trained on the full sample."

Judging by the code, shouldn't it be "R1M_Usd_C" instead of "R1M_Usd"? Thanks!


```r
library(rpart)                   # Tree fitting
library(rpart.plot)              # Tree plotting
formula <- paste("R1M_Usd ~", paste(features, collapse = " + ")) # Defines the model
formula <- as.formula(formula)                                   # Forcing formula object
fit_tree <- rpart(formula,
                  data = data_ml,     # Data source: full sample
                  minbucket = 3500,   # Min nb of obs required in each terminal node (leaf)
                  minsplit = 8000,    # Min nb of obs required to continue splitting
                  cp = 0.0001,        # Precision: smaller = more leaves
                  maxdepth = 3        # Maximum depth (i.e. tree levels)
                  )
rpart.plot(fit_tree)             # Plot the tree
```

shokru commented 4 years ago

Dear Joel, in this case we are performing a regression task, that is, we are trying to predict a number, namely the value of the future one-month return. This is why we use R1M_Usd. Of course, you can also use R1M_Usd_C, and in that case you can either run a regression tree (the binary values are again treated as numbers) or a classification tree (the values are treated as categories). You control this via the "method" argument: method = "anova" is for regression and method = "class" is for classification. If method is omitted, rpart guesses it from the type of the label (for example, a factor label gives "class" and a numeric label gives "anova").
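
To make the difference concrete, here is a minimal sketch on simulated data. It is not the book's data_ml sample: the data frame `toy`, the features `x1` and `x2`, and the helper column `R1M_Usd_C_num` are made up for illustration, and only the label names mirror the ones discussed above. It fits the same kind of tree with each value of `method` and then checks which method rpart recorded.

```r
library(rpart)

set.seed(42)                                      # Reproducible toy data (assumption: NOT the book's data_ml)
n <- 2000
toy <- data.frame(x1 = runif(n), x2 = runif(n))   # Two hypothetical features
toy$R1M_Usd       <- 0.05 * (toy$x1 > 0.5) + rnorm(n, sd = 0.05)  # Numeric label
toy$R1M_Usd_C     <- factor(toy$R1M_Usd > median(toy$R1M_Usd))    # Categorical label (factor)
toy$R1M_Usd_C_num <- as.numeric(toy$R1M_Usd_C == "TRUE")          # Same label coded as 0/1 numbers

fit_reg  <- rpart(R1M_Usd ~ x1 + x2, data = toy, method = "anova")    # Regression tree
fit_cls  <- rpart(R1M_Usd_C ~ x1 + x2, data = toy, method = "class")  # Classification tree
fit_auto <- rpart(R1M_Usd_C ~ x1 + x2, data = toy)                    # method omitted, factor label
fit_num  <- rpart(R1M_Usd_C_num ~ x1 + x2, data = toy)                # method omitted, numeric 0/1 label

fit_reg$method    # "anova"
fit_cls$method    # "class"
fit_auto$method   # "class": guessed from the factor label
fit_num$method    # "anova": the 0/1 values are treated as numbers
```

With `predict()`, the regression fits return numbers, while `predict(fit_cls, toy, type = "class")` returns predicted categories.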

joelowj commented 4 years ago

@shokru that certainly clarifies my doubt. I had mistaken rpart for a strictly discrete predictor. Thank you and have a good day!