mlr-org / mlr3gallery

Case studies using mlr3
https://mlr3gallery.mlr-org.com
21 stars 9 forks source link

draft for date features tutorial #8

Closed sumny closed 4 years ago

sumny commented 4 years ago

This tutorial shows how to do some simple feature engineering using the PipeOpDateFeatures pipeline. Happy to get some feedback!

pfistfl commented 4 years ago

Thank you very much, I like it a lot. Added some minor hints and improvements, nothing drastic.

sumny commented 4 years ago

@pfistfl @jakob-r Thank you very much for your review! I agree with all your suggestions and made the appropriate changes in the second revision. After proofreading it again, I noticed that 2 * pi * x / max_x is somewhat incorrect. I guess it should be the supremum of x (or rather sup(x) + 1), because suppose you work with hours (0:23) then you would divide by 24 regardless of whether your vector x contains the hour 23. On the other hand I do not want to be that picky. Any suggestions here? Also I guess we should wait with merging until the PR of PipeOpDateFeatures is finally integrated into the master

pfistfl commented 4 years ago

do max_x +1 that seems good.

jakob-r commented 4 years ago

Using max looks like a bad idea anyhow. What if in the training data range(x) = c(0,14). Then 14h would be quite close to 0h, right? We should always divide by 24! Can you check what is done in the code?

sumny commented 4 years ago

Using max looks like a bad idea anyhow. What if in the training data range(x) = c(0,14). Then 14h would be quite close to 0h, right? We should always divide by 24! Can you check what is done in the code?

In the code the maximum is always "hard-coded", so (for hours) I divide by 24. That's why I thought sup(x) + 1 or something like this would be more precise.

jakob-r commented 4 years ago

Please, in the future: Don't use automatic line breaks. It makes it hard to read diffs. A line break for a new sentence is usually what we are using.

sumny commented 4 years ago

Please, in the future: Don't use automatic line breaks. It makes it hard to read diffs. A line break for a new sentence is usually what we are using.

I will keep this in mind!

jakob-r commented 4 years ago

Forgot, that we wanted to wait until https://github.com/mlr-org/mlr3pipelines/pull/308 is merged. Maybe you can add a note in the beginning that you need to install mlr-org/mlr3pipelines@pipeop_datefeatures or merge the PR quickly.