matloff / qeML

40 stars 14 forks source link

Working with data.table #6

Open carbonmetrics opened 10 months ago

carbonmetrics commented 10 months ago

Great book! Working with data.table does not always work:

pacman::p_load(data.table, regtools, qeML)

data(day1)
setDT(day1)

# page 17 , 1
data(mlb)
setDT(mlb)
w = mlb[, .(Height, Weight, Age)]
z = qeKNN(w, "Weight")
# Warning message:
#   In eval(tmp, parent.frame()) : "data" converted to data frame
predict(z, data.table(Height = 72, Age = 24)) # does not work
# Error in `[.data.table`(dfr, , i) : 
#   j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.

# page 17 , 2
y = mlb[, 3:6]
knnout1 = qeKNN(y, "Weight", k = 25)
# Warning message:
#   In eval(tmp, parent.frame()) : "data" converted to data frame
predict(knnout1, data.table(Height = 72, Age = 24, Position = "Catcher")) # works!

In view of the much better performance of data.table on larger datasets I'd rather avoid data.frames and tibbles.

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] qeML_1.1          tufte_0.13        rmarkdown_2.25    regtools_1.7.0    gtools_3.9.4      FNN_1.1.3.2       data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11         compiler_4.2.1      partools_1.1.6      iterators_1.0.14    tools_4.2.1         rpart_4.1.21        digest_0.6.33       evaluate_0.22       lattice_0.22-5      rlang_1.1.1         Matrix_1.5-4.1      foreach_1.5.2      
[13] cli_3.6.1           mlapi_0.1.1         rstudioapi_0.15.0   parallel_4.2.1      RhpcBLASctl_0.23-42 gbm_2.1.8.1         mvtnorm_1.2-3       xfun_0.40           fastmap_1.1.1       knitr_1.45          text2vec_0.6.3      pdist_1.2.1        
[25] glmnet_4.1-8        grid_4.2.1          rje_1.12.1          grf_2.3.1           R6_2.5.1            FOCI_0.1.3          survival_3.5-7      pacman_0.5.1        carData_3.0-5       lgr_0.4.4           car_3.1-2           codetools_0.2-19   
[37] htmltools_0.5.6.1   MASS_7.3-60         splines_4.2.1       rpart.plot_3.1.1    abind_1.4-5         float_0.3-1         shape_1.4.6         rsparse_0.5.1       sandwich_3.0-2      crayon_1.5.2        zoo_1.8-12     
jangorecki commented 10 months ago

Error comes from https://github.com/matloff/regtools/blob/4b58fc2d636b664b120434dc27aff22511473ed4/R/Misc.R#L129

jangorecki commented 10 months ago

As for the error, fix has been submitted in regtools repo. https://github.com/matloff/regtools/pull/37

As for the data.table support inside qeML/regtools. I think it is good to submit issue asking particularly for that. Code in this report works fine (although with informative warning) after proposed fix, so IMO report could be closed after merging patch.

#...
z = qeKNN(w, "Weight")
#Warning message:
#In eval(tmp, parent.frame()) : "data" converted to data frame
predict(z, data.table(Height = 72, Age = 24))
#       [,1]
#[1,] 184.28
matloff commented 10 months ago

Thanks!

I too am a data.table fan. Sorry for all the delays. I have more time for these things now, will try to get to it in the next few days.

Norm

On Tue, Nov 21, 2023 at 7:55 AM Jan Gorecki @.***> wrote:

As for the error, fix has been submitted in regtools repo. matloff/regtools#37 https://github.com/matloff/regtools/pull/37

As for the data.table support inside qeML/regtools. I think it is good to submit issue asking particularly for that. Code in this report should work fine (although with informative warning) after proposed fix, so IMO report could be closed after merging patch.

— Reply to this email directly, view it on GitHub https://github.com/matloff/qeML/issues/6#issuecomment-1821195811, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZ34ZKGMFPQLOY7XDQSNX3YFTFGZAVCNFSM6AAAAAA65PXDPCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRRGE4TKOBRGE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

matloff commented 9 months ago

I'm not sure what fix you are referring to.

The basic problem on the qeML level (regtools is a different story) seems to be that data.table's and tibbles are converted to data frames in qeKNN but not in predict.qeKNN. That is easily fixed.

As I said, I am a big fan of data.table (and of its creator, Matt Dowle). Yes, data.table does have superior performance on big data. However, in something like qeKNN, the main performance issue is the calculation of nearest neighbors, which operates on matrix objects, so it's not clear that a change at the regtools level would help.

Please let me know your thoughts on this.

jangorecki commented 9 months ago

My change in regtools is about resolving user error, not about using data.table inside which would be a bigger change, many places would have to be edited.

matloff commented 9 months ago

My tentative plan is to put in checks in all the predict() functions in qeML (not just predict.qeKNN). Let me know soon if you see any problem with this. Thanks very much for your very valuable feedback.

jangorecki commented 9 months ago

@matloff yes, that sounds even better. As I don't know codebase in your project I went straight away where error traceback leaded me and fixed only that single error reported here.

carbonmetrics commented 9 months ago

I'm not sure what fix you are referring to.

The basic problem on the qeML level (regtools is a different story) seems to be that data.table's and tibbles are converted to data frames in qeKNN but not in predict.qeKNN. That is easily fixed.

As I said, I am a big fan of data.table (and of its creator, Matt Dowle). Yes, data.table does have superior performance on big data. However, in something like qeKNN, the main performance issue is the calculation of nearest neighbors, which operates on matrix objects, so it's not clear that a change at the regtools level would help.

Please let me know your thoughts on this.

data.table is much faster than the tidyverse, the code is less verbose, the api is stable, and your environment is not flooded with function names and dependencies. I therefore apply restrictions on everything tidyverse, even while i find many of the functions useful. So, even if the gains of data.table would be small in certain situations, I still would have no data.frames, tibbles or whatnot to work with, simply because I don't use them. Which I think is the natural state of things for data.table users.

jangorecki commented 9 months ago

Sorry for the offtopic... In case you missed it there is data.table users survey open till 1st December. Feel invited to be heard https://github.com/Rdatatable/data.table/issues/5704

matloff commented 9 months ago

Thanks re the data.table survey, just posted to my Twitter/X account and will do so on my R/stat blog (https://matloff.wordpress.com/) as well.

matloff commented 9 months ago

Again, I am a huge supporter of data.table and its creator, Matt Dowle, and am a big critic of the Tidyverse (https://github.com/matloff/TidyverseSkeptic). In developing qeML, though, I needed to aim for the "lowest common denominator," i.e. data frames. I also needed this to interface to the packages that qeML makes use of. My thinking is that users of data.table's or tibbles would not be much burdened to convert back when qeML returns a data frame. If you do feel it is a burden, my apologies.

For the future, though, an intriguing idea would be to make available an R environment variable that records whether the user is coming from data.table, Tidy or R. It would play the same role as the current R.version variable recording the ambient OS being used. So, before returning a data.frame, qeML functions would check this environment variable and do the appropriate conversion if needed.