Open carbonmetrics opened 1 year ago
As for the error, a fix has been submitted to the regtools repo: https://github.com/matloff/regtools/pull/37
As for data.table support inside qeML/regtools, I think it would be good to open a separate issue asking specifically for that. The code in this report works fine (although with an informative warning) after the proposed fix, so IMO this report could be closed once the patch is merged.
#...
z = qeKNN(w, "Weight")
#Warning message:
#In eval(tmp, parent.frame()) : "data" converted to data frame
predict(z, data.table(Height = 72, Age = 24))
# [,1]
#[1,] 184.28
Thanks!
I too am a data.table fan. Sorry for all the delays. I have more time for these things now, will try to get to it in the next few days.
Norm
On Tue, Nov 21, 2023 at 7:55 AM Jan Gorecki wrote:
As for the error, fix has been submitted in regtools repo. matloff/regtools#37 https://github.com/matloff/regtools/pull/37
As for the data.table support inside qeML/regtools. I think it is good to submit issue asking particularly for that. Code in this report should work fine (although with informative warning) after proposed fix, so IMO report could be closed after merging patch.
I'm not sure what fix you are referring to.
The basic problem at the qeML level (regtools is a different story) seems to be that data.tables and tibbles are converted to data frames in qeKNN but not in predict.qeKNN. That is easily fixed.
As I said, I am a big fan of data.table (and of its creator, Matt Dowle). Yes, data.table does have superior performance on big data. However, in something like qeKNN, the main performance issue is the calculation of nearest neighbors, which operates on matrix objects, so it's not clear that a change at the regtools level would help.
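To illustrate the point about matrix objects, here is a minimal base-R sketch of k-nearest-neighbor regression. It is not the regtools implementation; the function name `knnPredictOne` and its arguments are made up for illustration. Whatever class the input has, it is converted to a numeric matrix once, and the distance computation then runs on that matrix.

```r
# Hypothetical helper, not part of regtools or qeML.
# Predicts one new case as the mean response of its k nearest training rows.
knnPredictOne <- function(xTrain, yTrain, xNew, k = 3) {
  xTrain <- as.matrix(xTrain)               # data frame (or similar) -> numeric matrix
  xNew <- unlist(xNew, use.names = FALSE)   # accepts a vector or a one-row data frame
  # Euclidean distance from xNew to every training row
  d <- sqrt(rowSums(sweep(xTrain, 2, xNew)^2))
  nn <- order(d)[seq_len(k)]                # indices of the k closest rows
  mean(yTrain[nn])
}

# Usage with toy data
xTrain <- data.frame(a = c(0, 1, 10), b = c(0, 1, 10))
yTrain <- c(1, 3, 100)
knnPredictOne(xTrain, yTrain, c(0.4, 0.4), k = 2)
```

In a sketch like this the input class matters only at the `as.matrix()` step, which is the sense in which the container type has little effect on the cost of the neighbor search.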
Please let me know your thoughts on this.
My change in regtools is about resolving a user error, not about using data.table internally. That would be a bigger change, since many places would have to be edited.
My tentative plan is to put in checks in all the predict() functions in qeML (not just predict.qeKNN). Let me know soon if you see any problem with this. Thanks very much for your very valuable feedback.
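A check of this kind might look roughly like the sketch below. It is only an illustration under stated assumptions: `toPlainDF` and the toy class `toyFit` are hypothetical names, not qeML code, and the warning text merely imitates the one qeKNN prints when fitting.

```r
# Hypothetical helper, not part of qeML: coerce data.table/tibble-like input
# to a plain data frame, warning the user when a conversion happens.
toPlainDF <- function(x, argName = "newx") {
  if (!is.data.frame(x)) stop(sprintf('"%s" must be a data frame', argName))
  if (!identical(class(x), "data.frame")) {
    warning(sprintf('"%s" converted to data frame', argName), call. = FALSE)
    x <- as.data.frame(x)   # drops the extra classes that precede "data.frame"
  }
  x
}

# Toy S3 predict method (stand-in for predict.qeKNN and friends) that
# applies the same conversion at prediction time as at fitting time.
predict.toyFit <- function(object, newx, ...) {
  newx <- toPlainDF(newx)
  as.matrix(newx[, object$features, drop = FALSE]) %*% object$coef
}

# Usage: a data frame carrying an extra class, as data.tables and tibbles do
fit <- structure(list(features = c("Height", "Age"), coef = c(2, 1)),
                 class = "toyFit")
nd <- data.frame(Height = 72, Age = 24)
class(nd) <- c("fancy_df", "data.frame")
predict(fit, nd)   # warns, then predicts from the plain data frame
```

Calling one shared helper from every predict method would keep the fit-time and predict-time behavior consistent.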
@matloff yes, that sounds even better. As I don't know the codebase of your project, I went straight to where the error traceback led me and fixed only the single error reported here.
data.table is much faster than the tidyverse, its code is less verbose, its API is stable, and your environment is not flooded with function names and dependencies. I therefore restrict my use of everything tidyverse, even though I find many of its functions useful. So even if the gains from data.table were small in certain situations, I still would have no data.frames, tibbles or whatnot to work with, simply because I don't use them. I think that is the natural state of things for data.table users.
Sorry for the off-topic... In case you missed it, there is a data.table users survey open until 1 December. Please consider taking part: https://github.com/Rdatatable/data.table/issues/5704
Thanks re the data.table survey, just posted to my Twitter/X account and will do so on my R/stat blog (https://matloff.wordpress.com/) as well.
Again, I am a huge supporter of data.table and its creator, Matt Dowle, and am a big critic of the Tidyverse (https://github.com/matloff/TidyverseSkeptic). In developing qeML, though, I needed to aim for the "lowest common denominator," i.e. data frames. I also needed this to interface with the packages that qeML makes use of. My thinking is that users of data.tables or tibbles would not be much burdened by converting back when qeML returns a data frame. If you do feel it is a burden, my apologies.
For the future, though, an intriguing idea would be to make available an R environment variable that records whether the user is coming from data.table, Tidy or R. It would play the same role as the current R.version variable recording the ambient OS being used. So, before returning a data.frame, qeML functions would check this environment variable and do the appropriate conversion if needed.
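One way such a mechanism could be prototyped today is with R's `options()`, used here as a stand-in for the proposed environment variable. This is only a sketch: the option name `qeML.dfFlavor`, its values, and the function `returnInUserFlavor` are hypothetical, and nothing like them exists in qeML or in base R.

```r
# Hypothetical sketch: "qeML.dfFlavor" is an invented option name.
# A user would declare their preferred flavor once per session, e.g.
#   options(qeML.dfFlavor = "data.table")
returnInUserFlavor <- function(df) {
  flavor <- getOption("qeML.dfFlavor", default = "R")
  # Convert only if the corresponding package is installed;
  # otherwise fall through and return the plain data frame.
  if (flavor == "data.table" && requireNamespace("data.table", quietly = TRUE))
    return(data.table::as.data.table(df))
  if (flavor == "Tidy" && requireNamespace("tibble", quietly = TRUE))
    return(tibble::as_tibble(df))
  df
}

# Usage: with no option set, the result stays a plain data frame
out <- returnInUserFlavor(data.frame(x = 1:2))
class(out)
```

A package function would call this helper just before returning, so the conversion lives in one place rather than in every function.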
Great book! Working with data.table does not always work:
In view of the much better performance of data.table on larger datasets I'd rather avoid data.frames and tibbles.