suiji / Arborist

Scalable decision tree training and inference.

Rborist Error from doTryCatch() #55

Closed: suiji closed this issue 1 year ago

suiji commented 2 years ago

GitHub reports steady search activity for this and similar trapped-error messages. None of the tests we have on hand report premature exit. If someone has a reproducible test case, however, please help out by responding to this Issue or opening a new bug report.

Thank you.

rociogonzalezfdez85 commented 1 year ago

I have the same problem. Email me and I will send you an example.

rociogonzalezfdez85 commented 1 year ago

Hello:

I am sending you my code. I use 5-fold cross-validation. I am also sending the files used for the experiments (diabetes) and the R code (with some comments). The diabetes files must be placed in the directory called "DIRECTORY" so that the experiments can use them.

I also tried ntree = 100, thinLeaves=TRUE and autoCompress = 1.0, and I always get the same error:

Error in doTryCatch(return(expr), name, parentenv, handler) : Training, prediction data types do not match

In 2017 I used this method, with the same code, without any problem.

Best regards

On Wed, 9 Nov 2022 at 0:39, suiji (@.***) wrote:

GitHub does not appear to offer a way to reach you directly. Please send your example to @.***

Thank you for your help. It's very difficult to convince people to report bugs.

suiji commented 1 year ago

Will be happy to run your test code. Please feel free to send it when you are ready.

In the meantime, though, the error message you encountered is complaining about a mismatch between the data frames employed for training and prediction. The package's "deframer" phase repacks data frames into distinct blocks of values having the same data type (numeric or factor, for example). Right now, the deframer expects the predictors to appear in the same order, and have the same data type, in both frames. We are loosening this requirement by means of a "keyed" option, which will match predictors in the two frames in arbitrary order by keying off their names. This option did not make it into 0.3-2, which had to be posted on CRAN under deadline. We do intend to support "keyed" in the next release. Could this be the source of your problem?
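A mismatch of this kind can be looked for before calling predict() by comparing the two frames position by position. The following is only a sketch in base R: the helper name frameMismatches and the toy frames train and test are invented for illustration and are not part of Rborist.

```r
# Hypothetical helper (not part of Rborist): report the positions at which
# two data frames differ in column name or in column class.
frameMismatches <- function(trainFrame, testFrame) {
  # Compare only the positions present in both frames.
  idx <- seq_len(min(ncol(trainFrame), ncol(testFrame)))
  trainClass <- vapply(trainFrame[idx], function(col) class(col)[1], character(1))
  testClass <- vapply(testFrame[idx], function(col) class(col)[1], character(1))
  trainName <- names(trainFrame)[idx]
  testName <- names(testFrame)[idx]
  differs <- unname(trainClass != testClass) | (trainName != testName)
  data.frame(position = idx[differs],
             trainName = trainName[differs],
             trainClass = unname(trainClass)[differs],
             testName = testName[differs],
             testClass = unname(testClass)[differs],
             stringsAsFactors = FALSE)
}

# Toy frames: column 2 is a factor in training but character in the test frame.
train <- data.frame(x1 = c(1.5, 2.0, 3.1),
                    x2 = factor(c("a", "b", "a")))
test <- data.frame(x1 = c(0.7, 2.2),
                   x2 = c("b", "a"),
                   stringsAsFactors = FALSE)

mismatch <- frameMismatches(train, test)
print(mismatch)
```

An empty result means the shared positions agree in name and class; it says nothing about a difference in the number of columns, which can be checked separately with ncol().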

Regards, The maintainers.

suiji commented 1 year ago

No example has been received so far, but we're ready to help when it arrives.

Please note that setting autoCompress to 1.0 was a workaround for a problem appearing in version 0.2-4 and should no longer be relevant. Setting thinLeaves is strictly for reducing memory footprint, so it is also unlikely to apply.

rociogonzalezfdez85 commented 1 year ago

I sent it to you last week. I probably used an incorrect email address, sorry. I use 5-fold cross-validation.

suiji commented 1 year ago

Thank you. Your example reproduces the behavior you describe.

The error message is complaining that the training and prediction data frames do not match and, in this example, they do not. The training frame contains two predictor columns, while the prediction frame contains three. The example can easily be made to work by filtering out the third column ("Y") for prediction.

Traditionally, the package has offered only a positional scheme for reconciling data frames between, say, training and prediction. That is, the columns in the training frame were assumed to match those in the prediction frame. Some checking was performed to ensure, at the very least, that data types agree at their respective positions across the two frames. With release 0.3 we are also checking that the two frames have the same number of predictors. If your example does not fail with earlier releases it is likely because we were not performing the additional check.

In addition to the positional scheme we are planning to introduce a "keyed" (or maybe "keyedFrame") option which will allow the column positions to vary between the two frames. In particular, there would be no problem with the training frame having fewer columns than the prediction frame, so long as the latter includes all columns present in the training frame - and that the respective types agree.
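Independently of any such option, a similar effect can be approximated in user code by selecting the prediction columns by name, in training order, before calling predict(). The base R sketch below illustrates the idea only; the helper alignToTraining and the toy frames are hypothetical and do not reflect how the package implements or will implement keying.

```r
# Hypothetical helper (not part of Rborist): subset and reorder the
# prediction frame so that its columns match the training predictors by name.
alignToTraining <- function(trainFrame, newFrame) {
  missingCols <- setdiff(names(trainFrame), names(newFrame))
  if (length(missingCols) > 0) {
    stop("Prediction frame lacks columns: ",
         paste(missingCols, collapse = ", "))
  }
  # Select by name; drop = FALSE keeps a data frame even for one column.
  aligned <- newFrame[, names(trainFrame), drop = FALSE]
  sameClass <- vapply(names(trainFrame), function(nm) {
    identical(class(trainFrame[[nm]]), class(aligned[[nm]]))
  }, logical(1))
  if (!all(sameClass)) {
    stop("Column types differ for: ",
         paste(names(trainFrame)[!sameClass], collapse = ", "))
  }
  aligned
}

# Toy frames: the prediction frame carries an extra column "Y" and lists
# the predictors in a different order. Values are made up.
trainX <- data.frame(age = c(50, 61, 47), bmi = c(31.2, 27.5, 29.9))
newData <- data.frame(Y = c(151, 75), bmi = c(26.4, 33.0), age = c(58, 39))

aligned <- alignToTraining(trainX, newData)
print(names(aligned))
```

The aligned frame can then be passed to predict() under the positional scheme, since its columns now appear in the training order with the extra column removed.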

rociogonzalezfdez85 commented 1 year ago

I am not an expert in R. How can I filter out the Y variable in my code, as you suggest, to fix the error? Thank you.

suiji commented 1 year ago

The easiest way to filter out a column is probably just to place a minus sign in front of its index. Rborist's predict() method also computes MSE as a side effect when passed a test vector. So you can probably save some work by applying the following codelet, which omits column 3 from the new data but passes it as a test vector:

yPrime <- predict(fitMulti, test[,-3], test[,3])
mse <- yPrime$mse
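For readers less familiar with R indexing, the column selection on its own can be tried on a small frame without fitting any model. This is a base R sketch with made-up values; the frame below merely stands in for the reporter's test data, assuming two predictors followed by the response "Y" in column 3.

```r
# Toy stand-in for the test frame: two predictors plus the response "Y"
# in column 3. The values are invented for illustration.
test <- data.frame(age = c(58, 39, 44),
                   bmi = c(26.4, 33.0, 30.1),
                   Y = c(151, 75, 141))

# Negative index: every column except the third.
predictors <- test[, -3]

# Equivalent selection by name, which keeps working if the column order changes.
predictorsByName <- test[, setdiff(names(test), "Y"), drop = FALSE]

# The response by itself, as a plain numeric vector.
response <- test[, 3]

print(names(predictors))
print(all.equal(predictors, predictorsByName))
```

Selecting by name with drop = FALSE is the more defensive form: a negative index that leaves a single column returns a bare vector rather than a data frame unless drop = FALSE is given.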

suiji commented 1 year ago

Closing this thread. Please feel free to reopen or begin a new thread.