Open vladimirsim opened 5 days ago
And here is the trained model. Sorry, I forgot to attach it to my original post! 27112024RFIndivLoansFincaJor.zip
I'm afraid that without the dataset we cannot do a deep dive into this problem. Can you change the values in the data so that they are unrecognisable and then attach it?
I do see that the error comes from randomForest:::predict.randomForest, particularly from the lines
if (!all(object$forest$ncat == cat.new))
    stop("Type of predictors in new data do not match that of the training data.")
which check whether the number of categories in each nominal predictor variable in the training set (object$forest$ncat) is equal to the number of categories in the same variable in the prediction data (cat.new).
Do you have categories of the nominal variables in the training data that do not occur in the prediction data, or vice versa?
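To illustrate how such a count comparison can fail, here is a minimal base R sketch. It does not call randomForest; count_categories is a hypothetical helper that only mimics the idea of counting categories per column, and using iris with the setosa rows dropped is an assumption for illustration.

```r
# Hypothetical helper (not part of randomForest): number of categories per
# column, counting unordered factors by their levels and everything else as 1.
count_categories <- function(df) {
  vapply(df, function(col) {
    if (is.factor(col) && !is.ordered(col)) nlevels(col) else 1L
  }, integer(1))
}

train <- iris

# Prediction data with one species missing; droplevels() removes the unused
# level, so Species ends up with 2 levels instead of 3.
newdata <- droplevels(iris[iris$Species != "setosa", ])

ncat_train <- count_categories(train)
ncat_new <- count_categories(newdata)

# TRUE means the per-column counts differ, which is the kind of condition
# that the check quoted above turns into an error.
mismatch <- !all(ncat_train == ncat_new)

print(ncat_train["Species"])
print(ncat_new["Species"])
print(mismatch)
```

Every value in newdata is a species the training data already contains; only the number of levels differs.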
Yes, I do. The training set is 'broader' than the prediction set. I have checked it variable by variable and made sure that all categories in the prediction set already appear in the training set. For example, for the nominal variable 'city', the training set might contain loans from cities A, B and C, while the prediction set might contain loans only from cities B and C. But I don't expect this to be a problem. I suppose a problem would arise if it were the other way around, that is, if I wanted a prediction for a category on which the model was not trained, right?
I wouldn’t expect this to be a problem either, but it is in the randomForest code ;) Could you verify whether this is the problem by adding some rows with those missing levels of the nominal variables to the prediction data set?
You were right, @koenderks! In the prediction set, I removed 2 nominal variables which had fewer categories than in the training set, and the prediction table loaded. The prediction module worked well and I was able to generate and export predictions to a CSV file. But I still believe that this is a bug. I came upon another failure while working with the prediction set: when I checked the Explain predictions checkbox, I got the following error: "no applicable method for 'predict' applied to an object of class 'randomForest'". Please see the screenshot.
I do not understand what this error is due to. I am attaching the log files, too.
I think we resolved this last bug that you report in https://github.com/jasp-stats/jaspMachineLearning/pull/393.
But I agree, the check in randomForest for equal levels seems a bit overkill to me. I guess we could fix it by manually assigning the factors in the prediction data the same levels as those in the training data, even though they do not exist in the prediction data. @vandenman Do you see any problems with this?
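The idea of assigning the training levels to the prediction factors could look roughly like the base R sketch below. This is not the code from any pull request; align_factor_levels is a hypothetical helper, and it assumes the training data (or at least its factor levels) is available next to the prediction data.

```r
train <- iris
newdata <- droplevels(iris[iris$Species != "setosa", ])

# Hypothetical helper: for every column that is a factor in `train`, rebuild
# the same column in `newdata` with the training levels, in the same order.
align_factor_levels <- function(newdata, train) {
  for (name in intersect(names(newdata), names(train))) {
    if (is.factor(train[[name]])) {
      newdata[[name]] <- factor(as.character(newdata[[name]]),
                                levels = levels(train[[name]]))
    }
  }
  newdata
}

aligned <- align_factor_levels(newdata, train)

print(levels(newdata$Species))  # levels before alignment
print(levels(aligned$Species))  # levels after alignment
```

One design consequence to keep in mind: a value in the prediction data that was never seen in training becomes NA after this step, so that case would still need its own check or error message.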
Here is a reproducible example:
Random forest regression model trained on the iris dataset to predict Sepal.Length based on Sepal.Width, Petal.Length, Petal.Width and Species: model.jaspML.zip
Prediction dataset where Species has only 2 factor levels: prediction_data.csv
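A prediction file like this one can be approximated in base R as sketched below. Which species is missing from the attached file is not stated, so dropping setosa is an assumption, and read.csv only stands in for however JASP actually loads the data.

```r
# Keep only two species; in memory the factor still carries all 3 levels.
newdata <- iris[iris$Species != "setosa", ]
print(nlevels(newdata$Species))

# Round-trip through a CSV file: the file stores only the values, so the
# unused level cannot be recovered when the data is read back in.
path <- tempfile(fileext = ".csv")
write.csv(newdata, path, row.names = FALSE)
reloaded <- read.csv(path, stringsAsFactors = TRUE)

print(levels(reloaded$Species))
```

The levels are rebuilt from the values that occur in the file, which is why a prediction set stored as CSV can end up with fewer levels than the training set even when both came from the same worksheet.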
This pull request should fix the issue! I confirmed that the explain predictions table also works in the latest version of JASP (0.19.2, coming out soon).
JASP Version
0.19.1
Commit ID
84d54b934fa27731bb9eec44a4aa5f7ab0744dfd
JASP Module
Machine Learning
What analysis are you seeing the problem on?
Machine Learning > Prediction
What OS are you seeing the problem on?
Windows 11
Bug Description
This bug report is probably related to bug report #2978, which I submitted 3 weeks ago and which is now closed. I am using JASP 0.19.2.0, which is a nightly build. I open a training dataset (CSV file), train a Random forest model and save the trained model. Then I open a test dataset which has exactly the same format as the training dataset (I know this for sure because both were split from the same worksheet). Then I load the trained model and try to get predictions for the test dataset, but the Prediction table does not load. The error says that a predictor in the test dataset has a different type than in the training dataset. But this is simply not true.
Expected Behaviour
The Prediction table should load.
Steps to Reproduce
Open JASP and load the IndivLoansTrainingSample.csv file (cannot attach it because it is confidential)
Review the uploaded data file. JASP automatically assigns a type to each variable. It assigns Ordinal type to some variables, but it seems that Ordinal is not acceptable in the Random forest model, is it? I think this is a bug.
Optional: Change the type of some variables. For example, from Ordinal to Nominal or from Nominal to Scale. If JASP erroneously considers a Scale variable as Nominal, it will have a huge effect on the Random Forest model, will it not?
Open the Machine Learning module and train a Random forest model. (I am attaching the trained model)
Open another instance of JASP and load the IndivLoansTestSample.csv file (cannot attach it because it is confidential)
Review the uploaded data. Make sure that all variables in the Test dataset are exactly the same format as in the Training dataset. Change data types if needed.
Load the Machine Learning>Prediction>Prediction module
Load the trained model
Build the prediction table by picking the right predictors from the trained model. When you add all of the required predictors from the trained model (not all variables in the dataset were used to train the model), you will get an error saying that a variable has a different type: stop('Type of predictors in new data do not match that of the training data.')
The bug is still there even if I do not alter the type of any variable in the Training and Test sets. Even if I pick only several among those variables which were correctly recognized by JASP, I still get the same error, which is absurd because the predictors' type was automatically recognized by JASP and they were the same type, I checked it many times!
I have a suggestion: in the Machine Learning>Prediction>Prediction module, when I load the trained model, the prediction table tells me which predictors it expects me to load. Why don't you add the type of variable which JASP expects for each predictor? Let's say Length (Scale), TypeOfBondage (Nominal), etc. This will save a lot of nerves!
Also, if Ordinal variables are not acceptable in the Random Forest algorithm, why don't you prevent their usage? ...
Log (if any)
-------- Application Info -------- JASP Version: JASP 0.19.2 Build Branch: HEAD Build Date: Nov 26 2024 18:09:03 (Netherlands) Last Commit: 84d54b934fa27731bb9eec44a4aa5f7ab0744dfd
-------- Basic Info -------- Operating System: Windows 11 Version 23H2 Product Version: 11 Kernel Type: winnt Kernel Version: 10.0.22631 Architecture: x86_64 Install Path: D:/Program Files/JASP Platfotm Name: windows System Local: bg_BG
-------- Extra Info -------- Current code page Active code page: 437 Active code page: 65001
Host Name: SHOSHOCI OS Name: Microsoft Windows 11 Pro OS Version: 10.0.22631 N/A Build 22631 OS Manufacturer: Microsoft Corporation OS Configuration: Standalone Workstation OS Build Type: Multiprocessor Free Registered Owner: 359898893538 Registered Organization:
Product ID: 00330-52813-47920-AAOEM Original Install Date: 31.1.2023 г., 12:22:04 System Boot Time: 27.11.2024 г., 9:44:36 System Manufacturer: LENOVO System Model: 82LM System Type: x64-based PC Processor(s): 1 Processor(s) Installed. 01: AMD64 Family 23 Model 104 Stepping 1 AuthenticAMD ~2100 Mhz BIOS Version: LENOVO G5CN64WW(V2.10), 6.10.2022 г. Windows Directory: C:\Windows System Directory: C:\Windows\system32 Boot Device: \Device\HarddiskVolume1 System Locale: en-us;English (United States) Input Locale: en-us;English (United States) Time Zone: (UTC+02:00) Helsinki, Kyiv, Riga, Sofia, Tallinn, Vilnius Total Physical Memory: 15 706 MB Available Physical Memory: 8 855 MB Virtual Memory: Max Size: 16 730 MB Virtual Memory: Available: 7 726 MB Virtual Memory: In Use: 9 004 MB Page File Location(s): C:\pagefile.sys Domain: WORKGROUP Logon Server: \SHOSHOCI Hotfix(s): 5 Hotfix(s) Installed.
Network Card(s): 2 NIC(s) Installed. 01: Realtek 8822CE Wireless LAN 802.11ac PCI-E NIC Connection Name: Wi-Fi Status: Media disconnected [02]: Realtek USB GbE Family Controller Connection Name: Ethernet DHCP Enabled: Yes DHCP Server: 192.168.1.1 IP address(es)
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
JASP 2024-11-27 14_21_05 Desktop.log JASP 2024-11-27 14_21_05 Engine 1.log
More Debug Information
This is the error message which I get when the Prediction table fails to load:
This analysis terminated unexpectedly.
Error in randomForest:::predict.randomForest(model, newdata = dataset): Type of predictors in new data do not match that of the training data.
Stack trace
analysis(jaspResults = jaspResults, dataset = dataset, options = options)
.mlPredictionsTable(model, dataset, options, jaspResults, ready, position = 2)
.mlPredictionsState(model, dataset, options, jaspResults, ready)
createJaspState(.mlPredictionGetPredictions(model, dataset))
jaspStateR$new(object = object, dependencies = dependencies)
initialize(...)
.mlPredictionGetPredictions(model, dataset)
.mlPredictionGetPredictions.randomForest(model, dataset)
randomForest:::predict.randomForest(model, newdata = dataset)
stop('Type of predictors in new data do not match that of the training data.')
To receive assistance with this problem, please report the message above at: https://jasp-stats.org/bug-reports